Sunday, March 10, 2019

Develop with the REST API in PySpark


HDFS workflow:
---------------
Reading data from HDFS
Processing it
Storing the results back
---------------------
HDFS --> data is organized as folders/files

REST API --> Representational State Transfer Application Programming Interface
URL --> e.g., https://www.google.com (served over http or https)
A REST API boils down to two things:
1. a remote server reachable over http/https
2. HTTP methods such as GET, PUT, and POST

Hadoop runs in a remote location, so HDFS exposes its operations as
URLs --> a REST API.

Reading and writing data in HDFS involves the NameNode (NN), Secondary
NameNode (SNN), and DataNodes (DN).

NN --> an HTTP request to HDFS goes to the NameNode, which runs an embedded
web server (WebHDFS), just like any ordinary site (e.g., jntuh.ac.in).

curl --> command-line tool for connecting to a URL
WebHDFS --> access can be secured with Kerberos or with a username/password;
either way it has to be configured on the cluster (see the sketch below).
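
If the cluster is Kerberized, a plain HTTP call is rejected until the client
authenticates via SPNEGO. A minimal sketch using the third-party
requests-kerberos package (assuming it is installed and a ticket was already
obtained with kinit; the host is the example used in the curl commands below):

import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# assumes: pip install requests-kerberos, and a valid ticket from `kinit`
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
resp = requests.get(
    "http://ip-172-31-35-141.ec2.internal:50070/webhdfs/v1/user/cloudera?op=LISTSTATUS",
    auth=auth)
print(resp.status_code)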

curl -i -L "http://ip-172-31-35-141.ec2.internal:50070/webhdfs/v1/user/cloudera/calllogdata?op=OPEN"
    ==> GET: read a file (-L follows the redirect to a DataNode)
?op=LISTSTATUS   ==> GET: list a directory
?op=MKDIRS       ==> PUT (curl -X PUT): create a directory
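
The same operations can be called from Python with the requests library. A
minimal sketch, assuming simple (non-Kerberos) auth where the user is passed
via WebHDFS's user.name query parameter; the host and paths are just the
examples above, and "newdir" is a hypothetical directory name:

import requests

namenode = "http://ip-172-31-35-141.ec2.internal:50070"

# list a directory (GET ?op=LISTSTATUS) -- returns file statuses as JSON
resp = requests.get(namenode + "/webhdfs/v1/user/cloudera",
                    params={"op": "LISTSTATUS", "user.name": "cloudera"})
print(resp.json())

# create a directory (PUT ?op=MKDIRS)
resp = requests.put(namenode + "/webhdfs/v1/user/cloudera/newdir",
                    params={"op": "MKDIRS", "user.name": "cloudera"})
print(resp.json())   # {"boolean": true} on success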

import os
import requests   # note: the package is "requests", not "request"
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
spark   # in a notebook, this displays the session details

with open("c:\\users\\sample.txt") as inputfile:
    recordsinfile = len(inputfile.readlines())
print("the no of records are {}".format(recordsinfile))

# if the input file is in HDFS,
# use WebHDFS to connect and read it from HDFS, as sketched below
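
A minimal sketch of that idea, reusing the example NameNode host and the
calllogdata path from the curl section and assuming simple auth:

resp = requests.get(
    "http://ip-172-31-35-141.ec2.internal:50070/webhdfs/v1/user/cloudera/calllogdata",
    params={"op": "OPEN", "user.name": "cloudera"})   # requests follows the DataNode redirect
recordsinfile = len(resp.text.splitlines())           # same count as the local-file version
print("the no of records are {}".format(recordsinfile))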

r1 = spark.sparkContext.textFile("c://.......//sample.txt")
print(r1.collect())
noofRec = r1.count()   # count() avoids collecting the RDD a second time
print("the no of records are {}".format(noofRec))
if recordsinfile==noofRec:
  print("the no of records matched")
else:
  print("some issue while loading the data")

r2 = r1.filter(lambda x: 'd' in x)   # keep only the lines containing 'd'
noofrecordswriting = r2.count()

# saveAsTextFile creates a directory, not a single file; "c://output" is an example path
r2.coalesce(1).saveAsTextFile("c://output")

# with coalesce(1) the output directory holds a single part file
with open("c://output//part-00000") as outputfile:
    recordswritten = len(outputfile.readlines())

if noofrecordswriting == recordswritten:
    print("the records are written successfully")

Testing frameworks in Python:
pytest
unittest (the built-in unit-testing module)
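
A minimal pytest sketch showing how the record-count check above could become
a unit test (count_records is a hypothetical helper introduced just for this
example):

# test_records.py -- run with: pytest test_records.py
def count_records(path):
    # hypothetical helper: count the lines in a local file
    with open(path) as f:
        return len(f.readlines())

def test_count_records(tmp_path):
    # tmp_path is a built-in pytest fixture providing a temporary directory
    sample = tmp_path / "sample.txt"
    sample.write_text("row1\nrow2\nrow3\n")
    assert count_records(str(sample)) == 3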

  
