HDFS --> reading data from HDFS --> processing --> storing
HDFS --> folders/files
REST API --> Representational State Transfer Application Programming Interface
URL --> e.g. https://www.google.com (works over http and https)
1. a remote server reachable over http/https
2. HTTP methods: GET, PUT, POST
Hadoop is running in a remote location
Hadoop --> URLs --> REST API
read and write data in HDFS --> NN, SNN, DN (NameNode, Secondary NameNode, DataNode)
NN --> an HTTP request to HDFS goes to the web server (WebHDFS) running on the NameNode, just like requesting a site such as jntuh.ac.in
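Every HDFS path is exposed through a URL of this general form (port 50070 assumes the Hadoop 2.x NameNode web port; Hadoop 3.x moved it to 9870):

http://<namenode-host>:50070/webhdfs/v1/<hdfs-path>?op=<OPERATION>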
curl --> command-line tool for connecting to a URL
WebHDFS --> secured with Kerberos; a username/password has to be configured
curl -i "http://ip-172-31-35-141.ec2.internal:50070/webhdfs/v1/user/cloudera/calllogdata?op=OPEN"
-L ==> follow redirects (OPEN redirects to a DataNode); curl's default HTTP method is GET
?op=LISTSTATUS ==> list a directory (GET)
?op=MKDIRS ==> create a directory; needs -X PUT
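Putting those operations together, the calls might look like this (the host, port, and paths reuse the example above; "newdir" is a hypothetical target directory, and the user.name query parameter assumes simple authentication, i.e. Kerberos switched off):

curl -i -L "http://ip-172-31-35-141.ec2.internal:50070/webhdfs/v1/user/cloudera/calllogdata?op=OPEN&user.name=cloudera"
curl -i "http://ip-172-31-35-141.ec2.internal:50070/webhdfs/v1/user/cloudera?op=LISTSTATUS&user.name=cloudera"
curl -i -X PUT "http://ip-172-31-35-141.ec2.internal:50070/webhdfs/v1/user/cloudera/newdir?op=MKDIRS&user.name=cloudera"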
import os
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()

# count the records in the local input file
with open("c:\\users\\sample.txt") as inputfile:
    recordsinfile = len(inputfile.readlines())
print("the no of records are {}".format(recordsinfile))
#if the input file is in hdfs
#use webhdfs to connect and read from hdfs
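# A minimal sketch of that WebHDFS read using the requests library imported
# above (an assumption: the host, port, path, and user.name reuse the curl
# example; requests follows the OPEN redirect to the DataNode automatically):
hdfs_url = ("http://ip-172-31-35-141.ec2.internal:50070/webhdfs/v1"
            "/user/cloudera/calllogdata?op=OPEN&user.name=cloudera")
resp = requests.get(hdfs_url)
hdfsrecords = len(resp.text.splitlines())
print("the no of records in hdfs are {}".format(hdfsrecords))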
# load the file as an RDD and count the records
r1 = spark.sparkContext.textFile("c://.......//sample.txt")
print(r1.collect())
noofRec = len(r1.collect())
print("the no of records are {}".format(noofRec))
if recordsinfile == noofRec:
    print("the no of records matched")
else:
    print("some issue while loading the data")
# keep only the records containing 'd' and write them out as a single file
r2 = r1.filter(lambda x: 'd' in x)
noofrecordswriting = r2.count()
r2.coalesce(1).saveAsTextFile("c://")  # saveAsTextFile creates a directory of part files
# read back the part file Spark wrote inside the output directory
with open("c://part-00000") as outputfile:
    recordswritten = len(outputfile.readlines())
if noofrecordswriting == recordswritten:
    print("the records are written successfully")
Testing framework in Python:
pytest --> unit testing
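A minimal pytest sketch for the record-count logic above (the function name, file name, and test data are illustrative assumptions, not part of the original notes):

# test_counts.py -- run with: pytest test_counts.py
def count_records(path):
    # count the lines in a text file, as the script above does
    with open(path) as f:
        return len(f.readlines())

def test_count_records(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture
    sample = tmp_path / "sample.txt"
    sample.write_text("a\nb\nc\n")
    assert count_records(str(sample)) == 3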