Monday, February 11, 2019

Spark Reading and Writing Different file types

Reading Different file types:
Unstructured File Format:
Input file:
file1.txt
spark is big big data tech
hadoop is also a bigdata tech
hdfs is a storage system
yarn is computing system

val rdd1=sc.textFile("file:///home/cloudera/file1.txt")
rdd1.collect.foreach(println)

output:
spark is big big data tech
hadoop is also a bigdata tech
hdfs is a storage system
yarn is computing system


Structured File Formats:
-----------------------
Reading CSV files
Input file-->
file2
1,2,usa
2,3,uk
3,4,australia
4,5,japan


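External libraries from the Maven repository -> Spark CSV and Spark XML
Launching the shell:
spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
To read XML as well, add the Spark XML coordinates to the same --packages list, separated by a comma.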

code:
val df_csv=sqlContext.read.format("csv").load("file:///home/cloudera/file2")
df_csv.show


Reading JSON files
Inputfile-->jsonfile
{"name":"karnakar","course":"spark","country":"India"}
{"name":"Hina","country":"USA","course":"scala"}
{"course":"hadoop","name":"Hina","country":"India"}

code:
val df_json=sqlContext.read.format("json").load("file:///home/cloudera/jsonfile")
df_json.show

Reading XML Files:
Inputfile-->
xmlfile
<trainee>
   <name>Ramesh</name>
   <course>spark</course>
</trainee>  
<trainee>
   <name>Dinesh</name>
   <course>scala</course>
</trainee> 

code:
val df_xml=sqlContext.read.format("xml").option("rowTag","trainee").load("file:///home/cloudera/xmlfile")
df_xml.show

Transformation and saving the results:
--------------------------------------
rdd1.collect.foreach(println)
val rdd2=rdd1.filter(x=>x.contains("system"))
rdd2.collect.foreach(println)
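
To see what the filter keeps without starting Spark, the same predicate can be tried on a plain Scala collection. This is only a local sketch: `lines` and `systemLines` are illustrative names, and the Seq is a stand-in for rdd1 holding the four lines of the input file shown earlier.

```scala
// Local stand-in for rdd1: the four sample lines as a plain Scala Seq (assumption: same content as the input file).
val lines = Seq(
  "spark is big big data tech",
  "hadoop is also a bigdata tech",
  "hdfs is a storage system",
  "yarn is computing system"
)

// Same predicate as the rdd1.filter call: keep only lines that contain the word "system".
val systemLines = lines.filter(x => x.contains("system"))

// Prints the two lines that mention "system" (the hdfs and yarn lines).
systemLines.foreach(println)
```

On an RDD the filter is lazy and only runs when an action such as collect is called; on a local Seq it runs immediately.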

val df_csv2=df_csv.where("c2='usa'")
df_csv2.show

Writing to Local:
df_csv2.write.format("json").save("file:///home/cloudera/csvoutput")
df_xml.write.format("csv").save("file:///home/cloudera/xmloutput")
rdd2.saveAsTextFile("file:///home/cloudera/rddoutput")


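Additional properties for the CSV reader:
option("header","true")        // treat the first line as column names
option("inferSchema","true")   // infer column types instead of reading every column as a string
option("delimiter","\t")       // field separator; tab here, comma by default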
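
A minimal sketch of how these options chain onto a read, meant to be pasted into the same spark-shell session. Assumptions: the sample tab-separated file and the names `tsvPath` and `df_tsv` are made up for illustration, the shell runs on the machine where the temp file is written, and the format is given by the fully qualified name from the Spark CSV README (the short name "csv" used above should behave the same wherever it resolves).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical sample: a small tab-separated file with a header row, written to a temp location.
val tsvPath = Files.createTempFile("file3", ".tsv")
Files.write(tsvPath, "id\tscore\tcountry\n1\t2\tusa\n2\t3\tuk\n".getBytes(StandardCharsets.UTF_8))

// Chain the options before load(); sqlContext is the one provided by spark-shell.
val df_tsv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line supplies the column names
  .option("inferSchema", "true")  // let the reader guess column types
  .option("delimiter", "\t")      // fields are separated by tabs
  .load(tsvPath.toUri.toString)   // toUri gives a file:/// style URI

df_tsv.printSchema
df_tsv.show
```

With the header option set, columns can be referenced by name (for example df_tsv.where("country='usa'")) instead of by position.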







 









