Bigdata_Spark_Consultant: Spark Reading and Writing Different file types

Monday, February 11, 2019

Spark Reading and Writing Different file types

Reading Different file types:
Unstructured File Format:
Input file:
file1
spark is big big data tech
hadoop is also a bigdata tech
hdfs is a storage system
yarn is computing system

val rdd1=sc.textFile("file:///home/cloudera/file1")
rdd1.collect.foreach(println)

output:
spark is big big data tech
hadoop is also a bigdata tech
hdfs is a storage system
yarn is computing system

Structured File Formats:
------------------------
Reading CSV files
Input file:
file2
1,2,usa
2,3,uk
3,4,australia
4,5,japan

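External libraries from the Maven repository: Spark CSV and Spark XML
Launching the shell:
spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
To read XML as well, append the Spark XML coordinate to the same comma-separated --packages list.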

code:
val df_csv=sqlContext.read.format("csv").load("file:///home/cloudera/file2")
df_csv.show

Reading JSON files
Input file:
jsonfile
{"name":"karnakar","course":"spark","country":"India"}
{"name":"Hina","country":"USA","course":"scala"}
{"course":"hadoop","name":"Hina","country":"India"}

code:
val df_json=sqlContext.read.format("json").load("file:///home/cloudera/jsonfile")
df_json.show

Reading XML Files:
Input file:
xmlfile
<trainee>
   <name>Ramesh</name>
   <course>spark</course>
</trainee>
<trainee>
   <name>Dinesh</name>
   <course>scala</course>
</trainee>

code:
val df_xml=sqlContext.read.format("xml").option("rowTag","trainee").load("file:///home/cloudera/xmlfile")
df_xml.show

Transformations and writing the results:
----------------------------------------
rdd1.collect.foreach(println)
val rdd2=rdd1.filter(x=>x.contains("system"))
rdd2.collect.foreach(println)
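
To see what the filter keeps without starting Spark, the same predicate can be tried on an ordinary Scala collection. This is only a sketch: the `lines` Seq is a hypothetical stand-in for rdd1, holding the four lines of file1.

```scala
// Plain Scala stand-in for rdd1: the four lines of the input file as a local Seq.
val lines = Seq(
  "spark is big big data tech",
  "hadoop is also a bigdata tech",
  "hdfs is a storage system",
  "yarn is computing system"
)

// Same predicate as the rdd1.filter call above: keep lines that contain "system".
val kept = lines.filter(x => x.contains("system"))

kept.foreach(println)
// prints:
// hdfs is a storage system
// yarn is computing system
```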

val df_csv2=df_csv.where("c2='usa'")
df_csv2.show

Writing to Local:
df_csv2.write.format("json").save("file:///home/cloudera/csvoutput")
df_xml.write.format("csv").save("file:///home/cloudera/xmloutput")
rdd2.saveAsTextFile("file:///home/cloudera/rddoutput")
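
A DataFrame save fails when the target directory already exists, because the writer's default mode is to raise an error. Here is a minimal sketch of one way around that, assuming the same spark-shell session (sqlContext predefined, local master). The `sample` DataFrame and the temporary directory are hypothetical stand-ins so the snippet can run on its own.

```scala
import java.nio.file.Files
import sqlContext.implicits._

// Hypothetical stand-in for df_csv2, built in memory.
val sample = Seq((1, 2, "usa"), (2, 3, "uk")).toDF("C0", "C1", "C2")

// Hypothetical output location: a fresh temporary directory on the local disk.
val outDir = Files.createTempDirectory("csvoutput").toUri.toString

// The directory already exists, so the default mode would throw here.
// mode("overwrite") replaces whatever is at the path instead.
sample.write.format("json").mode("overwrite").save(outDir)

// Read the written part files back to check the result.
val roundTrip = sqlContext.read.format("json").load(outDir)
roundTrip.show
```

Use mode("append") instead when the existing output should be kept.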

Additional properties (chain these on the reader before load):
option("header","true")
option("inferSchema","true")
option("delimiter","\t")
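
As a rough sketch of how these options fit together, the snippet below reads a tab-separated file with a header row. It assumes the spark-shell session from earlier (sqlContext predefined, Spark CSV package loaded, local master). The sample data and the temporary file are hypothetical, written first so the example is self-contained.

```scala
import java.nio.file.Files

// Hypothetical sample data: tab-separated, first line holds the column names.
val tsvPath = Files.createTempFile("sample", ".tsv")
Files.write(tsvPath, "id\tscore\tcountry\n1\t2\tusa\n2\t3\tuk\n".getBytes("UTF-8"))

val df_tsv = sqlContext.read
  .format("csv")
  .option("header", "true")       // use the first line as column names
  .option("inferSchema", "true")  // infer column types from the data instead of reading all strings
  .option("delimiter", "\t")      // fields are separated by tabs, not commas
  .load(tsvPath.toUri.toString)

df_tsv.printSchema
df_tsv.show
```

With the header option set, columns can be referenced by name, for example df_tsv.where("country='usa'").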
