Reading Different File Types:
Unstructured File Format:
Input file:
file1.txt
spark is big big data tech
hadoop is also a bigdata tech
hdfs is a storage system
yarn is computing system
val rdd1=sc.textFile("file:///home/cloudera/file1")
rdd1.collect.foreach(println)
output:
spark is big big data tech
hadoop is also a bigdata tech
hdfs is a storage system
yarn is computing system
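Once the unstructured text is loaded into rdd1, it can be processed with ordinary RDD transformations. As a quick illustration, here is a classic word count (a sketch for the spark-shell session above, which provides sc):

```scala
// Word count over the unstructured file.
// Split each line on whitespace, map every word to (word, 1),
// then sum the counts per word.
val counts = rdd1.flatMap(line => line.split("\\s+"))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.collect.foreach(println)  // e.g. (big,2), since "big" occurs twice in line 1
```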
Structured File Formats:
-----------------------
Reading CSV files
Input file-->
file2
1,2,usa
2,3,uk
3,4,australia
4,5,japan
External libraries from Maven repositories --> Spark CSV and Spark XML
Launching the shell:
spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
code:
val df_csv=sqlContext.read.format("csv").load("file:///home/cloudera/file2")
df_csv.show
Reading JSON files
Inputfile-->jsonfile
{"name":"karnakar","course":"spark","country":"India"}
{"name":"Hina","country":"USA","course":"scala"}
{"course":"hadoop","name":"Hina","country":"India"}
code:
val df_json=sqlContext.read.format("json").load("file:///home/cloudera/jsonfile")
df_json.show
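Once loaded, the JSON DataFrame can also be queried with SQL by registering it as a temporary table (Spark 1.x API, matching the sqlContext used above; the table name trainees is just an illustration):

```scala
// Register the JSON DataFrame as a temp table and query it with SQL.
// "trainees" is an arbitrary name chosen for this example.
df_json.registerTempTable("trainees")
val df_india = sqlContext.sql("SELECT name, course FROM trainees WHERE country = 'India'")
df_india.show   // rows for karnakar (spark) and Hina (hadoop)
```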
Reading XML Files:
Inputfile-->
xmlfile
<trainee>
<name>Ramesh</name>
<course>spark</course>
</trainee>
<trainee>
<name>Dinesh</name>
<course>scala</course>
</trainee>
code:
val df_xml=sqlContext.read.format("xml").option("rowTag","trainee").load("file:///home/cloudera/xmlfile")
df_xml.show
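Note that, like CSV, XML support comes from an external Databricks package, so the shell has to be launched with spark-xml on the classpath (the exact version below is an assumption; pick a release published for your Scala version):

```scala
// Launch the shell with the spark-xml package, e.g.:
//   spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
// With rowTag set to "trainee", each <trainee> element becomes one row,
// and its child elements (name, course) become columns:
df_xml.select("name", "course").show
```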
Transformation and writing to save:
---------------------------------
rdd1.collect.foreach(println)
val rdd2=rdd1.filter(x=>x.contains("system"))
rdd2.collect.foreach(println)
val df_csv2=df_csv.where("c2='usa'")
df_csv2.show
Writing to Local:
df_csv2.write.format("json").save("file:///home/cloudera/csvoutput")
df_xml.write.format("csv").save("file:///home/cloudera/xmloutput")
rdd2.saveAsTextFile("file:///home/cloudera/rddoutput")
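Each save call above creates an output directory containing part files rather than a single file. A quick way to sanity-check the JSON written from df_csv2 is to read the directory back in the same session (a sketch):

```scala
// save() writes a directory of part-* files, not a single file.
// Loading the directory path reads all part files at once.
val df_check = sqlContext.read.format("json").load("file:///home/cloudera/csvoutput")
df_check.show
```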
Additional properties:
option("header","true")
option("inferSchema","true")
option("delimiter","\t")
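Putting these properties together, a CSV read for a tab-delimited file with a header row might look like this (a sketch; file3 is a hypothetical input):

```scala
// header: treat the first line as column names
// inferSchema: scan the data to guess column types instead of all-strings
// delimiter: split fields on tabs instead of the default comma
val df_tsv = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .load("file:///home/cloudera/file3")
df_tsv.show
```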