Resilience in PySpark
Resilience:
Lineage
DAG
Persistence
Resilience --> rolling back to the original state
r1=spark.sparkContext.textFile("")
r2=r1.t1()   r2--->r1, r1 will be released
r3=r2.t2()   r3--->r2, r2 will be released
r4=r3.t3()   r4--->r3, r3 will be released
r5=r4.t4()   r5--->r4, r4 will be released
r5.action()
r6=r3.t5()
r7=r6.t6()
r7.action2()
Spark, instead of executing a transformation, remembers the transformation.
Spark remembers the dependencies of the transformations in the form of a graph called the lineage graph.
r5-->r4-->r3-->r2-->r1-->textFile (tracing the dependency)
Execution starts from the source:
textFile->r1->r2->r3->r4->r5 (execution of the flow --> lineage)
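The lineage of any RDD can be inspected with toDebugString(). A minimal sketch (the data here is illustrative, not from the original example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('lineage-demo').getOrCreate()

r1 = spark.sparkContext.parallelize(["SUCCESS a", "FAILED b"])
r2 = r1.map(lambda x: x.upper())
r3 = r2.filter(lambda x: 'SUCCESS' in x)

print(r3.toDebugString().decode())   # prints the dependency chain back to the source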
RDD (transformation) --> RDD / DF
DF (transformation) --> DF / RDD
(an RDD can be converted to a DataFrame and back; a round trip is sketched below)
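Continuing the sketch above, a round trip between the two APIs (the column name 'line' is an illustrative choice):

df = r3.map(lambda x: (x,)).toDF(["line"])   # RDD --> DataFrame
df.show()
back = df.rdd                                # DataFrame --> RDD of Row objects
print(back.first())                          # e.g. Row(line='SUCCESS A')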
Where is this lineage stored, and for how long?
In memory, as long as the application is active.
The execution of the transformations is represented as a graph called a DAG (Directed Acyclic Graph).
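A shuffle transformation splits the DAG into stages, and the DAG of every job can be inspected in the Spark UI (http://localhost:4040 for a local application). A minimal word-count sketch with illustrative data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('dag-demo').getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["SUCCESS job1", "FAILED job2", "SUCCESS job3"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))   # shuffle --> stage boundary in the DAG
print(counts.collect())                            # action --> one job with two stages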
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark=SparkSession.builder.master('local').appName('app').getOrCreate()
r1=spark.sparkContext.textFile("file:///home/hadoop/sample.txt")
r2=r1.map(lambda x:x.encode('utf-8'))      # each line becomes bytes
r2.persist(StorageLevel.MEMORY_ONLY)       # r2 will be persisted in memory
r3=r2.filter(lambda x:b'SUCCESS' in x)     # byte literal, since r2 holds bytes
print(r3.count())
r4=r2.filter(lambda x:b'FAILED' in x)
print(r4.count())
r2.unpersist()                             # release the persisted data
r3.persist()                               # default level MEMORY_ONLY, 1 replication
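Whether an RDD is currently persisted can be verified with getStorageLevel() and is_cached; a small check continuing the code above:

print(r3.getStorageLevel())   # storage level flags of the persisted r3
print(r2.is_cached)           # False --> r2 was released by unpersist()
print(r3.is_cached)           # True --> r3 is still marked as persisted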
Dot --> RDD (object)
Arrow --> transformation
textFile-->r1--t1-->r2--t2-->r3--t3-->r4--t4-->r5 (action)
textFile-->r1--t1-->r2--t2-->r3--t5-->r6--t6-->r7 (action)
Once an action is finished --> the computed data will be lost (unless persisted).
Application completion --> the lineage itself is released.
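The recomputation can be made visible with an accumulator: without persist(), every action re-runs the whole chain; after persist(), later actions read the stored data. A minimal sketch (all names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('recompute-demo').getOrCreate()
sc = spark.sparkContext

runs = sc.accumulator(0)

def tag(x):
    runs.add(1)               # counts how often the map function really executes
    return x

data = sc.parallelize(range(5)).map(tag)
data.count()
data.count()
print(runs.value)             # 10, not 5: each action recomputed the map

data.persist()
data.count()                  # materializes the cache: tag runs 5 more times
data.count()                  # served from memory: tag does not run again
print(runs.value)             # 15, not 20: the second count used the cache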
Persistence
-----------
Reuse of the data --> performance will increase.
Data can be persisted in memory or on disk, serialized or deserialized, with or without replication.
Persistence levels:
MEMORY_ONLY            --> memory, deserialized, 1 copy
MEMORY_ONLY_SER        --> memory, serialized, 1 copy
DISK_ONLY              --> disk, serialized, 1 copy
MEMORY_AND_DISK        --> memory if space is available, spills to disk otherwise; deserialized, 1 copy
MEMORY_AND_DISK_SER    --> memory and disk, serialized, 1 copy
MEMORY_ONLY_2          --> memory, deserialized, 2 copies
MEMORY_ONLY_SER_2      --> memory, serialized, 2 copies
DISK_ONLY_2            --> disk, serialized, 2 copies
MEMORY_AND_DISK_2      --> memory and disk, deserialized, 2 copies
MEMORY_AND_DISK_SER_2  --> memory and disk, serialized, 2 copies
OFF_HEAP               --> off-heap memory, serialized
Persistence is only a request, not a demand: if enough memory is not available, Spark may skip caching or evict the data.
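A short sketch of requesting a specific level (the RDD and its data are illustrative; note that PySpark always stores data serialized on the Python side, so the set of levels exposed by pyspark.StorageLevel differs slightly from the Scala list above):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('persist-demo').getOrCreate()

logs = spark.sparkContext.parallelize(["SUCCESS a", "FAILED b"])
logs.persist(StorageLevel.MEMORY_AND_DISK)   # memory first, spill to disk if needed
print(logs.count())                          # first action materializes the persisted data
print(logs.getStorageLevel())                # shows the level that was applied
logs.unpersist()                             # release once the reuse is over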