Bigdata_Spark_Consultant: Resiliance in PYSPARK

Monday, March 4, 2019

Resiliance in PYSPARK

Resilience in PYSPARK

Resilience:
Lineage
DAG
Persis stance

Resiliance--> Rolling back to the original state

r1=spark.sparkContext.textFile("")

r2=r1.t1()   r2--->r1 will be released
r3=r2.t2()   r3--->r2 will be released
r4=r3.t3()    r4--->r3 will be released
r5=r4.t4()     r5-->r5 will be released
r5.action()
r6=r3.t5()
r7=r6.t6()
r7.action2()
spark instead of execution a transformation,it remembers the transformation.
spark remembers the dependency of transformation in the form of a graph called as lineage graph

r5-->r4-->r3-->r2-->r1-->textFile(tracing the dependency)
Execution start from
textfile->r1->r2->r3->r4->r5(Execution of the flow-->lineage)

RDD(tf)-->RDD/DF

DF(tr)-->DF/RDD

Where does this lineage stored and how long?

memory and as long as the application is active
The execution of the transformation will be represented as a graph called
(DAG-Direct asyclic graph)

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark=SparkSession.builder.master('local').appName('app').getOrCreate()

r1=spark.sparkContext.textFile("file:///home/hadoop/sample.txt")

r2=r1.map(lambda x:x.encode('utf-8'))

r2.persist(StorageLevel.MEMORY_ONLY) #r2 wlll be persisted in memory

r3=r2.filter(lambda x:'SUCCESS' in x)

print(r3.count())

r4=r2.filter(lambda x:"FAILED" in x)

print(r4.count())

r2.unpersist() #to unpersist or release the persisted data

r3.persist()---#the data will be persisted to memory,de-serialized,1 replication

Dot-->RDD(Object)

arrow-->transformation

textFile-->r1--t1-->r2--t2-->r3--t3--->r4--t4-->r5(action)

textFile-->r1--t1-->r2--t2-->r3--t5-->r6--t6-->r7(Action)

once an action is finished-->the data will be lost
application completion-->
Persistance
------------
reuse of the data-->performance will increase
the data can be persisted in memory and disk,serialized,de-serialized,with or without replication
Perisitance Level:
MEMORY_ONLY
MEMORY_ONLY_SER
DISK_ONLY -->disk,serial,1 copy
MEMORY_AND_DISK -->if space in memory that memory ,de-serialized,1 copy
MEMORY_AND_DISK_SER
MEMORY_ONLY_2
MEMORY_ONLY_SER_2
DISK_ONLY_2
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER_2
OFF_HEAP

persistance is only a request not a demand

Bigdata_Spark_Consultant

Monday, March 4, 2019

Resiliance in PYSPARK

No comments:

Post a Comment

Python Challenges Program

Blog Archive

Labels