Monday, March 4, 2019

Resiliance in PYSPARK

Resilience in PYSPARK

Resilience:
Lineage
DAG
Persis stance
Resiliance--> Rolling back to the original state

r1=spark.sparkContext.textFile("")
r2=r1.t1()   r2--->r1 will be released
r3=r2.t2()   r3--->r2 will be released
r4=r3.t3()    r4--->r3 will be released
r5=r4.t4()     r5-->r5 will be released
r5.action()
r6=r3.t5()
r7=r6.t6()
r7.action2()
spark instead of execution a transformation,it remembers the transformation.
spark remembers the dependency of transformation in the form of a graph called as lineage graph

r5-->r4-->r3-->r2-->r1-->textFile(tracing the dependency)
Execution start from
textfile->r1->r2->r3->r4->r5(Execution of the flow-->lineage)

RDD(tf)-->RDD/DF
DF(tr)-->DF/RDD

Where does this lineage stored and how long?
memory and as long as the application is active
The execution of the transformation will be represented as a graph called
(DAG-Direct asyclic graph)

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark=SparkSession.builder.master('local').appName('app').getOrCreate()
r1=spark.sparkContext.textFile("file:///home/hadoop/sample.txt")

r2=r1.map(lambda x:x.encode('utf-8'))
r2.persist(StorageLevel.MEMORY_ONLY) #r2 wlll be persisted in memory

r3=r2.filter(lambda x:'SUCCESS' in x)
print(r3.count())

r4=r2.filter(lambda x:"FAILED" in x)

print(r4.count())
r2.unpersist()  #to unpersist or release the persisted data

r3.persist()---#the data will be persisted to memory,de-serialized,1 replication

Dot-->RDD(Object)
arrow-->transformation

textFile-->r1--t1-->r2--t2-->r3--t3--->r4--t4-->r5(action)

textFile-->r1--t1-->r2--t2-->r3--t5-->r6--t6-->r7(Action)
once an action is finished-->the data will be lost
application completion-->
Persistance
------------
reuse of the data-->performance will increase
the data can be persisted in memory and disk,serialized,de-serialized,with or without replication
Perisitance Level:
MEMORY_ONLY
MEMORY_ONLY_SER
DISK_ONLY  -->disk,serial,1 copy
MEMORY_AND_DISK  -->if space in memory that memory ,de-serialized,1 copy
MEMORY_AND_DISK_SER
MEMORY_ONLY_2
MEMORY_ONLY_SER_2
DISK_ONLY_2
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER_2
OFF_HEAP

persistance is only a request not a demand

No comments:

Post a Comment

Python Challenges Program

Challenges program: program 1: #Input :ABAABBCA #Output: A4B3C1 str1="ABAABBCA" str2="" d={} for x in str1: d[x]=d...