Persistence in PySpark
from pyspark import StorageLevel

r1 = spark.sparkContext.textFile("file:///home/hadoop/sample.txt")
r2 = spark.sparkContext.textFile("file:///home/hadoop/sample.txt")

r1.persist(StorageLevel.MEMORY_ONLY)       # partitions cached in memory
r2.persist(StorageLevel.MEMORY_ONLY_SER)   # serialized level; newer PySpark versions may not define this name, since Python data is always pickled anyway

print(r1.count())   # the first action materializes and caches each RDD
print(r2.count())
In Python --> pickle serializer
In Java/Scala --> objects can be kept in both serialized and deserialized form
In Java/Scala --> 2 serializers --> 1. Java serializer (default)  2. Kryo serializer
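Kryo can be switched on from a PySpark session as well, since it only changes how the JVM side serializes data (Python objects are still pickled); a minimal sketch, where the app name is a placeholder:

from pyspark.sql import SparkSession

# Replace the default Java serializer with Kryo on the JVM side.
# This affects shuffled data and blocks cached in serialized form,
# not the pickling of Python objects.
spark = (SparkSession.builder
         .appName("kryo-demo")   # hypothetical app name
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())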
Serialization --> data size gets reduced --> a JVM object is converted into a binary object
Deserialization --> data size stays as it is --> the binary object is converted back into a JVM object
data is by default in deserialized mode
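The same idea can be seen on the Python side, where PySpark uses pickle; a quick sketch with a throwaway record:

import pickle

record = {"name": "hadoop", "scores": [10, 20, 30]}   # in-memory (deserialized) object

blob = pickle.dumps(record)     # serialization: object --> compact byte string for storage/transfer
restored = pickle.loads(blob)   # deserialization: bytes --> usable Python object again

print(type(blob), len(blob))    # bytes and their size
print(restored == record)       # True; only the deserialized form can be processed directly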
SerDe:
storage (writing) --> serialization
read --> deserialization
an object can be used for processing only in deserialized mode
object (deserialized) --> processing
object (serialized) --> no processing
in Python --> objects can be stored only in serialized mode
in Java/Scala --> objects can be stored in serialized or deserialized mode
serialization --> saves space
deserialization --> saves time
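The predefined storage levels are just bundles of flags (useDisk, useMemory, useOffHeap, deserialized, replication), which is where this space-vs-time choice shows up; a quick look (exact flag values can differ between versions):

from pyspark import StorageLevel

# repr prints the flags: (useDisk, useMemory, useOffHeap, deserialized, replication)
print(StorageLevel.MEMORY_ONLY)      # in PySpark the deserialized flag stays False: data is pickled
print(StorageLevel.MEMORY_AND_DISK)
print(StorageLevel.DISK_ONLY_2)      # disk-backed, kept on 2 nodes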
under what situations do objects get serialized? (both cases are sketched below)
1. storing/caching
2. when objects are transferred over the network (mandatory)
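A minimal sketch of both cases, reusing the sample.txt path from above; the word-count step is only there to force a shuffle:

from pyspark import StorageLevel

rdd = spark.sparkContext.textFile("file:///home/hadoop/sample.txt")

# 1. storing/caching: cached partitions are pickled before being kept in memory
rdd.persist(StorageLevel.MEMORY_ONLY)

# 2. network transfer: a wide transformation such as reduceByKey shuffles records
#    between executors, and shuffled records are always serialized on the wire
word_counts = (rdd.flatMap(lambda line: line.split())
                  .map(lambda w: (w, 1))
                  .reduceByKey(lambda a, b: a + b))

print(word_counts.take(5))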
pyspark program --> Python objects <-- Py4J module --> JVM objects
scala/java program --> JVM objects
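The bridge can be made visible from a PySpark shell; _jvm is an internal Py4J attribute and is used here only for illustration:

sc = spark.sparkContext

# sc._jvm is a Py4J view of the driver-side JVM; every PySpark call ultimately
# crosses this bridge to reach the JVM objects that do the real work.
print(type(sc._jvm))
print(sc._jvm.java.lang.System.getProperty("java.version"))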
RDD --> divided into partitions
if data (partitions) is lost during processing --> Spark can automatically re-generate the lost partitions using lineage (resilience)
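The lineage Spark would replay to rebuild a lost partition can be inspected with toDebugString(); a quick sketch against the same file:

rdd = spark.sparkContext.textFile("file:///home/hadoop/sample.txt")
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b))

# toDebugString() shows the chain of parent RDDs (the lineage) that Spark
# re-runs to regenerate lost partitions.
print(counts.toDebugString().decode("utf-8"))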
if the number of transformations is small --> regeneration from the beginning is acceptable
if the number of transformations is large --> regeneration from the beginning of the flow is not a good choice
persistence is useful in 2 situations:
1. multiple actions on the same RDD (see the sketch below)
2. re-generation of lost data
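A minimal sketch of the first situation; without persist(), every action below would re-read and re-process the file from scratch:

from pyspark import StorageLevel

rdd = spark.sparkContext.textFile("file:///home/hadoop/sample.txt")
words = rdd.flatMap(lambda line: line.split())

words.persist(StorageLevel.MEMORY_ONLY)   # computed once, reused by every action below

print(words.count())              # 1st action: materializes and caches the partitions
print(words.distinct().count())   # 2nd action: reads from the cache, not from the file
print(words.take(5))              # 3rd action: also served from the cache

words.unpersist()                 # free the cached partitions when done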
Replication:
-------------
in case of data loss ---> recovery is either regeneration (lineage) or persistence with replication
if one replica of persisted data gets lost --> the other replica is fetched from another node
data transfer from one node to another --> time_x
if there is only one copy, regeneration has to start from the beginning --> time_y
keeping a replica on node 2 pays off when
time_x < time_y
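A replicated storage level keeps each cached partition on two nodes, so recovery becomes a network fetch instead of a lineage replay; a minimal sketch:

from pyspark import StorageLevel

rdd = spark.sparkContext.textFile("file:///home/hadoop/sample.txt")

# MEMORY_ONLY_2 = MEMORY_ONLY, but every cached partition is replicated
# on a second executor/node.
rdd.persist(StorageLevel.MEMORY_ONLY_2)

print(rdd.count())   # materializes the cache and its replicas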
a DataFrame can also be persisted
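For example (the CSV path and its columns are placeholders):

from pyspark import StorageLevel

df = spark.read.csv("file:///home/hadoop/sample.csv", header=True, inferSchema=True)

# df.cache() would use the default storage level; persist() lets us pick one explicitly.
df.persist(StorageLevel.MEMORY_AND_DISK)

print(df.count())   # first action fills the cache
df.show(5)          # later actions reuse the cached data

df.unpersist()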