Pyspark Spark Submit operations
Cluster:
--------
YARN ----> distributed computing cluster (Hadoop)
Cluster managers supported by Spark:
    Standalone
    Mesos
    Kubernetes-Docker
    YARN --> HADOOP
Hadoop (HDFS and YARN) --> master-slave architecture
             Master                  Slave
HDFS --->    NameNode, SNN           DataNode       http port (NN): 9870    RPC port (NN): 8020/9000
YARN --->    ResourceManager         NodeManager    http port (RM): 8088    RPC port (RM): 8032
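These web UIs can be checked from a browser or with curl (the hostnames below are placeholders):

curl http://namenode-host:9870     (NameNode web UI)
curl http://rm-host:8088           (ResourceManager web UI)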
    m1            w1     w2     w3     w4
    NN            DN     DN     DN     DN
    SNN           NM     NM     NM     NM
    RM
------------------------------------
metadata      b1 b2 b3 b4 ---------> filex  (one file stored as HDFS blocks on the DataNodes)
              p1 p2 p3 p4 --> r1 = spark.sparkContext.textFile(...)  (one partition per block)
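A minimal sketch of this block-to-partition mapping (the HDFS path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('blocks-demo').getOrCreate()
# filex is stored as blocks b1..b4 on the DataNodes; textFile() gives
# one RDD partition per block by default (p1..p4)
r1 = spark.sparkContext.textFile('hdfs:///data/filex')   # hypothetical path
print(r1.getNumPartitions())                             # ~ number of HDFS blocks
spark.stop()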
              8G     8G     8G     8G      (memory per worker)
              4c     4c     4c     4c      (cores per worker)

Application submission:
---------------------------
spark-submit ---> script to submit a Spark application to a cluster (irrespective of the cluster manager and the language used)
spark application
-----------------
1 Driver and one or more Executor(s) per application
Executor ---> logical unit created out of a certain amount of memory and a certain number of cores
Driver ---> definition is the same as that of an executor (memory + cores)
Executor --> processes/computes a partition
Driver --> takes care of all the executors; it does not take part in processing the data

from pyspark import SparkConf, SparkContext
import sys

if __name__ == '__main__':
    conf = SparkConf()
    # master: 'local' for local mode, or 'yarn' for YARN client mode
    sc = SparkContext('local', 'wcapp', conf=conf)
    r1 = sc.textFile(sys.argv[1])                    # input path from the command line
    r2 = (r1.flatMap(lambda x: x.split(' '))         # split each line into words
            .map(lambda x: (x, 1))                   # pair each word with a count of 1
            .reduceByKey(lambda x, y: x + y))        # sum the counts per word
    r2.coalesce(1).saveAsTextFile(sys.argv[2])       # write the result as a single part file
    sc.stop()
python wc.py            (run the script directly with the Python interpreter)
spark-submit "configuration options" programfilepath inputfilepath outputfilepath
spark-submit wc.py file:///ip1 ip2
Spark 2.x --> SparkSession
Spark 1.x --> SparkContext
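A minimal sketch of the two entry points (the app names are placeholders):

# Spark 1.x style: create a SparkContext directly
from pyspark import SparkConf, SparkContext
sc = SparkContext('local', 'app1', conf=SparkConf())
sc.stop()

# Spark 2.x style: SparkSession wraps a SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('app2').getOrCreate()
print(spark.sparkContext)    # the underlying SparkContext is still available
spark.stop()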
A Spark application can run in local (only one node) or cluster (YARN) mode
local   -- only 1 machine
cluster -- multiple nodes
1. How to configure the master (local/cluster(YARN)):
   while creating the Spark driver --> master ('local'/'yarn')
spark-submit --help
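The master can also be passed on the command line; a sketch (the input/output paths are placeholders):

spark-submit --master local[*]  wc.py file:///ip1 ip2     (local mode, all cores of one machine)
spark-submit --master yarn      wc.py ip1 ip2             (submit to the YARN cluster)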
In a YARN cluster --> default values for the executor and the driver:
executor/driver --> default memory 1G, default cores 1
App1 --> 1G, 1c
App2 --> 2G, 2c
In order to allow multiple applications to run on a cluster,
the applications must be submitted under the FAIR scheduler.
Spark supports 2 schedulers --> FIFO/FAIR
FIFO --> only 1 app can run at a time (default mode)
FAIR --> multiple apps
spark-submit --executor-cores 2 --executor-memory 2G \
             --conf spark.scheduler.mode=FAIR \
             --driver-cores 2 --driver-memory 2G \
             wc.py ip1 ip2
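The same settings can also be applied from code; a minimal sketch using SparkConf ('fair-app' is a placeholder name):

from pyspark import SparkConf, SparkContext

# equivalent of --conf spark.scheduler.mode=FAIR etc. on spark-submit
conf = (SparkConf()
        .set('spark.scheduler.mode', 'FAIR')
        .set('spark.executor.cores', '2')
        .set('spark.executor.memory', '2g'))
sc = SparkContext('yarn', 'fair-app', conf=conf)
sc.stop()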
2. Where the driver will be launched:
Executors are always launched on worker nodes only.
The driver can be launched on a worker node or outside the workers:
    if the driver has to be launched on a worker node --> deploy the app in cluster mode
    if the driver runs outside the worker nodes --> deploy the app in client mode
Edge Node (Gateway node) -- connected to both the internal and external networks
We submit the application from the edge node:
    to run the driver on the edge node   --> deploy in client mode (default)
    to run it on one of the workers      --> deploy in cluster mode
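Sketch of the two deploy modes on the command line (the input/output paths are placeholders):

spark-submit --master yarn --deploy-mode client  wc.py ip1 ip2    (driver runs on the edge node)
spark-submit --master yarn --deploy-mode cluster wc.py ip1 ip2    (driver runs on a worker node)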