Big Data Pipeline Scenario
use case:----------
big data world
3 stages/phases in the pipeline
end -------------------------->end
1. Data ingestion
2. Data preprocessing or data cleansing
3. Data analytics
----------------------------------
1. Data ingestion
Pulling the data from an external storage system into the Hadoop storage system (HDFS).
Clients usually maintain their data on their own servers, on premises or in the cloud.
Ingestion pulls the data from those on-premises/remote servers, wherever the data is available, into HDFS (a sketch follows the storage list below).
Servers on premises / cloud. Cloud providers --> AWS, GCP, Azure (storage, compute, intelligence, ...)
S3 (Simple Storage Service, the storage service in the AWS cloud)
Blob Storage (the storage service in Azure)
Google Cloud Storage (the storage service in GCP)
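A minimal ingestion sketch, assuming PySpark with the s3a connector configured on the cluster; the bucket and HDFS paths below are hypothetical placeholders, not part of the original notes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-s3-to-hdfs").getOrCreate()

# read the raw files straight from the client's S3 bucket (hypothetical bucket/path)
raw_df = spark.read.text("s3a://client-bucket/raw/2024-01-01/")

# land the data as-is into HDFS so the rest of the pipeline works from HDFS
raw_df.write.mode("overwrite").text("hdfs:///data/landing/2024-01-01/")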
--> Formats of data
structured, semi-structured, unstructured
structured --> tables
semi-structured --> CSV, JSON, XML
unstructured --> data with no properly organised structure
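How each shape of data typically lands in Spark; the table, file and path names are hypothetical, and Hive support is assumed for the registered table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-formats").enableHiveSupport().getOrCreate()

# structured: a proper table with a fixed schema (hypothetical Hive table)
orders = spark.table("sales_db.orders")

# semi-structured: CSV/JSON files, schema inferred or supplied at read time
csv_df = spark.read.option("header", True).csv("hdfs:///landing/orders.csv")
json_df = spark.read.json("hdfs:///landing/events.json")

# unstructured: no organised schema, read as plain lines of text
logs = spark.read.text("hdfs:///landing/server.log")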
2. Preprocessing or cleansing of data
Input ---> structured, semi-structured, unstructured
Output --> structured data
structured --> adding some data / deleting some data, dealing with null values
semi-structured/unstructured --> structured (a cleansing sketch follows)
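A small cleansing sketch, assuming the hypothetical JSON events from the ingestion step: drop duplicates, handle null values and select a fixed set of columns, so the output is a structured DataFrame ready for analysis (column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocessing").getOrCreate()

events = spark.read.json("hdfs:///landing/events.json")

clean = (
    events
    .dropDuplicates()                                   # remove exact duplicate records
    .na.drop(subset=["user_id"])                        # drop rows missing the key
    .na.fill({"country": "unknown"})                    # fill remaining null values
    .withColumn("event_date", F.to_date("event_ts"))    # add a derived column
    .select("user_id", "country", "event_date", "amount")  # enforce a fixed schema
)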
3. Analysis
Warehouses ---> Hive, Snowflake (a cloud data warehouse that runs on AWS/Azure/GCP)
S3 (raw) ---> preprocessing (gets the data ready for analysis) ---> store the data in some storage system (warehouse)
Large volumes (structured, semi-structured, unstructured) --> Spark ----------------> store into the warehouse (WH)
Spark reads, preprocesses and stores the data into the WH (sketched below).
Spark can do all this while running on a distributed computing cluster.
Cluster managers --> YARN, Standalone, Mesos, Kubernetes
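An end-to-end sketch of the Spark job described above: read raw data from S3, preprocess it, and store the structured result into a Hive warehouse table. Paths, table and column names are hypothetical; Hive support and the s3a connector are assumed to be configured. Submitted with spark-submit --master yarn, the same script runs on a YARN cluster:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("raw-to-warehouse")
    .enableHiveSupport()        # needed to write into Hive warehouse tables
    .getOrCreate()
)

# 1. ingestion: read the raw semi-structured data from S3
raw = spark.read.json("s3a://client-bucket/raw/events/")

# 2. preprocessing: cleanse and structure the data
clean = (
    raw.na.drop(subset=["user_id"])
       .withColumn("event_date", F.to_date("event_ts"))
)

# 3. analysis-ready: store into the warehouse, partitioned for later queries
clean.write.mode("append").partitionBy("event_date").saveAsTable("analytics.events")

spark.stop()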
The pipeline has to be automated --> cron
Airflow --> pipeline creator as well as scheduler (Python scripts)
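A minimal Airflow DAG sketch: one task per pipeline stage, wired in order and scheduled with a cron expression. The DAG id, script names and schedule are hypothetical; each stage is shown as a spark-submit call via BashOperator (Airflow 2.x import path):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="big_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # cron syntax: every day at 02:00
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="spark-submit --master yarn ingest_s3_to_hdfs.py",
    )
    preprocess = BashOperator(
        task_id="preprocess",
        bash_command="spark-submit --master yarn preprocess.py",
    )
    load_warehouse = BashOperator(
        task_id="load_warehouse",
        bash_command="spark-submit --master yarn load_to_warehouse.py",
    )

    # ingestion --> preprocessing --> load into the warehouse
    ingest >> preprocess >> load_warehouse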