Big Data Pipeline Scenario
use case:----------
big data world
3 stages/phases in the pipeline
end -------------------------->end
1. Data ingestion
2. Data preprocessing or data cleansing
3. Data analytics
----------------------------------
1. Data ingestion
Pulling the data from an external storage system into the Hadoop storage system (HDFS).
Clients usually maintain their data on their own servers, on premises or in the cloud.
Ingestion pulls the data from those on-premises/remote servers, wherever the data is available, into HDFS (a sketch follows the storage list below).
Servers on premises / cloud. Cloud providers --> AWS, GCP, Azure (storage, compute, intelligence, ...)
S3 (Simple Storage Service, the storage service in the AWS cloud)
Blob Storage (the storage service in Azure)
Google Cloud Storage (the storage service in GCP)
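A minimal ingestion sketch, assuming PySpark with the s3a connector configured on the cluster; the bucket and HDFS paths below are hypothetical placeholders, not part of the original notes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-s3-to-hdfs").getOrCreate()

# read the raw files straight from the client's S3 bucket (hypothetical bucket/path)
raw_df = spark.read.text("s3a://client-bucket/raw/2024-01-01/")

# land the data as-is into HDFS so the rest of the pipeline works from HDFS
raw_df.write.mode("overwrite").text("hdfs:///data/landing/2024-01-01/")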
--> Formats of data
structured, semi-structured, unstructured
structured --> tables
semi-structured --> CSV, JSON, XML
unstructured --> data with no properly organised structure
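How each shape of data typically lands in Spark; the table, file and path names are hypothetical, and Hive support is assumed for the registered table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-formats").enableHiveSupport().getOrCreate()

# structured: a proper table with a fixed schema (hypothetical Hive table)
orders = spark.table("sales_db.orders")

# semi-structured: CSV/JSON files, schema inferred or supplied at read time
csv_df = spark.read.option("header", True).csv("hdfs:///landing/orders.csv")
json_df = spark.read.json("hdfs:///landing/events.json")

# unstructured: no organised schema, read as plain lines of text
logs = spark.read.text("hdfs:///landing/server.log")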
2. Preprocessing or cleansing of data
Input ---> structured, semi-structured, unstructured
Output --> structured data
structured --> adding some data / deleting some data, dealing with null values
semi-structured/unstructured --> structured (a cleansing sketch follows)
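A small cleansing sketch, assuming the hypothetical JSON events from the ingestion step: drop duplicates, handle null values and select a fixed set of columns, so the output is a structured DataFrame ready for analysis (column names are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocessing").getOrCreate()

events = spark.read.json("hdfs:///landing/events.json")

clean = (
    events
    .dropDuplicates()                                   # remove exact duplicate records
    .na.drop(subset=["user_id"])                        # drop rows missing the key
    .na.fill({"country": "unknown"})                    # fill remaining null values
    .withColumn("event_date", F.to_date("event_ts"))    # add a derived column
    .select("user_id", "country", "event_date", "amount")  # enforce a fixed schema
)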
3. Analysis
Warehouses ---> Hive, Snowflake (a cloud data warehouse that runs on AWS/Azure/GCP)
S3 (raw) ---> preprocessing (gets the data ready for analysis) ---> store the data in some storage system (warehouse)
Large volumes (structured, semi-structured, unstructured) --> Spark ----------------> store into the warehouse (WH)
Spark reads, preprocesses and stores the data into the WH (sketched below).
Spark can do all this while running on a distributed computing cluster.
Cluster managers --> YARN, Standalone, Mesos, Kubernetes
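An end-to-end sketch of the Spark job described above: read raw data from S3, preprocess it, and store the structured result into a Hive warehouse table. Paths, table and column names are hypothetical; Hive support and the s3a connector are assumed to be configured. Submitted with spark-submit --master yarn, the same script runs on a YARN cluster:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("raw-to-warehouse")
    .enableHiveSupport()        # needed to write into Hive warehouse tables
    .getOrCreate()
)

# 1. ingestion: read the raw semi-structured data from S3
raw = spark.read.json("s3a://client-bucket/raw/events/")

# 2. preprocessing: cleanse and structure the data
clean = (
    raw.na.drop(subset=["user_id"])
       .withColumn("event_date", F.to_date("event_ts"))
)

# 3. analysis-ready: store into the warehouse, partitioned for later queries
clean.write.mode("append").partitionBy("event_date").saveAsTable("analytics.events")

spark.stop()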
The pipeline has to be automated --> cron
Airflow --> pipeline creator as well as scheduler (Python scripts)
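A minimal Airflow DAG sketch: one task per pipeline stage, wired in order and scheduled with a cron expression. The DAG id, script names and schedule are hypothetical; each stage is shown as a spark-submit call via BashOperator (Airflow 2.x import path):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="big_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # cron syntax: every day at 02:00
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="spark-submit --master yarn ingest_s3_to_hdfs.py",
    )
    preprocess = BashOperator(
        task_id="preprocess",
        bash_command="spark-submit --master yarn preprocess.py",
    )
    load_warehouse = BashOperator(
        task_id="load_warehouse",
        bash_command="spark-submit --master yarn load_to_warehouse.py",
    )

    # ingestion --> preprocessing --> load into the warehouse
    ingest >> preprocess >> load_warehouse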