Friday, March 8, 2019

BigDataPipelineScenario


Use case:
----------
In the big data world, a pipeline runs end to end through 3 stages/phases:
1. Data ingestion
2. Data preprocessing (data cleansing)
3. Data analytics
----------------------------------
1. Data ingestion
       Pulling the data from an external storage system into the Hadoop storage system (HDFS).
                  The client maintains its data on servers on premises or in the cloud.
                  Ingestion pulls that data from the on-premises/remote servers into HDFS.

       Where the data is available
                  Servers on premises or in the cloud
                  Cloud providers --> AWS, GCP, Azure (storage, compute, intelligence, ...)
                  S3 (Simple Storage Service, the storage service in AWS)
                  Blob Storage (Azure)
                  Google Cloud Storage (the storage service in GCP)

                  --> Formats of data
                     structured, semi-structured, unstructured
                                structured --> tables
                                semi-structured --> CSV, JSON, XML
                                unstructured --> data with no proper organisation
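
The format categories above can be sketched in a few lines of Python. This is only an illustration, not a real ingestion tool: the names `classify_format` and `plan_ingestion`, the `/data/raw` HDFS root, and the choice of `.parquet`/`.orc` as "table" formats are all assumptions made for the example. It only plans where each source file would land in HDFS and copies nothing.

```python
from pathlib import PurePosixPath

# Extension-to-category mapping that follows the notes above.
# Treating ".parquet" and ".orc" as table formats is an assumption of this sketch.
FORMAT_CATEGORIES = {
    ".parquet": "structured",
    ".orc": "structured",
    ".csv": "semi-structured",
    ".json": "semi-structured",
    ".xml": "semi-structured",
}


def classify_format(path):
    """Return the format category for a file path or object key."""
    suffix = PurePosixPath(path).suffix.lower()
    # Anything we do not recognise is treated as unstructured data.
    return FORMAT_CATEGORIES.get(suffix, "unstructured")


def plan_ingestion(keys, hdfs_root="/data/raw"):
    """Map each source key to a hypothetical HDFS target path, grouped by category."""
    plan = {}
    for key in keys:
        category = classify_format(key)
        name = PurePosixPath(key).name
        plan[key] = f"{hdfs_root}/{category}/{name}"
    return plan
```

For example, `plan_ingestion(["orders/2019-03-08.json"])` would map that key to `/data/raw/semi-structured/2019-03-08.json`. A real job would then do the copy with whatever client the source system provides.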
2. Preprocessing (cleansing) of data
     input --> structured, semi-structured, unstructured
     output --> structured data
                  structured --> add or delete some data, deal with null values
                  semi-structured/unstructured --> convert to structured
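
A minimal sketch of this step, assuming the input is semi-structured JSON (one record per line) and the output is rows with a fixed set of columns. The schema `COLUMNS`, the fill values in `DEFAULTS`, and the rule "drop records without an id" are hypothetical choices for the example, not something the notes prescribe.

```python
import json

# Hypothetical target schema and null-replacement values for this sketch.
COLUMNS = ("id", "name", "amount")
DEFAULTS = {"name": "unknown", "amount": 0.0}


def to_structured(json_lines):
    """Turn JSON-lines text records into fixed-width tuples (structured rows)."""
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("id") is None:
            continue  # delete records that have no usable key
        row = []
        for column in COLUMNS:
            value = record.get(column)
            if value is None:
                # deal with null or missing values by filling a default
                value = DEFAULTS.get(column)
            row.append(value)
        # fields outside COLUMNS are dropped so every row has the same shape
        rows.append(tuple(row))
    return rows
```

Every output row has the same three columns in the same order, so it can be loaded into a table. On large volumes the same logic would be expressed in Spark rather than in plain Python.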
3. Analysis
      Warehouses --> Hive, Snowflake (a cloud warehouse that runs on AWS, Azure or GCP)

S3 (raw) --> preprocessing (gets the data ready for analysis) --> store the data in a storage system (warehouse)
Large volumes (structured, semi-structured, unstructured) --> Spark --> store into the warehouse (WH)
Spark reads, preprocesses and stores the data into the WH.
Spark can do all of this because it runs on a distributed computing cluster.
Cluster managers --> YARN, Standalone, Mesos, Kubernetes
The pipeline has to be automated --> cron
Airflow --> pipeline creator as well as scheduler (pipelines are written as Python scripts)
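
To make the automation idea concrete, here is a plain-Python sketch of what a scheduler would trigger on every run: the three stages executed in order, each feeding the next, stopping at the first failure. It does not use the Airflow API. The function names and the toy stage bodies are hypothetical stand-ins for the real ingestion, preprocessing and analytics jobs.

```python
def ingest(source):
    """Stage 1 stand-in: pull raw records from the source."""
    return list(source)


def preprocess(records):
    """Stage 2 stand-in: drop null values."""
    return [r for r in records if r is not None]


def analyze(records):
    """Stage 3 stand-in: compute simple aggregates."""
    return {"count": len(records), "total": sum(records)}


STAGES = [("ingest", ingest), ("preprocess", preprocess), ("analyze", analyze)]


def run_pipeline(stages, payload):
    """Run (name, function) stages in order; stop at the first failure."""
    log = []
    for name, stage in stages:
        try:
            payload = stage(payload)
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            return None, log
        log.append((name, "ok"))
    return payload, log
```

With cron, a script that calls `run_pipeline(STAGES, ...)` could be scheduled with an entry such as `0 2 * * *` (every day at 02:00). A workflow tool such as Airflow adds the ordering between stages, retries and monitoring on top of the schedule.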
            

                               

                               
