AWS
----------
AWS offers many services
storage service-->S3 (Simple Storage Service), a low-cost storage service
S3 is a distributed, fault-tolerant (replicated) storage system
HDFS-->file system
data is stored in S3 in buckets (as objects)
bucket-->directories/files
compute services:
-----------------
spark, hadoop---->EC2 (Elastic Compute Cloud), EMR (Elastic MapReduce)
EC2-->an instance is a node with an operating system (Linux, Windows, macOS)
and some configuration (CPU, memory, storage)
generated key pair (private key and public key)-->the private key (.pem file) stays with us and the AWS instance holds the public key
we connect from our machine to the EC2 instance over the SSH protocol (port 22): the client presents the private key and the instance verifies it against the public key
Every instance will have an internal IP and an external IP
tool-->PuTTY
PuTTY needs the hostname (IP address) and a .ppk file to connect to a remote instance through SSH
1.we can generate a .ppk file from the .pem file using PuTTYgen
2.use this .ppk file with PuTTY to connect to the remote instance
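(on Linux/macOS PuTTY is not needed; the .pem file can be used directly, e.g. ssh -i mykey.pem ubuntu@<external-ip>, where ubuntu is typically the default user on an Ubuntu AMI)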
-------------------------------------------------------------
EC2 instance--->Ubuntu 16.04, Python 3.5.2
instance with Hadoop and Spark pre-installed--->EMR
--------------------------------------------------------------
in order to access any AWS account, we need to provide security credentials
1.access key
2.secret key
2 types of users
1.Admin (full access to all AWS services)
2.IAM (Identity and Access Management) users (restricted privileges)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key',' ')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key',' ')
df1=sqlContext.read.format('json').load('s3a://pysparkinput/world_bank_json.txt')
df1.show(5)
df1.write.format('parquet').partitionBy('date').save('s3a://pysparkoutput/dfout/')
spark app-->read and write from S3
needs the Hadoop common and Hadoop AWS libraries (matching versions):
org.apache.hadoop:hadoop-common:2.6.3
org.apache.hadoop:hadoop-aws:2.6.3
they can be pulled in through the spark.jars.packages option:
.config('spark.jars.packages','org.apache.hadoop:hadoop-aws:2.6.3')
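putting these pieces together, a minimal end-to-end sketch (assuming Spark 2.x via SparkSession; the key values are placeholders and the 'date' partition column is taken from the note above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('s3-read-write')
         # pull the hadoop-aws connector when the session is first created
         .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.6.3')
         .getOrCreate())

sc = spark.sparkContext
# hand the credentials to the s3a filesystem (placeholders; never hard-code real keys)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', '<ACCESS_KEY>')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '<SECRET_KEY>')

df1 = spark.read.format('json').load('s3a://pysparkinput/world_bank_json.txt')
df1.show(5)
df1.write.format('parquet').partitionBy('date').save('s3a://pysparkoutput/dfout/')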
tools that can be used to interact with AWS services
accessing an EC2 instance-->PuTTY/SSH
we can use Spark to read and write from AWS S3
a Python package to interact with AWS services (including S3) programmatically is "boto3"
a Python package to interact with AWS services (including S3) via a command line is "awscli"
both need the access key and secret key
boto3--->Python package to interact with S3 (and other services)
cli tool-->awscli--->covers all the services of Amazon
pip install boto3
import boto3
#to interact with s3, we need credentials (Access key and Secret key)
Users:
1.root(Admin)
2.IAM (restricted privileges)
a user will have credentials (access key and secret key)
s3client=boto3.resource('s3')   #creates an S3 service resource object
#Accessing the buckets
buckets=s3client.buckets.all()
print(buckets)
for bucket in buckets:
    print(bucket.name)
two ways to supply these credentials to boto3:
1.set environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
2.under the current user's home directory, create a .aws folder with a credentials (or config) file in .ini format:
[default]
aws_access_key_id=
aws_secret_access_key=
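a third option (useful for quick tests, shown here only as a sketch) is to pass the keys explicitly when building a boto3 session; the key values below are placeholders:

import boto3

# explicit credentials instead of environment variables or the .aws/credentials file
session = boto3.Session(aws_access_key_id='<ACCESS_KEY>',
                        aws_secret_access_key='<SECRET_KEY>')
s3 = session.resource('s3')
for bucket in s3.buckets.all():
    print(bucket.name)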
#creating a bucket (outside us-east-1 the region must be given explicitly; the region below is only an example)
s3client.create_bucket(Bucket='boto3bucket',
                       CreateBucketConfiguration={'LocationConstraint': 'ap-south-1'})
#accessing an existing bucket and listing its objects
readbucket=s3client.Bucket('pysparkinput')
bucketlist=readbucket.objects.all()
for obj in bucketlist:
    print(obj.key)
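the same resource can also upload and download objects; a minimal sketch (the local file names are placeholders):

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('pysparkinput')
# upload a local file into the bucket (placeholder local path and key)
bucket.upload_file(Filename='sample.txt', Key='sample.txt')
# download an object from the bucket to a local file
bucket.download_file(Key='world_bank_json.txt', Filename='world_bank_json.txt')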
#Command line --CMD prompt
pip install awscli
>aws
>aws s3 ls
>aws s3 ls pysparkinput
>aws s3 rm s3://pysparkinput/xxx
>aws configure
>aws s3 cp C:\sample s3://pysparkinput
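note: aws configure prompts for the access key, secret key, default region and output format and stores them under the same .aws folder that boto3 reads, so it only needs to be run once before the other s3 commands.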
performance tuning--->persistence, checkpointing, memory management (executors), shuffling behaviour, serialization
memory management-->static and dynamic
example: an executor's memory is 5GB
that memory is split between storage (caching) and processing (execution)
static-->a fixed split of the 5GB (100%): 2.5GB (50%) for storage, 2.5GB (50%) for processing
dynamic-->any fraction: the 5GB is shared dynamically between storage and processing
auto scaling-->dynamic allocation of executors (e.g. start with 5 executors and grow or shrink with the load)
serializer-->Java serializer, Kryo serializer
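a sketch of how these knobs map to Spark configuration properties (the values are only illustrative, not tuned recommendations):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('tuning-sketch')
         .config('spark.executor.memory', '5g')                 # memory per executor (example value)
         .config('spark.memory.fraction', '0.6')                # share of the heap used for storage + execution
         .config('spark.memory.storageFraction', '0.5')         # part of that pool protected for storage
         .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
         .config('spark.dynamicAllocation.enabled', 'true')     # auto scaling of executors (needs the shuffle service)
         .config('spark.dynamicAllocation.initialExecutors', '5')
         .getOrCreate())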
spark streaming
directory-->file (monitor a directory and process new files as they arrive)
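a minimal DStream sketch for this idea, assuming the s3a connector (or any HDFS-compatible filesystem) is configured; the directory path is a placeholder:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='dir-stream-sketch')
ssc = StreamingContext(sc, 10)                       # 10-second batch interval (example)
# every new file that lands in the directory is picked up in the next batch
lines = ssc.textFileStream('s3a://pysparkinput/streamdir/')
lines.pprint()
ssc.start()
ssc.awaitTermination()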