Wednesday, February 19, 2020

Sqoop commands brush-up

SQOOP
Chapters in sqoop:
•  List databases
•  List tables
•  Eval (select query evaluation)
•  Sqoop import
•  Import a table from RDBMS
1. Into HDFS by using sqoop
2. Into Hive by using sqoop
•  Import all tables from RDBMS
•  Import into a target directory
•  Import a subset of table data
•  Incremental import
•  Sqoop export
•  Sqoop job
•  Create a sqoop job
•  Verify the job
•  Inspect the job
•  Execute the job


List databases
$ sqoop list-databases --connect jdbc:mysql://localhost/world --username root -P

List tables
$ sqoop list-tables --connect jdbc:mysql://localhost/world --username root -P

EVAL
•  Select query evaluation
$ sqoop eval --connect jdbc:mysql://localhost/world --username root -P --query "SELECT * FROM city LIMIT 10"
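
eval is not limited to SELECT; it can also run DML against the source database (the merge section further down uses it for update and insert). For example (assuming the standard world.city schema of ID, Name, CountryCode, District, Population, and an unused ID value):
$ sqoop eval --connect jdbc:mysql://localhost/world --username root -P --query "INSERT INTO city VALUES (5000, 'Sampletown', 'USA', 'Sample District', 1000)"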


Boundary query
sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username=root \
  --password=cloudera \
  --table city \
  --target-dir /user/cloudera/city2 \
  -m 2 \
  --boundary-query "select 1, 4079 from city limit 1"

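Note: --boundary-query overrides the min/max query Sqoop would otherwise run on the split column to work out each mapper's range. To see the bounds Sqoop would compute on its own (assuming the same connection details), something like:

$ sqoop eval --connect jdbc:mysql://localhost/world --username root -P --query "SELECT MIN(ID), MAX(ID) FROM city"
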
sqoop Import
•  Import a table from RDBMS (MySQL)
$ sqoop import --connect jdbc:mysql://localhost/world --table city -m 1 --username root -P

$ sqoop import --connect jdbc:mysql://localhost/world --table city --username root -P

Other sqoop import operations:

•  Import into a target directory
$ sqoop import --connect jdbc:mysql://localhost/world --table city --username root -P --target-dir /user/cloudera/world_city/ -m 1

•  Import all tables from RDBMS
$ sqoop import-all-tables --connect jdbc:mysql://localhost/world -m 1 --warehouse-dir /user/cloudera/world_city1/ --username root -P

•  Import a subset of table data
$ sqoop import \
  --connect jdbc:mysql://localhost/world \
  --table city --where "ID between 1 and 23" \
  --target-dir /user/cloudera/city1 \
  -m 1 \
  --username root \
  -P

•  Delimiter
sqoop import \
  --connect jdbc:mysql://localhost/world \
  --username=root \
  --password=cloudera \
  --table city \
  --target-dir /user/cloudera/city4 \
  --fields-terminated-by '|'


•  split-by
$ sqoop import --connect jdbc:mysql://localhost/world --table city --username root -P --split-by ID --target-dir /user/cloudera/lab2

•  Incremental import (append)

$ sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username=root \
  -P \
  --table city \
  --target-dir /user/cloudera/city1 \
  --append \
  --check-column "ID" \
  --incremental append \
  --last-value 23

SQOOP EXPORT
MySQL terminal:
$ mysql -u root -p
Enter password: cloudera
mysql> use world;
mysql> create table citytest like city;
mysql> select * from citytest;

Local terminal:

$ sqoop export --connect jdbc:mysql://localhost/employee --table emp -m 1 --export-dir /user/cloudera/emp --input-fields-terminated-by ','

sqoop export --connect "jdbc:mysql://localhost/world" \
--username root \
-P \
--table citytest \
--export-dir /user/cloudera/city1/part* \
-m 1


SQOOP JOB:
To create a sqoop job:
$ sqoop job --create importtoday -- import --connect jdbc:mysql://localhost/world --table city -m 1 --username root -P
Verify the sqoop job:
$ sqoop job --list
Inspect the sqoop job:
$ sqoop job --show importtoday
Execute the sqoop job:
$ sqoop job --exec importtoday
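Delete the sqoop job (removes the saved job definition):
$ sqoop job --delete importtoday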

SQOOP HIVE ACTIONS
Steps:
MySQL terminal:
mysql> show databases;
mysql> use world;
mysql> show tables;
mysql> select * from City;

Hive terminal:
$ hive
hive> select * from City;
hive> describe City;

Sqoop terminal (sqoop import into a hive table):

$ sqoop import --connect jdbc:mysql://localhost/world --table city -m 1 --username root -P --hive-import
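
A quick way to confirm the hive import landed (assuming hive is on the path and the table went into the default database):

$ hive -e "select count(*) from city"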

Important objectives on sqoop:
DATA INGESTION with sqoop
1. Import data from a table in a relational database into HDFS
2. Import the results of a query from a relational database into HDFS
3. Import a table from a relational database into a new or existing Hive table
4. Insert or update data from HDFS into a table in a relational database

Code:
1. Import data from a table in a relational database into HDFS
Syntax:
$ sqoop import --connect jdbc:rdbmsname://localhost/dbname --table tablename -m 1 --username root -P

2. Import the results of a query from a relational database into HDFS
Syntax:
$ sqoop eval --connect jdbc:rdbmsname://localhost/dbname -e "query" --username root -P

3. Import a table from a relational database into a new or existing Hive table
Syntax:
$ sqoop import --connect jdbc:rdbmsname://localhost/dbname --table tablename -m 1 --hive-import

4. Insert or update data from HDFS into a table in a relational database
Syntax:
$ sqoop export --connect jdbc:rdbmsname://localhost/dbname --table tablename --export-dir /dbname/tablename --username root -P


                 ---------------------------***practice makes perfect *** do more practice***------------------

sqoop import --connect jdbc:teradata://{host name}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 \
  --check-column modified_date --incremental lastmodified --last-value {last_import_date}

sqoop import --connect jdbc:teradata://{host name}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc \
  --password dbc --target-dir /user/hive/incremental_table -m 1 \
  --query 'select * from SOURCE_TBL where modified_date > {last_import_date} AND $CONDITIONS'

Reference :
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html

Managed table:
sqoop create-hive-table 

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_create_hive_table_literal
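
sqoop create-hive-table only generates the Hive table definition from the RDBMS table's schema; it moves no data. A minimal sketch against the same world database (the --hive-table name here is just for illustration):

sqoop create-hive-table \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  -P \
  --table city \
  --hive-table city_def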



sqoop list-databases \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera

sqoop list-tables --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera

sqoop eval \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --query "select count(1) from City"

sqoop import \
  -m 12 \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --table City \
  --as-avrodatafile \
  --warehouse-dir=/user/hive/warehouse/retail_stage.db

--Default
sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --table City \
  --as-textfile \
  --target-dir=/user/cloudera/departments

sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --table City \
  --as-sequencefile \
  --target-dir=/user/cloudera/City

sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --table City \
  --as-avrodatafile \
  --target-dir=/user/cloudera/departments

-- A file with extension avsc will be created under the directory from which sqoop import is executed
-- Copy avsc file to HDFS location
-- Create hive table with LOCATION to /user/cloudera/departments and TBLPROPERTIES pointing to avsc file
hadoop fs -put sqoop_import_departments.avsc /user/cloudera

CREATE EXTERNAL TABLE departments
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/cloudera/departments'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart.cloudera/user/cloudera/sqoop_import_departments.avsc');
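
A quick sanity check after creating the external table (assuming the Avro import above populated /user/cloudera/departments):

hive -e "select count(1) from departments"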


-- It will create tables in the default database in hive
-- Using snappy compression
-- As we have imported all tables before, make sure you drop the directories
-- Launch hive and drop all tables
drop table departments;
drop table categories;
drop table products;
drop table orders;
drop table order_items;
drop table customers;

-- Dropping directories, in case your hive database/tables are in an inconsistent state
hadoop fs -rm -R /user/hive/warehouse/departments
hadoop fs -rm -R /user/hive/warehouse/categories
hadoop fs -rm -R /user/hive/warehouse/products
hadoop fs -rm -R /user/hive/warehouse/orders 
hadoop fs -rm -R /user/hive/warehouse/order_items
hadoop fs -rm -R /user/hive/warehouse/customers

sqoop import-all-tables \
  --num-mappers 1 \
  --connect "jdbc:mysql://localhost/world" \
  --username=root \
  --password=cloudera \
  --hive-import \
  --hive-overwrite \
  --create-hive-table \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --outdir java_files

sudo -u hdfs hadoop fs -mkdir /user/cloudera/retail_stage
sudo -u hdfs hadoop fs -chmod +rw /user/cloudera/retail_stage
hadoop fs -copyFromLocal ~/*.avsc /user/cloudera/retail_stage

-- Basic import
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/cloudera/departments 

-- Boundary Query and columns
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 2 \
  --boundary-query "select 2, 8 from departments limit 1" \
  --columns department_id,department_name

-- query and split-by
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --query="select * from orders join order_items on orders.order_id = order_items.order_item_order_id where \$CONDITIONS" \
  --target-dir /user/cloudera/order_join \
  --split-by order_id \
  --num-mappers 4
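
When --query is used, --target-dir is required and the statement must include the $CONDITIONS token, which Sqoop replaces with each mapper's split predicate. Inside double quotes it has to be escaped as \$CONDITIONS so the shell leaves it alone; with single quotes no escape is needed, e.g. the equivalent form:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --query='select * from orders join order_items on orders.order_id = order_items.order_item_order_id where $CONDITIONS' \
  --target-dir /user/cloudera/order_join \
  --split-by order_id \
  --num-mappers 4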

-- Copying into existing table or directory (append)
-- Customizing number of threads (num-mappers)
-- Changing delimiter
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --num-mappers 1 \
  --outdir java_files

-- Importing a table without a primary key using multiple threads (split-by)
-- When using split-by, an indexed column is highly desired
-- If the column is not indexed then performance will suffer,
-- because each thread ends up doing a full table scan
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --split-by department_id \
  --outdir java_files

-- Getting delta (--where)
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --split-by department_id \
  --where "department_id > 7" \
  --outdir java_files

-- Incremental load
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --check-column "department_id" \
  --incremental append \
  --last-value 7 \
  --outdir java_files

sqoop job --create sqoop_job \
  -- import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --check-column "department_id" \
  --incremental append \
  --last-value 7 \
  --outdir java_files

sqoop job --list

sqoop job --show sqoop_job

sqoop job --exec sqoop_job

-- Hive related
-- Overwrite existing data associated with hive table (hive-overwrite)
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --hive-home /user/hive/warehouse/retail_ods.db \
  --hive-import \
  --hive-overwrite \
  --hive-table departments \
  --outdir java_files

--Create hive table example
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --hive-home /user/hive/warehouse \
  --hive-import \
  --hive-table departments_test \
  --create-hive-table \
  --outdir java_files

--Connect to mysql and create database for reporting database
--user:root, password:cloudera
mysql -u root -p
create database retail_rpt_db;
grant all on retail_rpt_db.* to retail_dba;
flush privileges;
use retail_rpt_db;
create table departments as select * from retail_db.departments where 1=2;
exit;

--For certification change database name retail_rpt_db to retail_db
sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_rpt_db" \
       --username retail_dba \
       --password cloudera \
       --table departments \
       --export-dir /user/hive/warehouse/retail_ods.db/departments \
       --input-fields-terminated-by '|' \
       --input-lines-terminated-by '\n' \
       --num-mappers 2 \
       --batch \
       --outdir java_files

sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --table departments \
  --export-dir /user/cloudera/sqoop_import/departments_export \
  --batch \
  --outdir java_files \
  -m 1 \
  --update-key department_id \
  --update-mode allowinsert

sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --table departments_test \
  --export-dir /user/hive/warehouse/departments_test \
  --input-fields-terminated-by '\001' \
  --input-lines-terminated-by '\n' \
  --num-mappers 2 \
  --batch \
  --outdir java_files \
  --input-null-string nvl \
  --input-null-non-string -1
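
The '\001' above is Ctrl-A, the default field delimiter for Hive text tables created without a custom ROW FORMAT; --input-null-string and --input-null-non-string tell Sqoop which tokens in the files stand for NULL. To eyeball the delimiter actually used in the underlying files (assuming the table's data lives under the warehouse path above):

hadoop fs -cat /user/hive/warehouse/departments_test/part* | cat -A | head -5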

--Merge process begins
hadoop fs -mkdir /user/cloudera/sqoop_merge

--Initial load
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --as-textfile \
  --target-dir=/user/cloudera/sqoop_merge/departments

--Validate
sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --query "select * from departments" 

hadoop fs -cat /user/cloudera/sqoop_merge/departments/part*

--update
sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --query "update departments set department_name='Testing Merge' where department_id = 9000"

--Insert
sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --query "insert into departments values (10000, 'Inserting for merge')"

sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --query "select * from departments"

--New load
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --as-textfile \
  --target-dir=/user/cloudera/sqoop_merge/departments_delta \
  --where "department_id >= 9000"

hadoop fs -cat /user/cloudera/sqoop_merge/departments_delta/part*

--Merge
sqoop merge --merge-key department_id \
  --new-data /user/cloudera/sqoop_merge/departments_delta \
  --onto /user/cloudera/sqoop_merge/departments \
  --target-dir /user/cloudera/sqoop_merge/departments_stage \
  --class-name departments \
  --jar-file
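
sqoop merge needs the record class and jar that Sqoop generates for the table; the value for --jar-file is left blank above. One way to produce them (assuming the same connection details) is sqoop codegen, which writes the generated jar and logs its path:

sqoop codegen \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --table departments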

hadoop fs -cat /user/cloudera/sqoop_merge/departments_stage/part*

--Delete old directory
hadoop fs -rm -R /user/cloudera/sqoop_merge/departments

--Move/rename stage directory to original directory
hadoop fs -mv /user/cloudera/sqoop_merge/departments_stage /user/cloudera/sqoop_merge/departments 

--Validate that original directory have merged data
hadoop fs -cat /user/cloudera/sqoop_merge/departments/part*

--Merge process ends


sqoop import --connect "jdbc:mysql://localhost/world" --username=root --password=cloudera --query="select City.*,CountryLanguage.Percentage,CountryLanguage.IsOfficial from City join CountryLanguage on City.CountryCode = CountryLanguage.CountryCode where City.ID=4079 and \$CONDITIONS" --target-dir /user/cloudera/order_join -m 1

