Wednesday, February 19, 2020

sqoop commands brushup

SQOOP
Chapters in sqoop:
•  List databases
•  List tables
•  EVAL
•  Select query evaluation
•  sqoop Import
•  Import a table from RDBMS
1. Into hdfs by using sqoop
2. Into hive by using sqoop
•  Import all tables from RDBMS
•  Import into a target directory
•  Import a subset of table data
•  Incremental import
•  sqoop export
•  sqoop Job
•  create a sqoop job
•  verify the job
•  Inspect the job
•  execute the job


List databases
$ sqoop list-databases --connect jdbc:mysql://localhost/world --username root -P

List tables
$ sqoop list-tables --connect jdbc:mysql://localhost/world --username root -P

EVAL
•  Select query evaluation
$ sqoop eval --connect jdbc:mysql://localhost/world --username root -P --query "SELECT * FROM city LIMIT 10"


Boundary query
sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username=root \
  --password=cloudera \
  --table city \
  --target-dir /user/cloudera/city2 \
  -m 2 \
  --boundary-query "select 1, 4079 from city limit 1"

sqoop Import
•  Import a table from RDBMS (mysql)
$ sqoop import --connect jdbc:mysql://localhost/world --table city -m1 --username root -P

sqoop import --connect jdbc:mysql://localhost/world --table city --username root -P

Other sqoop import operations:

•  Import into a target directory
$ sqoop import --connect jdbc:mysql://localhost/world --table city --username root -P --target-dir /user/cloudera/world_city/ -m1

•  Import all tables from RDBMS
$ sqoop import-all-tables --connect jdbc:mysql://localhost/world -m1 --warehouse-dir /user/cloudera/world_city1/ --username root -P

•  Import a subset of table data
$ sqoop import \
  --connect jdbc:mysql://localhost/world \
  --table city \
  --where "ID between 1 and 23" \
  --target-dir /user/cloudera/city1 \
  -m 1 \
  --username root \
  -P

•  delimiter
sqoop import \
--connect jdbc:mysql://localhost/world \
--username=root \
--password=cloudera \
--table city \
--target-dir /user/cloudera/city4 \
--fields-terminated-by '|'


•  split-by
•  sqoop import --connect jdbc:mysql://localhost/world --table city --username root -P --split-by ID --target-dir /user/cloudera/lab2

•  Incremental import (append)

$ sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username=root \
  -P \
  --table city \
  --target-dir /user/cloudera/city1 \
  --append \
  --check-column "ID" \
  --incremental append \
  --last-value 23

SQOOP EXPORT
$ mysql -u root -p
Enter password: cloudera
mysql> use world;
mysql> create table citytest like city;
mysql>  select * from citytest;

Local terminal:

cloudera@localhost> sqoop export --connect jdbc:mysql://localhost/employee --table emp -m 1 --export-dir /user/cloudera/emp --input-fields-terminated-by ','

sqoop export --connect "jdbc:mysql://localhost/world" \
--username root \
-P \
--table citytest \
--export-dir /user/cloudera/city1/part* \
-m1


SQOOP JOB:
To create a sqoop job:
$sqoop job --create importtoday -- import --connect jdbc:mysql://localhost/world --table city -m1 --username root -P
Verify the sqoop job:
 $sqoop job --list
Inspect the sqoop job:
 $sqoop job --show importtoday
Execute  the sqoop job:
 $sqoop job --exec importtoday

SQOOP HIVE ACTIONS
Steps:
Mysql terminal:
Mysql> show databases;
Mysql>use world;
Mysql> show tables;
Mysql>select * from City;

Hive terminal:
Linux> hive
Hive> select * from City;
Hive> describe City;

Sqoop terminal:(Sqoop import -- hive table )

$ sqoop import --connect jdbc:mysql://localhost/world --table city -m1 --username root -P --hive-import

Important objectives on sqoop:
DATA INGESTION with sqoop
1. Import data from a table in a relational database into HDFS
2. Import the results of a query from a relational database into HDFS
3. Import a table from relational database into a new or existing Hive
     Table
4. Insert or update data from HDFS into a table in a relational
     Database
   
Code:
1. Import data from a table in a relational database into HDFS
Syntax:
$ sqoop import \
    --connect jdbc:rdbmsname://localhost/dbname \
    --table tablename -m 1 --username root -P

2. Import the results of a query from a relational database into HDFS
Syntax:
$ sqoop import \
    --connect jdbc:rdbmsname://localhost/dbname \
    --query "query where \$CONDITIONS" \
    --target-dir /targetdir -m 1 --username root -P

3. Import a table from a relational database into a new or existing Hive
     table
Syntax:
$ sqoop import \
    --connect jdbc:rdbmsname://localhost/dbname \
    --table tablename -m 1 --hive-import

4. Insert or update data from HDFS into a table in a relational
     database
Syntax:
$ sqoop export \
    --connect jdbc:rdbmsname://localhost/dbname \
    --table tablename \
    --export-dir /dbname/tablename --username root -P


                 ---------------------------***practice makes perfect *** do more practice***------------------

sqoop import --connect jdbc:teradata://{host name}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
  --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 \
  --check-column modified_date --incremental lastmodified --last-value {last_import_date}

sqoop import --connect jdbc:teradata://{host name}/Database=retail \
  --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc \
  --password dbc --target-dir /user/hive/incremental_table -m 1 \
  --query 'select * from SOURCE_TBL where modified_date > {last_import_date} AND $CONDITIONS'

Reference :
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html

Managed table:
sqoop create-hive-table 

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_create_hive_table_literal
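A minimal sketch of the command (assuming the same world database and root/cloudera credentials used above; it only creates the Hive table definition from the RDBMS metadata, it does not copy any data):

sqoop create-hive-table \
  --connect jdbc:mysql://localhost/world \
  --username root -P \
  --table city \
  --hive-table city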



sqoop list-databases \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera

sqoop list-tables --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera

sqoop eval \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --query "select count(1) from City"

sqoop import \
  -m 12 \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --table City \
  --as-avrodatafile \
  --warehouse-dir=/user/hive/warehouse/retail_stage.db

--Default
sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --table City \
  --as-textfile \
  --target-dir=/user/cloudera/departments

sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --table City \
  --as-sequencefile \
  --target-dir=/user/cloudera/City

sqoop import \
  --connect "jdbc:mysql://localhost/world" \
  --username root \
  --password cloudera \
  --table City \
  --as-avrodatafile \
  --target-dir=/user/cloudera/departments

-- A file with extension avsc will be created under the directory from which sqoop import is executed
-- Copy avsc file to HDFS location
-- Create hive table with LOCATION to /user/cloudera/departments and TBLPROPERTIES pointing to avsc file
hadoop fs -put sqoop_import_departments.avsc /user/cloudera

CREATE EXTERNAL TABLE departments
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///user/cloudera/departments'
TBLPROPERTIES ('avro.schema.url'='hdfs://quickstart.cloudera/user/cloudera/sqoop_import_departments.avsc');


-- It will create tables in default database in hive
-- Using snappy compression
-- As we have imported all tables before make sure you drop the directories
-- Launch hive drop all tables
drop table departments;
drop table categories;
drop table products;
drop table orders;
drop table order_items;
drop table customers;

-- Dropping directories, in case your hive database/tables are in an inconsistent state
hadoop fs -rm -R /user/hive/warehouse/departments
hadoop fs -rm -R /user/hive/warehouse/categories
hadoop fs -rm -R /user/hive/warehouse/products
hadoop fs -rm -R /user/hive/warehouse/orders 
hadoop fs -rm -R /user/hive/warehouse/order_items
hadoop fs -rm -R /user/hive/warehouse/customers

sqoop import-all-tables \
  --num-mappers 1 \
  --connect "jdbc:mysql://localhost/world" \
  --username=root \
  --password=cloudera \
  --hive-import \
  --hive-overwrite \
  --create-hive-table \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --outdir java_files

sudo -u hdfs hadoop fs -mkdir /user/cloudera/retail_stage
sudo -u hdfs hadoop fs -chmod +rw /user/cloudera/retail_stage
hadoop fs -copyFromLocal ~/*.avsc /user/cloudera/retail_stage

-- Basic import
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/cloudera/departments 

-- Boundary Query and columns
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 2 \
  --boundary-query "select 2, 8 from departments limit 1" \
  --columns department_id,department_name

-- query and split-by
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --query="select * from orders join order_items on orders.order_id = order_items.order_item_order_id where \$CONDITIONS" \
  --target-dir /user/cloudera/order_join \
  --split-by order_id \
  --num-mappers 4

-- Copying into existing table or directory (append)
-- Customizing number of threads (num-mappers)
-- Changing delimiter
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --num-mappers 1 \
  --outdir java_files

-- Importing a table without a primary key using multiple threads (split-by)
-- When using split-by, an indexed column is highly desired
-- If the column is not indexed then performance will be bad
-- because of a full table scan by each thread
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --split-by department_id \
  --outdir java_files

-- Getting delta (--where)
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --split-by department_id \
  --where "department_id > 7" \
  --outdir java_files

-- Incremental load
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --check-column "department_id" \
  --incremental append \
  --last-value 7 \
  --outdir java_files

sqoop job --create sqoop_job \
  -- import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --target-dir /user/hive/warehouse/retail_ods.db/departments \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --check-column "department_id" \
  --incremental append \
  --last-value 7 \
  --outdir java_files

sqoop job --list

sqoop job --show sqoop_job

sqoop job --exec sqoop_job

-- Hive related
-- Overwrite existing data associated with hive table (hive-overwrite)
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --hive-home /user/hive/warehouse/retail_ods.db \
  --hive-import \
  --hive-overwrite \
  --hive-table departments \
  --outdir java_files

--Create hive table example
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  --hive-home /user/hive/warehouse \
  --hive-import \
  --hive-table departments_test \
  --create-hive-table \
  --outdir java_files

--Connect to mysql and create database for reporting database
--user:root, password:cloudera
mysql -u root -p
create database retail_rpt_db;
grant all on retail_rpt_db.* to retail_dba;
flush privileges;
use retail_rpt_db;
create table departments as select * from retail_db.departments where 1=2;
exit;

--For certification change database name retail_rpt_db to retail_db
sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_rpt_db" \
       --username retail_dba \
       --password cloudera \
       --table departments \
       --export-dir /user/hive/warehouse/retail_ods.db/departments \
       --input-fields-terminated-by '|' \
       --input-lines-terminated-by '\n' \
       --num-mappers 2 \
       --batch \
       --outdir java_files

sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --table departments \
  --export-dir /user/cloudera/sqoop_import/departments_export \
  --batch \
  --outdir java_files \
  -m 1 \
  --update-key department_id \
  --update-mode allowinsert

sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --table departments_test \
  --export-dir /user/hive/warehouse/departments_test \
  --input-fields-terminated-by '\001' \
  --input-lines-terminated-by '\n' \
  --num-mappers 2 \
  --batch \
  --outdir java_files \
  --input-null-string nvl \
  --input-null-non-string -1

--Merge process begins
hadoop fs -mkdir /user/cloudera/sqoop_merge

--Initial load
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --as-textfile \
  --target-dir=/user/cloudera/sqoop_merge/departments

--Validate
sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --query "select * from departments" 

hadoop fs -cat /user/cloudera/sqoop_merge/departments/part*

--update
sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --query "update departments set department_name='Testing Merge' where department_id = 9000"

--Insert
sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --query "insert into departments values (10000, 'Inserting for merge')"

sqoop eval --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username retail_dba \
  --password cloudera \
  --query "select * from departments"

--New load
sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --table departments \
  --as-textfile \
  --target-dir=/user/cloudera/sqoop_merge/departments_delta \
  --where "department_id >= 9000"

hadoop fs -cat /user/cloudera/sqoop_merge/departments_delta/part*

--Merge
sqoop merge --merge-key department_id \
  --new-data /user/cloudera/sqoop_merge/departments_delta \
  --onto /user/cloudera/sqoop_merge/departments \
  --target-dir /user/cloudera/sqoop_merge/departments_stage \
  --class-name departments \
  --jar-file

hadoop fs -cat /user/cloudera/sqoop_merge/departments_stage/part*

--Delete old directory
hadoop fs -rm -R /user/cloudera/sqoop_merge/departments

--Move/rename stage directory to original directory
hadoop fs -mv /user/cloudera/sqoop_merge/departments_stage /user/cloudera/sqoop_merge/departments 

--Validate that original directory have merged data
hadoop fs -cat /user/cloudera/sqoop_merge/departments/part*

--Merge process ends


sqoop import --connect "jdbc:mysql://localhost/world" --username=root --password=cloudera --query="select City.*,CountryLanguage.Percentage,CountryLanguage.IsOfficial from City join CountryLanguage on City.CountryCode = CountryLanguage.CountryCode where City.ID=4079 and \$CONDITIONS" --target-dir /user/cloudera/order_join -m 1


Friday, February 14, 2020

HBase basic operations

hbase shell

create 'blog', 'info', 'content'
list
put 'blog', '20130320162535', 'info:title', 'Why use HBase?'
put 'blog', '20130320162535', 'info:author', 'Jane Doe'
put 'blog', '20130320162535', 'info:category', 'Persistence'
put 'blog', '20130320162535', 'content:pg', 'HBase is a column-oriented...'
put 'blog', '20130320162535', 'content:pg1', 'HBase is a column-oriented...'


get 'blog', '20130320162535'
scan 'blog', { STARTROW => '20130300', STOPROW => '20130400' }

delete 'blog', '20130320162535', 'info:category'
get 'blog', '20130320162535'

scan 'blog', { STARTROW => '20130300', STOPROW => '20130500', COLUMNS => 'info:title' }


disable 'blog'
drop 'emp'

exists 'emp'

create 'city','cityinfo','otherdetails'

sqoop import \
    --connect jdbc:mysql://localhost/world \
    --username root -P \
    --table City \
    --columns "id,name,countryCode" \
    --hbase-table city \
    --column-family cityinfo \
    --hbase-row-key id -m 1
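To verify the import, the new rows can be checked from the HBase shell (a quick sanity check, assuming the import above succeeded; the row keys are the City IDs):

scan 'city', { LIMIT => 5 }
get 'city', '1'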

Wednesday, February 12, 2020

Hive Brushup

Hive Queries Vs Dataframe Queries - Part 1
hadoop@hadoop:~/Desktop/vow$ touch emp.txt
hadoop@hadoop:~/Desktop/vow$ atom emp.txt

101,Sathya,1000
102,Shanthi,2000
103,Mani,3000
104,Kalai,4000
105,Aruvi,5000
106,Nila,1500
107,Praveen,2500
108,Rashee,7500
109,Pinki,3500
110,Ravi,2500

pwd : /home/hadoop/Desktop/vow


hive> create database learning;
OK
Time taken: 0.901 seconds

hive> use learning;
OK
Time taken: 0.08 seconds



hive> create external table emp(id int, name varchar(50), salary int) row format delimited fields terminated by ',';

hive> load data local inpath '/home/hadoop/Desktop/vow/emp.txt' into table emp;


hive> select * from emp;
OK
101 Sathya 1000
102 Shanthi 2000
103 Mani 3000
104 Kalai 4000
105 Aruvi 5000
106 Nila 1500
107 Praveen 2500
108 Rashee 7500
109 Pinki 3500
110 Ravi 2500
Time taken: 0.305 seconds, Fetched: 10 row(s)


scala> val empSchema = StructType(StructField("id",IntegerType,true)::StructField("name",StringType,true)::StructField("salary",IntegerType,true)::Nil)
empSchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(salary,IntegerType,true))

scala> val df = spark.read.format("csv").option("header","false").schema(empSchema).load("/home/hadoop/Desktop/vow/emp.txt");
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> df.printSchema
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: integer (nullable = true)


hive> select * from emp;
OK
101 Sathya 1000
102 Shanthi 2000
103 Mani 3000
104 Kalai 4000
105 Aruvi 5000
106 Nila 1500
107 Praveen 2500
108 Rashee 7500
109 Pinki 3500
110 Ravi 2500

scala> df.show
+---+-------+------+
| id|   name|salary|
+---+-------+------+
|101| Sathya|  1000|
|102|Shanthi|  2000|
|103|   Mani|  3000|
|104|  Kalai|  4000|
|105|  Aruvi|  5000|
|106|   Nila|  1500|
|107|Praveen|  2500|
|108| Rashee|  7500|
|109|  Pinki|  3500|
|110|   Ravi|  2500|
+---+-------+------+


hive> select max(salary) from emp;
7500

scala> df.select(max(df("salary")) as "Salary").show
or
scala> df.select(max($"salary") as "Salary").show

+------+
|Salary|
+------+
|  7500|
+------+

hive> select max(salary),min(salary) from emp;
OK
7500 1000

scala> df.select(max(df("salary")) as "MaxSal", min(df("salary")) as "MinSal").show
df.select(max($"salary") as "MaxSal",min($"salary") as "MinSal").show
+------+------+
|MaxSal|MinSal|
+------+------+
|  7500|  1000|
+------+------+


hive> select salary from emp order by salary;
OK
1000
1500
2000
2500
2500
3000
3500
4000
5000
7500

scala> df.select(df("salary")).orderBy("salary").show
df.select($"salary").orderBy($"salary").show
+------+
|salary|
+------+
|  1000|
|  1500|
|  2000|
|  2500|
|  2500|
|  3000|
|  3500|
|  4000|
|  5000|
|  7500|
+------+

hive> select salary from emp order by salary desc;
OK
7500
5000
4000
3500
3000
2500
2500
2000
1500
1000

import org.apache.spark.sql.functions._

scala> df.select(df("salary")).orderBy(desc("salary")).show
or
scala> df.select($"salary").orderBy($"salary".desc).show
+------+
|salary|
+------+
|  7500|
|  5000|
|  4000|
|  3500|
|  3000|
|  2500|
|  2500|
|  2000|
|  1500|
|  1000|
+------+



hive> select sum(salary) from emp;
OK
32500


scala> df.select(sum("salary") as "Sum").show
or
scala> df.select(sum($"salary") as "Sum").show

+-----+
|  Sum|
+-----+
|32500|
+-----+



hadoop@hadoop:~$ touch emp.txt
hadoop@hadoop:~$ atom emp.txt

id,name,gender,salary,deptid
100,Ravi,m,1000,10
101,Rani,f,2000,11
102,Suresh,m,3000,12
103,Rahul,m,1250,10
104,Rashee,f,3500,11
105,Priya,f,3600,12
106,Ayeesha,f,4000,10
107,Aruvi,f,2500,11
108,Arushi,f,2800,12
109,Vinay,m,3200,10
110,Kalai,f,2550,11
111,Shilpa,f,2600,12

hadoop@hadoop:~$ touch dept.txt
hadoop@hadoop:~$ atom dept.txt

deptid,deptname
10,Marketing
11,Sales
12,Production

//  tblproperties("skip.header.line.count"="1");  --> which skips the header line

hive> create external table emp(id int, name varchar(50),gender char(1), salary int, deptid int) row format delimited fields terminated by ',' tblproperties("skip.header.line.count"="1");


hive> load data local inpath "/home/hadoop/Desktop/vow/emp.txt" into table emp;





hive> create external table dept(deptid int, deptname varchar(50)) row format delimited fields terminated by ',' tblproperties("skip.header.line.count"="1");

hive> load data local inpath "/home/hadoop/Desktop/vow/dept.txt" into table dept;




scala> val empSchema = StructType(StructField("id",IntegerType,true)::StructField("name",StringType,true)::StructField("gender",StringType,true)::StructField("salary",IntegerType,true)::StructField("deptid",IntegerType,true)::Nil)
empSchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true), StructField(gender,StringType,true), StructField(salary,IntegerType,true), StructField(deptid,IntegerType,true))

scala> val deptSchema = StructType(StructField("deptid",IntegerType,true)::StructField("deptname",StringType,true)::Nil)
deptSchema: org.apache.spark.sql.types.StructType = StructType(StructField(deptid,IntegerType,true), StructField(deptname,StringType,true))


 val dfEmp  = spark.read.format("csv").option("header","true").schema(empSchema).load("/home/hadoop/Desktop/vow/emp.txt");

 val dfDept = spark.read.format("csv").option("header","true").schema(deptSchema).load("/home/hadoop/Desktop/vow/dept.txt");

 hive> select * from emp;
OK
100 Ravi m 1000 10
101 Rani f 2000 11
102 Suresh m 3000 12
103 Rahul m 1250 10
104 Rashee f 3500 11
105 Priya f 3600 12
106 Ayeesha f 4000 10
107 Aruvi f 2500 11
108 Arushi f 2800 12
109 Vinay m 3200 10
110 Kalai f 2550 11
111 Shilpa f 2600 12

 scala> dfEmp.show
 or
 scala> dfEmp.select("*").show

+---+-------+------+------+------+
| id|   name|gender|salary|deptid|
+---+-------+------+------+------+
|100|   Ravi|     m|  1000|    10|
|101|   Rani|     f|  2000|    11|
|102| Suresh|     m|  3000|    12|
|103|  Rahul|     m|  1250|    10|
|104| Rashee|     f|  3500|    11|
|105|  Priya|     f|  3600|    12|
|106|Ayeesha|     f|  4000|    10|
|107|  Aruvi|     f|  2500|    11|
|108| Arushi|     f|  2800|    12|
|109|  Vinay|     m|  3200|    10|
|110|  Kalai|     f|  2550|    11|
|111| Shilpa|     f|  2600|    12|
+---+-------+------+------+------+

hive> select * from dept;
OK
10 Marketing
11 Sales
12 Production
Time taken: 0.238 seconds, Fetched: 3 row(s)

scala> dfDept.show
or
scala> dfDept.select("*").show

+------+----------+
|deptid|  deptname|
+------+----------+
|    10| Marketing|
|    11|     Sales|
|    12|Production|
+------+----------+


scala> dfEmp.select(max($"salary") as "MaxSal").show
+------+
|MaxSal|
+------+
|  4000|
+------+

hive> select max(salary) from emp;
4000


scala> dfEmp.select(min($"salary") as "MinSal").show
+------+
|MinSal|
+------+
|  1000|
+------+

hive> select min(salary) from emp;
1000


hive> select max(salary) as MaxSal, min(salary) as MinSal from emp;
4000 1000

scala> dfEmp.select(max("salary") as "MaxSal",min("salary") as "MinSal").show
+------+------+
|MaxSal|MinSal|
+------+------+
|  4000|  1000|
+------+------+


hive> select deptid,max(salary) from emp group by deptid order by deptid;
10 4000
11 3500
12 3600



scala> dfEmp.groupBy("deptid").agg(max("salary") as "maxSal").orderBy("deptid").show
+------+------+                                                             
|deptid|maxSal|
+------+------+
|    10|  4000|
|    11|  3500|
|    12|  3600|
+------+------+


hive> select deptid,count(name) from emp group by deptid order by deptid;
10 4
11 4
12 4

scala> dfEmp.groupBy("deptid").agg(count("name") as "nameCount").orderBy("deptid").show
+------+---------+                                                         
|deptid|nameCount|
+------+---------+
|    10|        4|
|    11|        4|
|    12|        4|
+------+---------+




 scala> dfEmp.select($"salary").orderBy("salary").show
 or
 scala> dfEmp.select(dfEmp("salary")).orderBy("salary").show

+------+
|salary|
+------+
|  1000|
|  1250|
|  2000|
|  2500|
|  2550|
|  2600|
|  2800|
|  3000|
|  3200|
|  3500|
|  3600|
|  4000|
+------+

select salary from emp order by salary
 1000
 1250
 2000
 2500
 2550
 2600
 2800
 3000
 3200
 3500
 3600
 4000


scala> dfEmp.select($"salary").orderBy(desc("salary")).show
or
dfEmp.select("salary").orderBy(desc("salary")).show
+------+
|salary|
+------+
|  4000|
|  3600|
|  3500|
|  3200|
|  3000|
|  2800|
|  2600|
|  2550|
|  2500|
|  2000|
|  1250|
|  1000|
+------+

hive> select salary from emp order by salary desc;
4000
3600
3500
3200
3000
2800
2600
2550
2500
2000
1250
1000



hive> select gender,max(salary) from emp group by gender order by gender;
f 4000
m 3200

scala> dfEmp.groupBy("gender").agg(max("salary") as "maxSal").orderBy("gender").show
+------+------+                                                             
|gender|maxSal|
+------+------+
|     f|  4000|
|     m|  3200|
+------+------+


hive> select gender,sum(salary) from emp group by gender order by gender;
f 23550
m 8450

scala> dfEmp.groupBy("gender").agg(sum("salary") as "GenderSumSal").orderBy("gender").show
+------+------------+                                                       
|gender|GenderSumSal|
+------+------------+
|     f|       23550|
|     m|        8450|
+------+------------+





hive> select * from emp order by salary desc;
OK
106 Ayeesha f 4000 10
105 Priya f 3600 12
104 Rashee f 3500 11
109 Vinay m 3200 10
102 Suresh m 3000 12
108 Arushi f 2800 12
111 Shilpa f 2600 12
110 Kalai f 2550 11
107 Aruvi f 2500 11
101 Rani f 2000 11
103 Rahul m 1250 10
100 Ravi m 1000 10


scala> dfEmp.orderBy(desc("salary")).show
+---+-------+------+------+------+
| id|   name|gender|salary|deptid|
+---+-------+------+------+------+
|106|Ayeesha|     f|  4000|    10|
|105|  Priya|     f|  3600|    12|
|104| Rashee|     f|  3500|    11|
|109|  Vinay|     m|  3200|    10|
|102| Suresh|     m|  3000|    12|
|108| Arushi|     f|  2800|    12|
|111| Shilpa|     f|  2600|    12|
|110|  Kalai|     f|  2550|    11|
|107|  Aruvi|     f|  2500|    11|
|101|   Rani|     f|  2000|    11|
|103|  Rahul|     m|  1250|    10|
|100|   Ravi|     m|  1000|    10|
+---+-------+------+------+------+




hive> select * from emp order by salary desc limit 2;
OK
106 Ayeesha f 4000 10
105 Priya f 3600 12


scala> dfEmp.orderBy(desc("salary")).show(2);
+---+-------+------+------+------+
| id|   name|gender|salary|deptid|
+---+-------+------+------+------+
|106|Ayeesha|     f|  4000|    10|
|105|  Priya|     f|  3600|    12|
+---+-------+------+------+------+
only showing top 2 rows



// top salaried person
hive> select * from emp order by salary desc limit 1;
106 Ayeesha f 4000 10


scala> dfEmp.orderBy(desc("salary")).show(1);
+---+-------+------+------+------+
| id|   name|gender|salary|deptid|
+---+-------+------+------+------+
|106|Ayeesha|     f|  4000|    10|
+---+-------+------+------+------+
only showing top 1 row






//extract single value (scalar) from dataframe
scala> val x:Int = dfEmp.agg(max("salary")).head().getInt(0)
x: Int = 4000



scala> dfEmp.orderBy(desc("salary")).show
+---+-------+------+------+------+
| id|   name|gender|salary|deptid|
+---+-------+------+------+------+
|106|Ayeesha|     f|  4000|    10|
|105|  Priya|     f|  3600|    12|
|104| Rashee|     f|  3500|    11|
|109|  Vinay|     m|  3200|    10|
|102| Suresh|     m|  3000|    12|
|108| Arushi|     f|  2800|    12|
|111| Shilpa|     f|  2600|    12|
|110|  Kalai|     f|  2550|    11|
|107|  Aruvi|     f|  2500|    11|
|101|   Rani|     f|  2000|    11|
|103|  Rahul|     m|  1250|    10|
|100|   Ravi|     m|  1000|    10|
+---+-------+------+------+------+


scala> dfEmp.where($"salary" < dfEmp.agg(max("salary")).first().getInt(0)).orderBy(desc("salary")).show(1)
+---+-----+------+------+------+
| id| name|gender|salary|deptid|
+---+-----+------+------+------+
|105|Priya|     f|  3600|    12|
+---+-----+------+------+------+
only showing top 1 row

// 2nd maximum salaried person
hive> select * from emp where salary not in (select max(salary) from emp ) order by salary desc limit 1;
105 Priya f 3600 12

hive> select * from (select * from emp sort by salary desc limit 2) result sort by salary limit 1;

105 Priya f 3600 12

scala> dfEmp.orderBy(desc("Salary")).limit(2).orderBy("salary").show(1);
+---+-----+------+------+------+
| id| name|gender|salary|deptid|
+---+-----+------+------+------+
|105|Priya|     f|  3600|    12|
+---+-----+------+------+------+
only showing top 1 row

scala> dfEmp.orderBy(desc("Salary")).take(2)
res87: Array[org.apache.spark.sql.Row] = Array([106,Ayeesha,f,4000,10], [105,Priya,f,3600,12])

scala> dfEmp.orderBy(desc("Salary")).take(2)(1);
res91: org.apache.spark.sql.Row = [105,Priya,f,3600,12]


Experimenting with ORC file - Parsing Yahoo Stocks using Spark with Scala
Input
yahoo_stocks.csv:
------------------
Date,Open,High,Low,Close,Volume,Adj Close
2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34
2015-04-27,44.65,45.10,44.25,44.36,10840900,44.36
2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52
2015-04-23,43.92,44.06,43.58,43.70,14274900,43.70
2015-04-22,44.58,44.85,43.67,43.98,32241200,43.98
2015-04-21,45.15,45.18,44.45,44.49,16103700,44.49
2015-04-20,44.73,44.91,44.41,44.66,10052900,44.66
2015-04-17,45.30,45.44,44.25,44.45,13305700,44.45
2015-04-16,45.82,46.13,45.53,45.78,13800300,45.78
2015-04-15,45.46,45.83,45.23,45.73,15033500,45.73



hadoop@hadoop:~/Desktop/vow$ hdfs dfs -copyFromLocal yahoo_stocks.csv /user/

hadoop@hadoop:~/Desktop/vow$ hdfs dfs -head /user/yahoo_stocks.csv

Date,Open,High,Low,Close,Volume,Adj Close
2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34
2015-04-27,44.65,45.10,44.25,44.36,10840900,44.36
2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52
2015-04-23,43.92,44.06,43.58,43.70,14274900,43.70
2015-04-22,44.58,44.85,43.67,43.98,32241200,43.98
2015-04-21,45.15,45.18,44.45,44.49,16103700,44.49


scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext

scala> import org.apache.spark.sql.types
import org.apache.spark.sql.types

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> import spark.implicits._
import spark.implicits._

scala> spark.sql("CREATE TABLE yahoo_orc_table (date STRING, open_price FLOAT, high_price FLOAT, low_price FLOAT, close_price FLOAT, volume INT, adj_price FLOAT) stored as orc")
2019-02-01 20:59:04 WARN  HiveMetaStore:1383 - Location: hdfs://localhost:9000/user/hive/warehouse/yahoo_orc_table specified for non-external table:yahoo_orc_table
res39: org.apache.spark.sql.DataFrame = []

scala> val stocks = sc.textFile("hdfs://localhost:9000/user/yahoo_stocks.csv")
stocks: org.apache.spark.rdd.RDD[String] = hdfs://localhost:9000/user/yahoo_stocks.csv MapPartitionsRDD[109] at textFile at :42

scala> stocks.take(5).foreach(println)
Date,Open,High,Low,Close,Volume,Adj Close
2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34
2015-04-27,44.65,45.10,44.25,44.36,10840900,44.36
2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52
2015-04-23,43.92,44.06,43.58,43.70,14274900,43.70

scala> val header = stocks.first
header: String = Date,Open,High,Low,Close,Volume,Adj Close

scala> val rddStocksWithoutHeader = stocks.filter(x => x != header)
rddStocksWithoutHeader: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[110] at filter at :45

scala> rddStocksWithoutHeader.take(5).foreach(println)
2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34
2015-04-27,44.65,45.10,44.25,44.36,10840900,44.36
2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52
2015-04-23,43.92,44.06,43.58,43.70,14274900,43.70
2015-04-22,44.58,44.85,43.67,43.98,32241200,43.98



scala> case class YahooStockPrice (date:String, open:Float, high:Float,low:Float,close:Float,volume:Integer,adjcClose:Float)
defined class YahooStockPrice


 scala> val stockpricewithSchema = rddStocksWithoutHeader.map (x => {
     |       val fields = x.split(",")
     | val date = fields(0)
     | val open = fields(1).trim.toFloat
     | val high = fields(2).trim.toFloat
     | val low = fields(3).trim.toFloat
     | val close = fields(4).trim.toFloat
     | val volume = fields(5).trim.toInt
     | val adjClose = fields(6).trim.toFloat
     | YahooStockPrice(date,open,high,low,close,volume,adjClose)
     | })
stockpricewithSchema: org.apache.spark.rdd.RDD[YahooStockPrice] = MapPartitionsRDD[111] at map at :45

scala> stockpricewithSchema.take(5).foreach(println)
YahooStockPrice(2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34)
YahooStockPrice(2015-04-27,44.65,45.1,44.25,44.36,10840900,44.36)
YahooStockPrice(2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52)
YahooStockPrice(2015-04-23,43.92,44.06,43.58,43.7,14274900,43.7)
YahooStockPrice(2015-04-22,44.58,44.85,43.67,43.98,32241200,43.98)

scala> val dfStockPrice = stockpricewithSchema.toDF
dfStockPrice: org.apache.spark.sql.DataFrame = [date: string, open: float ... 5 more fields]

scala> dfStockPrice.printSchema
root
 |-- date: string (nullable = true)
 |-- open: float (nullable = false)
 |-- high: float (nullable = false)
 |-- low: float (nullable = false)
 |-- close: float (nullable = false)
 |-- volume: integer (nullable = true)
 |-- adjcClose: float (nullable = false)


scala> dfStockPrice.show(5)
+----------+-----+-----+-----+-----+--------+---------+
|      date| open| high|  low|close|  volume|adjcClose|
+----------+-----+-----+-----+-----+--------+---------+
|2015-04-28|44.34|44.57|43.94|44.34| 7188300|    44.34|
|2015-04-27|44.65| 45.1|44.25|44.36|10840900|    44.36|
|2015-04-24|43.73|44.71|43.69|44.52|11267500|    44.52|
|2015-04-23|43.92|44.06|43.58| 43.7|14274900|     43.7|
|2015-04-22|44.58|44.85|43.67|43.98|32241200|    43.98|
+----------+-----+-----+-----+-----+--------+---------+
only showing top 5 rows

scala> dfStockPrice.createOrReplaceTempView("ystemp")

scala> val results = spark.sql("SELECT * FROM ystemp limit 10")
results: org.apache.spark.sql.DataFrame = [date: string, open: float ... 5 more fields]

scala> results.show
+----------+-----+-----+-----+-----+--------+---------+
|      date| open| high|  low|close|  volume|adjcClose|
+----------+-----+-----+-----+-----+--------+---------+
|2015-04-28|44.34|44.57|43.94|44.34| 7188300|    44.34|
|2015-04-27|44.65| 45.1|44.25|44.36|10840900|    44.36|
|2015-04-24|43.73|44.71|43.69|44.52|11267500|    44.52|
|2015-04-23|43.92|44.06|43.58| 43.7|14274900|     43.7|
|2015-04-22|44.58|44.85|43.67|43.98|32241200|    43.98|
|2015-04-21|45.15|45.18|44.45|44.49|16103700|    44.49|
|2015-04-20|44.73|44.91|44.41|44.66|10052900|    44.66|
|2015-04-17| 45.3|45.44|44.25|44.45|13305700|    44.45|
|2015-04-16|45.82|46.13|45.53|45.78|13800300|    45.78|
|2015-04-15|45.46|45.83|45.23|45.73|15033500|    45.73|
+----------+-----+-----+-----+-----+--------+---------+


scala> results.map(t => "Stock Entry: " + t.toString).collect().foreach(println)
Stock Entry: [2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34]
Stock Entry: [2015-04-27,44.65,45.1,44.25,44.36,10840900,44.36]
Stock Entry: [2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52]
Stock Entry: [2015-04-23,43.92,44.06,43.58,43.7,14274900,43.7]
Stock Entry: [2015-04-22,44.58,44.85,43.67,43.98,32241200,43.98]
Stock Entry: [2015-04-21,45.15,45.18,44.45,44.49,16103700,44.49]
Stock Entry: [2015-04-20,44.73,44.91,44.41,44.66,10052900,44.66]
Stock Entry: [2015-04-17,45.3,45.44,44.25,44.45,13305700,44.45]
Stock Entry: [2015-04-16,45.82,46.13,45.53,45.78,13800300,45.78]
Stock Entry: [2015-04-15,45.46,45.83,45.23,45.73,15033500,45.73]



scala> val results = spark.sql("SELECT * FROM ystemp")
results: org.apache.spark.sql.DataFrame = [date: string, open: float ... 5 more fields]

scala> results.write.format("orc").save("yahoo_stocks_orc")

scala> val yahoo_stocks_orc = spark.read.format("orc").load("yahoo_stocks_orc")
yahoo_stocks_orc: org.apache.spark.sql.DataFrame = [date: string, open: float ... 5 more fields]

scala> yahoo_stocks_orc.createOrReplaceTempView("orcTest")

scala> spark.sql("SELECT * from orcTest").collect.take(3).foreach(println)
[2015-04-28,44.34,44.57,43.94,44.34,7188300,44.34]
[2015-04-27,44.65,45.1,44.25,44.36,10840900,44.36]
[2015-04-24,43.73,44.71,43.69,44.52,11267500,44.52]
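The yahoo_orc_table created at the top of this section could also be populated straight from the temp view instead of writing to a separate path; a sketch (same spark-shell session, assuming the table and the ystemp view still exist):

scala> spark.sql("INSERT INTO TABLE yahoo_orc_table SELECT * FROM ystemp")
scala> spark.sql("SELECT count(*) FROM yahoo_orc_table").show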



Hive and Spark Integration
Integrating Hive with Spark
//Start Hive and create a new Database : School and Create a new Table : Student and add 3 records

hive> show databases;
OK
default
Time taken: 0.832 seconds, Fetched: 1 row(s)
hive> create database School ;
OK
Time taken: 0.343 seconds
hive> use School;
OK
Time taken: 0.045 seconds
hive> create table Student(id int, name varchar(50));
OK
Time taken: 0.685 seconds
hive> insert into Student (id,name) values(101,'Sankar');
insert into Student (id,name) values(102,"Zee");
insert into Student (id,name) values(103,"Maha");

hive> select * from Student;
OK
101 Sankar
102 Zee
103 Maha
Time taken: 0.261 seconds, Fetched: 3 row(s)


// Start Spark and do the following to access Hive Database (School), and Hive Table (Student)
scala> spark.sql("use School")
scala> spark.sql("select * from Student").show()
+---+------+                                                                   
| id|  name|
+---+------+
|101|Sankar|
|102|   Zee|
|103|  Maha|
+---+------+

How to fix metastore issue in Hive and MySQL?
Run mysql:
sudo mysql
mysql> drop database metastore;

 $ schematool -dbType mysql -initSchema 
 $ schematool -dbType mysql -info

Spark RDD to Dataframe To Hive Table (Spark-SQL with Hive integration)
//Hide log messages, warning messages in spark-shell
import org.apache.log4j.Logger
import org.apache.log4j.Level

Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)

Input file
weblog.txt:
-------------
3.94.78.5 - 69827    [15/Sep/2013:23:58:36 +0100] "GET /KBDOC-00033.html HTTP/1.0"
19.33.140.62 - 21475 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
19.31.140.62 - 2489 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
3.91.78.5 - 69827    [15/Sep/2013:23:58:36 +0100] "GET /KBDOC-00033.html HTTP/1.0"
19.32.140.62 - 2489 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
19.35.140.62 - 2489 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
3.93.78.5 - 69827    [15/Sep/2013:23:58:36 +0100] "GET /KBDOC-00033.html HTTP/1.0"
12.38.140.62 - 2489 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
12.38.140.62 - 4712 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
13.94.78.5 - 69827    [15/Sep/2013:23:58:36 +0100] "GET /KBDOC-00033.html HTTP/1.0"
11.38.140.62 - 4712 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
12.38.140.62 - 4712 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"


scala> val rdd1 = sc.textFile("/home/hadoop/Desktop/weblog.txt")
rdd1: org.apache.spark.rdd.RDD[String] = /home/hadoop/Desktop/weblog.txt MapPartitionsRDD[50] at textFile at :29

scala> rdd1.count
res15: Long = 12

scala> rdd1.take(5).foreach(println)
3.94.78.5 - 69827    [15/Sep/2013:23:58:36 +0100] "GET /KBDOC-00033.html HTTP/1.0"
19.33.140.62 - 21475 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
19.31.140.62 - 2489 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"
3.91.78.5 - 69827    [15/Sep/2013:23:58:36 +0100] "GET /KBDOC-00033.html HTTP/1.0"
19.32.140.62 - 2489 [15/Sep/2013:23:58:34 +0100] "GET /KBDOC-00033.html HTTP/1.0"


scala> val ip_Pattern = "[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}".r
ip_Pattern: scala.util.matching.Regex = [0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}

scala> val port_Pattern =" - [0-9]{1,5}".r
port_Pattern: scala.util.matching.Regex =  - [0-9]{1,5}

scala> val time_Pattern = "[0-9]{1,2}/[A-Z][a-z]{2}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2} [+][0-9]{4}".r
time_Pattern: scala.util.matching.Regex = [0-9]{1,2}/[A-Z][a-z]{2}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2} [+][0-9]{4}




scala> val rdd2 = rdd1.map { x =>
             val ip  = ip_Pattern.findFirstIn(x).get
             val port = port_Pattern.findFirstIn(x).get
             val port1 = port.slice(2,port.length).trim().toInt
             val time = time_Pattern.findFirstIn(x).get
             (ip,port1,time)
             }
rdd2: org.apache.spark.rdd.RDD[(String, Int, String)] = MapPartitionsRDD[2] at map at :33


scala> rdd2.take(5).foreach(println)
(3.94.78.5,69827,15/Sep/2013:23:58:36 +0100)
(19.33.140.62,21475,15/Sep/2013:23:58:34 +0100)
(19.31.140.62,2489,15/Sep/2013:23:58:34 +0100)
(3.91.78.5,69827,15/Sep/2013:23:58:36 +0100)
(19.32.140.62,2489,15/Sep/2013:23:58:34 +0100)


scala> val df1 = rdd2.toDF("ipAddress","portNo","TimeStamp")
df1: org.apache.spark.sql.DataFrame = [ipAddress: string, portNo: int ... 1 more field]

scala> df1.show
+------------+------+--------------------+
|   ipAddress|portNo|           TimeStamp|
+------------+------+--------------------+
|   3.94.78.5| 69827|15/Sep/2013:23:58...|
|19.33.140.62| 21475|15/Sep/2013:23:58...|
|19.31.140.62|  2489|15/Sep/2013:23:58...|
|   3.91.78.5| 69827|15/Sep/2013:23:58...|
|19.32.140.62|  2489|15/Sep/2013:23:58...|
|19.35.140.62|  2489|15/Sep/2013:23:58...|
|   3.93.78.5| 69827|15/Sep/2013:23:58...|
|12.38.140.62|  2489|15/Sep/2013:23:58...|
|12.38.140.62|  4712|15/Sep/2013:23:58...|
|  13.94.78.5| 69827|15/Sep/2013:23:58...|
|11.38.140.62|  4712|15/Sep/2013:23:58...|
|12.38.140.62|  4712|15/Sep/2013:23:58...|
+------------+------+--------------------+


scala> val df2 = df1.select ("*").groupBy("ipAddress").agg(count("*"))
df2: org.apache.spark.sql.DataFrame = [ipAddress: string, count(1): bigint]

scala> df2.show
+------------+--------+                                                       
|   ipAddress|count(1)|
+------------+--------+
|12.38.140.62|       3|
|19.31.140.62|       1|
|   3.94.78.5|       1|
|   3.93.78.5|       1|
|19.35.140.62|       1|
|11.38.140.62|       1|
|19.33.140.62|       1|
|   3.91.78.5|       1|
|  13.94.78.5|       1|
|19.32.140.62|       1|
+------------+--------+

scala> val df2 = df1.select ("*").groupBy("ipAddress").agg(count("*") as "count")
df2: org.apache.spark.sql.DataFrame = [ipAddress: string, count: bigint]

scala> df2.show
+------------+-----+                                                         
|   ipAddress|count|
+------------+-----+
|12.38.140.62|    3|
|19.31.140.62|    1|
|   3.94.78.5|    1|
|   3.93.78.5|    1|
|19.35.140.62|    1|
|11.38.140.62|    1|
|19.33.140.62|    1|
|   3.91.78.5|    1|
|  13.94.78.5|    1|
|19.32.140.62|    1|
+------------+-----+

scala> df2.write.saveAsTable("joy.outstanding")

// joy is the database which is present in Hive                                                         
scala> spark.sql("use joy");
res6: org.apache.spark.sql.DataFrame = []

// fetching Hive table info within Spark
scala> spark.sql("select * from outstanding").show
+------------+-----+                                                           
|   ipAddress|count|
+------------+-----+
|12.38.140.62|    3|
|19.31.140.62|    1|
|   3.94.78.5|    1|
|   3.93.78.5|    1|
|19.35.140.62|    1|
|11.38.140.62|    1|
|19.33.140.62|    1|
|   3.91.78.5|    1|
|  13.94.78.5|    1|
|19.32.140.62|    1|
+------------+-----+

MySQL to Hive using SQOOP import-all-tables
[cloudera@quickstart ~]$ sqoop import-all-tables --num-mappers 1 --connect jdbc:mysql://localhost/world --username root --password cloudera --hive-import --hive-overwrite  --create-hive-table --compress  --compression-codec org.apache.hadoop.io.compress.SnappyCodec --outdir java_files;
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/city': User does not belong to supergroup

chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/country': User does not belong to supergroup


Loading data to table default.city
Loading data to table default.country


hive> show tables;
OK
city
country
Time taken: 0.013 seconds, Fetched: 2 row(s)

hive> select * from city limit 10;
OK
1 Kabul AFG Kabol 1780000
2 Qandahar AFG Qandahar 237500
3 Herat AFG Herat 186800
4 Mazar-e-Sharif AFG Balkh 127800
5 Amsterdam NLD Noord-Holland 731200
6 Rotterdam NLD Zuid-Holland 593321
7 Haag NLD Zuid-Holland 440900
8 Utrecht NLD Utrecht 234323
9 Eindhoven NLD Noord-Brabant 201843
10 Tilburg NLD Noord-Brabant 193238
Time taken: 0.614 seconds, Fetched: 10 row(s)


describe formatted city;
Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/city 

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/city
Found 2 items
-rwxrwxrwx   1 cloudera cloudera          0 2018-12-28 06:31 /user/hive/warehouse/city/_SUCCESS
-rwxrwxrwx   1 cloudera cloudera      93338 2018-12-28 06:31 /user/hive/warehouse/city/part-m-00000.snappy

Hive Notes - Part 3
Array:
  collection of homogeneous items
  qual = ['BTech','MTech','PhD']
    purpose of each element is same
  prices = [23.32,54.23,34.23]
  
 Struct:
  Heterogeneous collection of items (like a Tuple in Pig)
  info =
   purpose of each element is different
   
 Map:
  collection of Key, Value pairs
   qual =
   
 name -> string
 age  -> int
 sal  -> double
 city -> string
 wife -> struct
 qual -> array
 exp  -> map
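Putting these type mappings together, a single table could hold all of them. A sketch (emp_profile is a hypothetical table name; the delimiters match the examples later in these notes):

create table emp_profile(
  name string,
  age  int,
  sal  double,
  city string,
  wife struct<name:string,age:int,city:string>,
  qual array<string>,
  exp  map<string,int>
) row format delimited
  fields terminated by ','
  collection items terminated by '#'
  map keys terminated by '$';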
[cloudera@quickstart ~]$ cat > file1.txt
100,200,300
200,500,900
11,12,12
1,2,3


hive> show databases;
 OK
 default
 fbi_crime_db
 practice
 Time taken: 0.732 seconds, Fetched: 3 row(s)


hive> use practice;
OK
Time taken: 0.149 seconds

hive> show tables;
 OK
 couples
 infos
 profile1
 profile2
 profile3
 qtab
 res
 strans
 trans
 tras
 Time taken: 0.079 seconds, Fetched: 10 row(s)

 hive> describe profile1;
  OK
  id                   int                                   
  name                 string                                
  age                  int                                   
  qual                 array<string>                 
  city                 string                                
  Time taken: 0.177 seconds, Fetched: 5 row(s)
  
  
hive> describe formatted profile1;
 OK
 # col_name             data_type            comment           
   
 id                   int                                   
 name                 string                                
 age                  int                                   
 qual                 array<string>                 
 city                 string                                
   
 # Detailed Table Information   
 Database:            practice             
 Owner:               cloudera             
 CreateTime:          Fri Nov 16 19:24:59 PST 2018 
 LastAccessTime:      UNKNOWN              
 Protect Mode:        None                 
 Retention:           0                    
 Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/practice.db/profile1 
 Table Type:          MANAGED_TABLE        
 Table Parameters:   
  COLUMN_STATS_ACCURATE true             
  numFiles             1                 
  totalSize            62               
  transient_lastDdlTime 1542425232       
   
 # Storage Information   
 SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 
 InputFormat:         org.apache.hadoop.mapred.TextInputFormat 
 OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat 
 Compressed:          No                   
 Num Buckets:         -1                   
 Bucket Columns:      []                   
 Sort Columns:        []                   
 Storage Desc Params:   
  colelction.delim     #                 
  field.delim          ,                 
  serialization.format ,                 
 Time taken: 0.104 seconds, Fetched: 35 row(s)
hive> describe extended profile1;
 OK
 id                   int                                   
 name                 string                                
 age                  int                                   
 qual                 array<string>                 
 city                 string                                
   
 Detailed Table Information Table(tableName:profile1, dbName:practice, owner:cloudera, createTime:1542425099, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:qual, type:array, comment:null), FieldSchema(name:city, type:string, comment:null)], location:hdfs://quickstart.cloudera:8020/user/hive/warehouse/practice.db/profile1, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{colelction.delim=#, serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=1, transient_lastDdlTime=1542425232, COLUMN_STATS_ACCURATE=true, totalSize=62}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE) 
 Time taken: 0.091 seconds, Fetched: 7 row(s)

 hive> create table info1(a int, b int, c int);
  OK
  Time taken: 0.213 seconds

hive> load data local inpath 'file1.txt' into table info1;
 Loading data to table practice.info1
 Table practice.info1 stats: [numFiles=1, totalSize=39]
 OK
 Time taken: 1.364 seconds

hive> select * from info1;
 OK
 NULL NULL NULL
 NULL NULL NULL
 NULL NULL NULL
 NULL NULL NULL
 Time taken: 0.458 seconds, Fetched: 4 row(s)
Delimiter:
 The default field delimiter in Hive is Ctrl+A ('\001', often displayed as a diamond symbol).
 But the file contains zero '\001' characters,
 so the entire line is treated as a single column.
 That whole line is a string, which cannot fit into the first column 'a' (an int),
 so 'a' became NULL.
 As per the '\001' delimiter, there are no 2nd and 3rd fields in the file,
 which is why the a, b, c columns all became NULL.
Solution:
 Change the delimiter of the table.
 hive> create table info2(a int, b int, c int) row format delimited fields terminated by ',';
OK
Time taken: 0.06 seconds

hive> load data local inpath 'file1.txt' into table info2;
 Loading data to table practice.info2
 Table practice.info2 stats: [numFiles=1, totalSize=39]
 OK
 Time taken: 0.24 seconds
hive> select * from info2;
 OK
 100 200 300
 200 500 900
 11 12 12
 1 2 3
 Time taken: 0.077 seconds, Fetched: 4 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/info2
 Found 1 items
 -rwxrwxrwx   1 cloudera supergroup         39 2018-11-18 18:52 /user/hive/warehouse/practice.db/info2/file1.txt

[cloudera@quickstart ~]$ hdfs dfs -cat /user/hive/warehouse/practice.db/info2/file1.txt
 100,200,300
 200,500,900
 11,12,12
 1,2,3
Array Example:
---------------
[cloudera@quickstart ~]$ cat > emp.txt
 101,aaaa,40000,m,11
 102,bbbb,44000,f,12
 103,cccc,60000,m,12
hive> create table emp(id int, name string, sal int, sex string, dno int) row format delimited fields terminated by ',';
 OK
 Time taken: 0.067 seconds


hive> load data local inpath 'emp.txt' into table emp;
 Loading data to table practice.emp
 Table practice.emp stats: [numFiles=1, totalSize=60]
 OK
 Time taken: 0.198 seconds


hive> select * from emp;
 OK
 101 aaaa 40000 m 11
 102 bbbb 44000 f 12
 103 cccc 60000 m 12
 Time taken: 0.11 seconds, Fetched: 3 row(s)

[cloudera@quickstart ~]$ cat > profile1.txt
 101,Ravi,BTech#MTech#PhD,26,Hyd
 102,Rani,Bsc#Msc#MTech,27,Del

hive> create table profile1 (id int, name string, qual array<string>, age int, city string) row format delimited fields terminated by ',' collection items terminated by '#';
OK
Time taken: 0.09 seconds

Struct example:
---------------
[cloudera@quickstart ~]$ cat > profile2.txt
 Ravi,26,Rani#24#Hyd,Del
 Giri,24,Soni#23#Del,Hyd




hive> load data local inpath 'profile1.txt' into table profile1;
 Loading data to table practice.profile1
 Table practice.profile1 stats: [numFiles=1, totalSize=62]
 OK
 Time taken: 0.333 seconds
hive> select * from profile1;
 OK
 101 Ravi ["BTech","MTech","PhD"] 26 Hyd
 102 Rani ["Bsc","Msc","MTech"] 27 Del
 Time taken: 0.065 seconds, Fetched: 2 row(s)

hive> create table profile2(name string, age int, wife struct<name:string,age:int,city:string>, city string) row format delimited fields terminated by ',' collection items terminated by '#';
 OK
 Time taken: 0.068 seconds

hive> load data local inpath 'profile2.txt' into table profile2;
 Loading data to table practice.profile2
 Table practice.profile2 stats: [numFiles=1, totalSize=48]
 OK
 Time taken: 0.242 seconds

hive> select * from profile2;
 OK
 Ravi 26 {"name":"Rani","age":24,"city":"Hyd"} Del
 Giri 24 {"name":"Soni","age":23,"city":"Del"} Hyd
 Time taken: 0.049 seconds, Fetched: 2 row(s)
Map example:
------------
[cloudera@quickstart ~]$ cat > profile3.txt
Hari,25,BTech$67#MTech$57#MBA$89,Hyd
Mani,26,Bsc$90#Msc$80#MBA$70,Del


hive> create table profile3(name string, age int, qual map<string,int>, city string) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by '$';
 OK
 Time taken: 0.085 seconds

hive> load data local inpath 'profile3.txt' into table profile3;
 Loading data to table practice.profile3
 Table practice.profile3 stats: [numFiles=1, totalSize=70]
 OK
 Time taken: 0.163 seconds

hive> select * from profile3;
 OK
 Hari 25 {"BTech":67,"MTech":57,"MBA":89} Hyd
 Mani 26 {"Bsc":90,"Msc":80,"MBA":70} Del


Create a text file with the following contents:
-----------------------------------------------

[cloudera@quickstart ~]$ cat > prof1
101,Ravi,30,BTech#MTech#PhD,Hyd
102,Rani,25,Bsc#Msc#MTech,Del

create a database in hive:
---------------------------
hive> create database practice;
OK
Time taken: 0.183 seconds

Open a database:
-----------------
hive> use practice;
OK

create a table in hive:
-------------------------
hive> create table profile1 (id int, name string, age int, qual array<string>, city string) row format delimited fields terminated by ',' collection items terminated by '#';

load data from text file:
-------------------------
load data local inpath 'prof1' into table profile1;

Get the structure of a table:
-----------------------------
hive> describe profile1;
OK
id                   int                                   
name                 string                                
age                  int                                   
qual                 array<string>                 
city                 string                                
Time taken: 0.208 seconds, Fetched: 5 row(s)

Get more detailed description of a table:
-----------------------------------------
describe formatted profile1;
OK
# col_name             data_type            comment           
   
id                   int                                   
name                 string                                
age                  int                                   
qual                 array<string>                 
city                 string                                
   
# Detailed Table Information   
Database:            practice             
Owner:               cloudera             
CreateTime:          Fri Nov 16 19:24:59 PST 2018 
LastAccessTime:      UNKNOWN              
Protect Mode:        None                 
Retention:           0                    
Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/practice.db/profile1
Table Type:          MANAGED_TABLE        
Table Parameters:   
 COLUMN_STATS_ACCURATE true             
 numFiles             1                 
 totalSize            62               
 transient_lastDdlTime 1542425232       
   
# Storage Information   
SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 
InputFormat:         org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat 
Compressed:          No                   
Num Buckets:         -1                   
Bucket Columns:      []                   
Sort Columns:        []                   
Storage Desc Params:   
 colelction.delim     #                 
 field.delim          ,                 
 serialization.format ,                 
Time taken: 0.12 seconds, Fetched: 35 row(s)

// This is internal table so, it is stored in : user/hive/warehouse/practice.db/profile1


hive> select * from profile1;
OK
101 Ravi 30 ["BTech","MTech","PhD"] Hyd
102 Rani 25 ["Bsc","Msc","MTech"] Del
Time taken: 0.654 seconds, Fetched: 2 row(s)


// retrieve first qualification
hive> select name,qual[0] from profile1;
OK
Ravi BTech
Rani Bsc

// size(qual) --> total number of elements...
hive> select name, size(qual) from profile1;
OK
Ravi 3
Rani 3

// total number of characters in their names
select length(name) from profile1;
OK
4
4

Size is applied on Array. Length is applied on string

get Qualifications:
-------------------
hive> select qual from profile1;
OK
["BTech","MTech","PhD"]
["Bsc","Msc","MTech"]

// How to flatten array elements?
//explode the given array to make flattened strings
hive> select explode(qual) from profile1;
OK
 BTech
 MTech
 PhD
 Bsc
 Msc
 MTech

create table qtab(qual string);
OK
Time taken: 0.068 seconds
hive> insert into table qtab select explode(qual) as q from profile1;


hive> select * from qtab;
 OK
 BTech
 MTech
 PhD
 Bsc
 Msc
 MTech


MTech - how many people, BTech how many people???


// use the following to identify
select qual,count(*) from qtab group by qual;

BTech 1
Bsc 1
MTech 2
Msc 1
PhD 1

// create a result table:
hive> create table res(qual string, cnt int) row format delimited fields terminated by ',';
OK
Time taken: 0.161 seconds

// insert into select (select query result will be the input)
hive> insert into table res select qual,count(*) from qtab group by qual;

// see the result here
hive> select * from res;
OK
BTech 1
Bsc 1
MTech 2
Msc 1
PhD 1

Hive has 3 types of functions:
------------------------------
i) udf (user defined functions)
  substr(), length(), size()
  For each input row, one value is returned.

ii) udaf (user defined aggregate functions)
  sum(), count(), min(), max(), avg()
  For an entire column or an entire group, these functions return a single value.

iii) udtf (user defined table generating functions)
  explode(), json_tuple(), parse_url_tuple()
  One input row can produce many output rows (quick examples below).
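
For a quick side-by-side of the three kinds (using the profile1 table created above):

// udf: one output value per input row
hive> select name, length(name), size(qual) from profile1;

// udaf: one value for the whole table (or per group)
hive> select count(*), max(age) from profile1;

// udtf: one input row can produce many output rows
hive> select explode(qual) from profile1;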
  

create sales text file:
-------------------------  
[cloudera@quickstart ~]$ cat > sales
c101,10#20#100
c102,100#50
c103,100#50
c101,600#400

what is the bill for 101?  (10+20+100+600+400) => 1130

create transaction table:
---------------------------
hive> create table trans (cid string, price array<int>) row format delimited fields terminated by ',' collection items terminated by '#';
OK
Time taken: 0.093 seconds

load data from text file (sales) into transaction table:
-------------------------------------------------
hive>load data local inpath 'sales' into table trans;

// display the content of the table
hive> select * from trans;
OK
c101 [10,20,100]
c102 [100,50]
c103 [100,50]
c101 [600,400]

// clueless explode
hive> select explode(price) as pr from trans;
OK
10
20
100
100
50
100
50
600
400
Time taken: 0.056 seconds, Fetched: 9 row(s)

// other columns are not allowed in the select list alongside a udtf; use lateral view instead, as shown below.


c101 [20,30] ==>
     c101, 20
     c101, 30

hive> select cid, mypr from trans lateral view explode(price) p as mypr;
OK
c101 10
c101 20
c101 100
c102 100
c102 50
c103 100
c103 50
c101 600
c101 400

hive> create table strans (cid string, price int);
hive> insert into table strans select cid, mypr from trans lateral view explode(price) p as mypr;


hive> select * from strans;
OK
c101 10
c101 20
c101 100
c102 100
c102 50
c103 100
c103 50
c101 600
c101 400


Cid   Price
C101  [10,20]
C102  [30,20,40]


Explode
C101  10
C101  20

Union
    
Explode    
C102  30
C102  20
C102  40 
  
Result:
--------
C101  10
C101  20
C102  30
C102  20
C102  40 
  
select cid,sum(price) from strans group by cid;

c101 1130
c102 150
c103 150


Profile:
--------
 [cloudera@quickstart ~]$ cat > prof2
 Ravi,30,Rani#25#Del,Hyd
 Raghu,35,Sailu#23#Hyd,Del

hive> create table profile2(name string, age int, wife struct<name:string, age:int, city:string>, city string) row format delimited fields terminated by ',' collection items terminated by '#';
OK



hive> load data local inpath 'prof2' into table profile2;


hive> select * from profile2;
OK
Ravi 30 {"name":"Rani","age":25,"city":"Del"} Hyd
Raghu 35 {"name":"Sailu","age":23,"city":"Hyd"} Del


hive> describe profile2;
OK
name                 string                                
age                  int                                   
wife                 struct<name:string,age:int,city:string> 
city                 string                                
Time taken: 0.094 seconds, Fetched: 4 row(s)


hive> select name,age,wife.name,wife.age from profile2;
OK
Ravi 30 Rani 25
Raghu 35 Sailu 23
Time taken: 0.056 seconds, Fetched: 2 row(s)


hive> create table couples (hname string, wname string, hage int, wage int, hcity string, wcity string) row format delimited fields terminated by '\t';
OK

insert into table couples select name, wife.name, age, wife.age, city, wife.city from profile2;


hive> select * from couples;
OK
Ravi Rani 30 25 Hyd Del
Raghu Sailu 35 23 Del Hyd



[cloudera@quickstart ~]$ cat > prof3
Ravi,30,BTech$89#MTech$79#MBA$60,Hyd
Raghu,35,Bsc$80#Msc$70,Del



hive> create table profile3(name string, age int, qual map<string,int>, city string) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by '$';
OK



hive> load data local inpath 'prof3' into table profile3;


hive> load data local inpath 'prof3' overwrite into table profile3;

hive> select * from profile3;
 OK
 Ravi 30 {"BTech":89,"MTech":79,"MBA":60} Hyd
 Raghu 35 {"Bsc":80,"Msc":70} Del


hive> select name, qual['BTech'] from profile3;
 OK
 Ravi 89
 Raghu NULL
 Time taken: 0.056 seconds, Fetched: 2 row(s)

hive> select name,qual["Msc"] from profile3;
OK
 Ravi NULL
 Raghu 70
 Time taken: 0.046 seconds, Fetched: 2 row(s)


// get all keys from the map
hive> select map_keys(qual) from profile3;
 OK
 ["BTech","MTech","MBA"]
 ["Bsc","Msc"]


hive> select name, map_keys(qual),map_values(qual) from profile3;
 OK
 Ravi ["BTech","MTech","MBA"] [89,79,60]
 Raghu ["Bsc","Msc"] [80,70]


hive> show databases;
 OK
 default
 fbi_crime_db
 practice
 Time taken: 0.689 seconds, Fetched: 3 row(s)

hive> use practice;
OK


create table infos(name string, qual array<string>, percentage array<int>);

insert overwrite table infos select name, map_keys(qual), map_values(qual) from profile3;


hive> select * from infos;
 OK
 Ravi ["BTech","MTech","MBA"] [89,79,60]
 Raghu ["Bsc","Msc"] [80,70]


hive> select name,myq from infos lateral view explode(qual) ql as myq;
 OK
 Ravi BTech
 Ravi MTech
 Ravi MBA
 Raghu Bsc
 Raghu Msc
 Time taken: 0.064 seconds, Fetched: 5 row(s)
hive> select name,myp from infos lateral view explode(percentage) p as myp;
 OK
 Ravi 89
 Ravi 79
 Ravi 60
 Raghu 80
 Raghu 70
 Time taken: 0.056 seconds, Fetched: 5 row(s)


Hive Table types:
 Internal (managed) table
  If an internal table is dropped, both the metadata and the data are dropped.
 External table
  Only the metadata is dropped; the data files stay in HDFS (a sketch follows below).

 Partitioned and non-partitioned tables:
  By default each table is non-partitioned.

  A partition is a subdirectory inside the table's directory in HDFS.

 Advantage of partitioning:
  No need to scan all the table data -- queries that filter on the partition column read only the matching subdirectories.
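
A minimal external-table sketch (ext_emp and the HDFS location are example names, not part of the original notes):

hive> create external table ext_emp(id int, name string, sal int, sex string, dno int) row format delimited fields terminated by ',' location '/user/cloudera/ext_emp';

// dropping it removes only the metadata; the files under /user/cloudera/ext_emp stay in HDFS
hive> drop table ext_emp;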
  

[cloudera@quickstart ~]$ cat > emp
 101,aaaa,40000,m,11
 102,bbbb,50000,f,12
 103,cccc,90000,m,12
 104,dddd,10000,f,13
 105,eeee,20000,m,11
[cloudera@quickstart ~]$ cat > emp2
 201,ee,80000,f,12
 202,xx,90000,m,13
[cloudera@quickstart ~]$ cat > emp3
 301,aaaaa,900000,m,11
 302,iop,100000,f,12
 303,hhg,200000,m,11
 304,ghgh,300000,f,13

hive> create table emp(id int, name string, sal int, sex string, dno int) row format delimited fields terminated by ',';
OK


hive> load data local inpath 'emp' into table emp;
Loading data to table practice.emp
Table practice.emp stats: [numFiles=1, totalSize=100]
OK

hive> load data local inpath 'emp2' into table emp;
Loading data to table practice.emp
Table practice.emp stats: [numFiles=2, totalSize=136]
OK


hive> load data local inpath 'emp3' into table emp;
Loading data to table practice.emp
Table practice.emp stats: [numFiles=3, totalSize=219]
OK


hive> select * from emp;
 OK
 101 aaaa 40000 m 11
 102 bbbb 50000 f 12
 103 cccc 90000 m 12
 104 dddd 10000 f 13
 105 eeee 20000 m 11
 201 ee 80000 f 12
 202 xx 90000 m 13
 301 aaaaa 900000 m 11
 302 iop 100000 f 12
 303 hhg 200000 m 11
 304 ghgh 300000 f 13
 Time taken: 0.066 seconds, Fetched: 11 row(s)

hive> describe formatted emp;
OK
# col_name             data_type            comment           
   
id                   int                                   
name                 string                                
sal                  int                                   
sex                  string                                
dno                  int                                   
   
# Detailed Table Information   
Database:            practice             
Owner:               cloudera             
CreateTime:          Sun Nov 18 22:05:00 PST 2018 
LastAccessTime:      UNKNOWN              
Protect Mode:        None                 
Retention:           0                    
Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/practice.db/emp 
Table Type:          MANAGED_TABLE        
Table Parameters:   
 COLUMN_STATS_ACCURATE true             
 numFiles             3                 
 totalSize            219               
 transient_lastDdlTime 1542607585       
   
# Storage Information   
SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 
InputFormat:         org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat 
Compressed:          No                   
Num Buckets:         -1                   
Bucket Columns:      []                   
Sort Columns:        []                   
Storage Desc Params:   
 field.delim          ,                 
 serialization.format ,                 
Time taken: 0.054 seconds, Fetched: 34 row(s)

//Non partitioned table so no subdirectories. just table directory only...

hdfs dfs -ls /user/hive/warehouse/practice.db/emp
Found 3 items
-rwxrwxrwx   1 cloudera supergroup        100 2018-11-18 22:06 /user/hive/warehouse/practice.db/emp/emp
-rwxrwxrwx   1 cloudera supergroup         36 2018-11-18 22:06 /user/hive/warehouse/practice.db/emp/emp2
-rwxrwxrwx   1 cloudera supergroup         83 2018-11-18 22:06 /user/hive/warehouse/practice.db/emp/emp3


Partitions are not created when you create a table.
Partitions are created when you insert or load data into them (or add them explicitly with ALTER TABLE ... ADD PARTITION).

 create table epart(id int, name string, sal int, sex string, dno int) partitioned by (s string);

hive> insert overwrite table epart partition (s = 'f') select * from emp where sex = 'f';

hive> insert overwrite table epart partition (s = 'm') select * from emp where sex = 'm';

hive> select * from epart;
OK
102 bbbb 50000 f 12 f
104 dddd 10000 f 13 f
201 ee 80000 f 12 f
302 iop 100000 f 12 f
304 ghgh 300000 f 13 f
101 aaaa 40000 m 11 m
103 cccc 90000 m 12 m
105 eeee 20000 m 11 m
202 xx 90000 m 13 m
301 aaaaa 900000 m 11 m
303 hhg 200000 m 11 m
Time taken: 0.172 seconds, Fetched: 11 row(s)


hive> describe epart;
OK
id                   int                                   
name                 string                                
sal                  int                                   
sex                  string                                
dno                  int                                   
s                    string                                
   
# Partition Information   
# col_name             data_type            comment           
   
s                    string                                
Time taken: 0.191 seconds, Fetched: 11 row(s)
hive> describe formatted epart;
OK
# col_name             data_type            comment           
   
id                   int                                   
name                 string                                
sal                  int                                   
sex                  string                                
dno                  int                                   
   
# Partition Information   
# col_name             data_type            comment           
   
s                    string                                
   
# Detailed Table Information   
Database:            practice             
Owner:               cloudera             
CreateTime:          Sun Nov 18 22:22:37 PST 2018 
LastAccessTime:      UNKNOWN              
Protect Mode:        None                 
Retention:           0                    
Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/practice.db/epart
Table Type:          MANAGED_TABLE        
Table Parameters:   
 numPartitions        2                 
 transient_lastDdlTime 1542608557       
   
# Storage Information   
SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 
InputFormat:         org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat 
Compressed:          No                   
Num Buckets:         -1                   
Bucket Columns:      []                   
Sort Columns:        []                   
Storage Desc Params:   
 serialization.format 1                 
Time taken: 0.106 seconds, Fetched: 36 row(s)



 hdfs dfs -ls /user/hive/warehouse/practice.db/epart
Found 2 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 22:24 /user/hive/warehouse/practice.db/epart/s=f
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 22:30 /user/hive/warehouse/practice.db/epart/s=m


hdfs dfs -ls /user/hive/warehouse/practice.db/epart/s=f
Found 1 items
-rwxrwxrwx   1 cloudera supergroup         99 2018-11-18 22:24 /user/hive/warehouse/practice.db/epart/s=f/000000_0



hdfs dfs -cat /user/hive/warehouse/practice.db/epart/s=f/000000_0
102 bbbb 50000 f 12
104 dddd 10000 f 13
201 ee 80000 f 12
302 iop 100000 f 12
304 ghgh 300000 f 13


hdfs dfs -cat /user/hive/warehouse/practice.db/epart/s=m/000000_0
101 aaaa 40000 m 11
103 cccc 90000 m 12
105 eeee 20000 m 11
202 xx 90000 m 13
301 aaaaa 900000 m 11
303 hhg 200000 m 11


/user/hive/warehouse/practice.db -- database name as directory
    /epart  -- table name as directory
    /s=m   -- partition name as sub directory
    /000000_0 -- data will be here as file

/user/hive/warehouse/practice.db -- database name as directory
    /epart  -- table name as directory
    /s=f   -- partition name as sub directory
    /000000_0 -- data will be here as file
    
hive> select * from epart;
 OK
 102 bbbb 50000 f 12 f
 104 dddd 10000 f 13 f
 201 ee 80000 f 12 f
 302 iop 100000 f 12 f
 304 ghgh 300000 f 13 f
 101 aaaa 40000 m 11 m
 103 cccc 90000 m 12 m
 105 eeee 20000 m 11 m
 202 xx 90000 m 13 m
 301 aaaaa 900000 m 11 m
 303 hhg 200000 m 11 m
 Time taken: 0.074 seconds, Fetched: 11 row(s)

hive> select * from epart where s ='f';
 OK
 102 bbbb 50000 f 12 f
 104 dddd 10000 f 13 f
 201 ee 80000 f 12 f
 302 iop 100000 f 12 f
 304 ghgh 300000 f 13 f
 Time taken: 0.058 seconds, Fetched: 5 row(s)

hive> select * from epart where s ='m';
 OK
 101 aaaa 40000 m 11 m
 103 cccc 90000 m 12 m
 105 eeee 20000 m 11 m
 202 xx 90000 m 13 m
 301 aaaaa 900000 m 11 m
 303 hhg 200000 m 11 m
 Time taken: 0.073 seconds, Fetched: 6 row(s)
 hive>

To get the advantage of partitioning, the filter condition should be on the partitioned column.



// filter condition based on non-partitioned column - no advantage
hive> select * from epart where sex ='m';
 OK
 101 aaaa 40000 m 11 m
 103 cccc 90000 m 12 m
 105 eeee 20000 m 11 m
 202 xx 90000 m 13 m
 301 aaaaa 900000 m 11 m
 303 hhg 200000 m 11 m
 Time taken: 0.074 seconds, Fetched: 6 row(s)

hive> select * from epart where sex ='f';
 OK
 102 bbbb 50000 f 12 f
 104 dddd 10000 f 13 f
 201 ee 80000 f 12 f
 302 iop 100000 f 12 f
 304 ghgh 300000 f 13 f
 Time taken: 0.076 seconds, Fetched: 5 row(s)

We are now going to partition by department number.

hive> create table eparts(id int, name string, sal int, sex string, dno int) partitioned by (d int) row format delimited fields terminated by ',';

// static partition - because we need to run the following queries one by one manually (statically)
hive> insert overwrite table eparts partition (d = 11) select * from emp where dno = 11;

hive> insert overwrite table eparts partition (d = 12) select * from emp where dno = 12;

hive> insert overwrite table eparts partition (d = 13) select * from emp where dno = 13;

hive> insert overwrite table eparts partition (d = 14) select * from emp where dno = 14;

hive> insert overwrite table eparts partition (d = 15) select * from emp where dno = 15;


[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/eparts
Found 5 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 22:50 /user/hive/warehouse/practice.db/eparts/d=11
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 22:51 /user/hive/warehouse/practice.db/eparts/d=12
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 23:04 /user/hive/warehouse/practice.db/eparts/d=13
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 23:04 /user/hive/warehouse/practice.db/eparts/d=14
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 23:05 /user/hive/warehouse/practice.db/eparts/d=15


[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/eparts/d=11
Found 1 items
-rwxrwxrwx   1 cloudera supergroup         82 2018-11-18 22:50 /user/hive/warehouse/practice.db/eparts/d=11/000000_0

[cloudera@quickstart ~]$ hdfs dfs -cat  /user/hive/warehouse/practice.db/eparts/d=11/000000_0

 101,aaaa,40000,m,11
 105,eeee,20000,m,11
 301,aaaaa,900000,m,11
 303,hhg,200000,m,11

[cloudera@quickstart ~]$ hdfs dfs -cat  /user/hive/warehouse/practice.db/eparts/d=12/000000_0
 102,bbbb,50000,f,12
 103,cccc,90000,m,12
 201,ee,80000,f,12
 302,iop,100000,f,12

[cloudera@quickstart ~]$ hdfs dfs -cat  /user/hive/warehouse/practice.db/eparts/d=13/000000_0
 104,dddd,10000,f,13
 202,xx,90000,m,13
 304,ghgh,300000,f,13


// Initially we created partition based on sex ='m' or sex ='f' (s='m' or s='f')
// then we created partition based on dno (dno = 11, dno=12..)

we can also create partition using multiple columns together.

//partition based on multiple columns...
hive> create table mpart (id int, name string, sal int, sex string, dno int) partitioned by(d int, s string);

 insert overwrite table mpart partition(d=11, s='f') select * from emp where dno=11 and sex ='f';
 insert overwrite table mpart partition(d=11, s='m') select * from emp where dno=11 and sex ='m';

 insert overwrite table mpart partition(d=12, s='f') select * from emp where dno=12 and sex ='f';
 insert overwrite table mpart partition(d=12, s='m') select * from emp where dno=12 and sex ='m';

 insert overwrite table mpart partition(d=13, s='f') select * from emp where dno=13 and sex ='f';
 insert overwrite table mpart partition(d=13, s='m') select * from emp where dno=13 and sex ='m';

hdfs dfs -ls /user/hive/warehouse/practice.db/mpart
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 23:18 /user/hive/warehouse/practice.db/mpart/d=11

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/mpart/d=11
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 23:18 /user/hive/warehouse/practice.db/mpart/d=11/s=f

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/mpart/d=11/s=f
Found 1 items
-rwxrwxrwx   1 cloudera supergroup          0 2018-11-18 23:18 /user/hive/warehouse/practice.db/mpart/d=11/s=f/000000_0

[cloudera@quickstart ~]$ hdfs dfs -cat /user/hive/warehouse/practice.db/mpart/d=11/s=f/000000_0

hdfs dfs -ls /user/hive/warehouse/practice.db/mpart/d=11
Found 2 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 23:18 /user/hive/warehouse/practice.db/mpart/d=11/s=f
drwxrwxrwx   - cloudera supergroup          0 2018-11-18 23:22 /user/hive/warehouse/practice.db/mpart/d=11/s=m


hdfs dfs -cat /user/hive/warehouse/practice.db/mpart/d=11/s=m/000000_0
101 aaaa 40000 m 11
105 eeee 20000 m 11
301 aaaaa 900000 m 11
303 hhg 200000 m 11

/user/hive/warehouse/practice.db/mpart
      /d=11
       /s=f/000000_0
       /s=m/000000_0
      /d=12
       /s=f/000000_0
       /s=m/000000_0
      /d=13
       /s=f/000000_0
       /s=m/000000_0
       


Dynamic partitions with multiple columns:
-------------------------------------------
create table dpart(id int, name string, sal int, sex string, dno int) partitioned by (d int, s string) row format delimited fields terminated by  ',';

settings to enable dynamic partitions:
---------------------------------------
hive> set hive.exec.dynamic.partition = true;
hive> set hive.exec.dynamic.partition.mode = nonstrict;
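// note: the dynamic partition columns (d, s) must come last in the select list, in the same order as in partitioned by (d int, s string) -- that is why dno and sex are repeated at the end below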

insert overwrite table dpart partition(d,s) select id,name,sal,sex,dno,dno,sex from emp;


// look here -- dynamic partitions created
hdfs dfs -ls /user/hive/warehouse/practice.db/dpart
Found 3 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 00:34 /user/hive/warehouse/practice.db/dpart/d=11
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 00:34 /user/hive/warehouse/practice.db/dpart/d=12
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 00:34 /user/hive/warehouse/practice.db/dpart/d=13


// based on the input data, the following subfolders were created dynamically (on the fly)
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/dpart/d=11
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 00:34 /user/hive/warehouse/practice.db/dpart/d=11/s=m
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/dpart/d=12
Found 2 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 00:34 /user/hive/warehouse/practice.db/dpart/d=12/s=f
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 00:34 /user/hive/warehouse/practice.db/dpart/d=12/s=m
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/dpart/d=13
Found 2 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 00:34 /user/hive/warehouse/practice.db/dpart/d=13/s=f
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 00:34 /user/hive/warehouse/practice.db/dpart/d=13/s=m


// without scanning all the records, it fetches directly from the d=13/s=f folder
hive> select * from dpart where d = 13 and s='f';
OK
104 dddd 10000 f 13 13 f
304 ghgh 300000 f 13 13 f
Time taken: 0.178 seconds, Fetched: 2 row(s)



sample data:
-------------
gedit sales.txt
----------------
01/01/2011,45000
02/01/2011,46000
03/01/2011,45500
04/01/2011,46050
05/01/2011,45005
06/01/2011,46010
07/01/2011,45200
08/01/2011,43000
09/01/2011,48000
10/01/2011,42001
11/01/2011,45002
12/01/2011,46003
13/01/2011,45504
14/01/2011,46055
15/01/2011,45006
16/01/2011,46017
17/01/2011,45208
18/01/2011,43009
19/01/2011,48010
20/01/2011,42011
01/01/2011,45000
02/01/2011,46000
03/01/2011,45500
04/01/2011,46050
05/01/2011,45005
06/01/2011,46010
07/01/2011,45200
08/01/2011,43000
09/01/2011,48000
10/01/2011,42001
11/01/2011,45002
12/01/2011,46003
13/01/2011,45504
14/01/2011,46055
15/01/2011,45006
16/01/2011,46017
17/01/2011,45208
18/01/2011,43009
19/01/2011,48010
20/01/2011,42011
01/01/2011,45000
02/01/2011,46000
03/01/2011,45500
04/01/2011,46050
05/01/2011,45005
06/01/2011,46010
07/01/2011,45200
08/01/2011,43000
09/01/2011,48000
10/01/2011,42001
11/01/2011,45002
12/01/2011,46003
13/01/2011,45504
14/01/2011,46055
15/01/2011,45006
16/01/2011,46017
17/01/2011,45208
18/01/2011,43009
19/01/2011,48010
20/01/2011,42011


hive> create table raw (dt string, amt int) row format delimited fields terminated by ',';
 OK
 Time taken: 0.073 seconds

hive> load data local inpath 'sales.txt' into table raw;
 Loading data to table practice.raw
 Table practice.raw stats: [numFiles=1, totalSize=52]
 OK
 Time taken: 0.182 second
hive> select * from raw;
 OK
 01/01/2011 45000
 02/01/2011 46000
 03/01/2011 45500
 04/01/2011 46050
 05/01/2011 45005
 06/01/2011 46010
 07/01/2011 45200
 08/01/2011 43000
 09/01/2011 48000
 10/01/2011 42001
 11/01/2011 45002
 12/01/2011 46003
 13/01/2011 45504
 14/01/2011 46055
 15/01/2011 45006
 16/01/2011 46017
 17/01/2011 45208
 18/01/2011 43009
 19/01/2011 48010
 20/01/2011 42011
 Time taken: 0.047 seconds, Fetched: 20 row(s)


RDBMS date format here is MM/DD/YYYY (e.g. 01/20/2011),
but the Hive date format is YYYY-MM-DD (e.g. 2011-01-20).
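
As an aside, the same conversion can usually be done in one step with Hive's built-in date functions instead of the split()/concat() approach used below (a sketch, assuming the source strings really are MM/dd/yyyy):

hive> select from_unixtime(unix_timestamp(dt,'MM/dd/yyyy'),'yyyy-MM-dd') as dt, amt from raw;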

hive> create table raw2 (dt array<string>, amt int);
OK
Time taken: 0.054 seconds

hive> insert into table raw2 select split(dt,'/'), amt from raw;

hive> select split(dt,'/'), amt from raw;
OK
["01","01","2011"] 45000
["02","01","2011"] 46000

hive> select * from raw2;
OK
["01","01","2011"] 45000
["02","01","2011"] 46000


hive> select concat(dt[2],'-',dt[0],'-',dt[1]) from raw2;
OK
2011-01-01
2011-02-01
2011-03-01

hive> create table rawx like raw;

// make duplicate rows for 2012, 2013...
 insert into table rawx select dt, amt from raw union all select concat(substr(dt,1,9),'2') as dt, amt+2000 as amt from raw union all select concat(substr(dt,1,9),'3') as dt, amt+10000 as amt from raw;


//rawx has 3 years data
hive> select * from rawx;
OK
01/01/2011 45000
01/01/2012 47000
01/01/2013 55000
02/01/2011 46000


hive> insert overwrite table raw2 select split(dt,'/'), amt from rawx;

hive> set hive.exec.dynamic.partition = true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode = 100000;


hive> create table sales(dt string, amt int);
OK
Time taken: 0.068 seconds
hive> insert into table sales select concat(dt[2],'-',dt[0],'-',dt[1]),amt from raw2;


hive> select * from sales limit 3;
OK
2011-01-01 45000
2012-01-01 47000
2013-01-01 55000
Time taken: 0.058 seconds, Fetched: 3 row(s)


hive> create table spart (dt string, amt int) partitioned by (y int, m int, d int) row format delimited fields terminated by ',';
OK

hive> insert overwrite table spart partition (y,m,d) select dt, amt, year(dt),month(dt),day(dt) from sales;

 hdfs dfs -ls /user/hive/warehouse/practice.db/spart
Found 4 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2012
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2013
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2014



[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/spart/y=2011
Found 12 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=1
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=10
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=11
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=12
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=2
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=3
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=4
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=5
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=6
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=7
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=8
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=9
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/spart/y=2011/m=1
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=1/d=1
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/practice.db/spart/y=2011/m=11
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-19 01:30 /user/hive/warehouse/practice.db/spart/y=2011/m=11/d=1
[cloudera@quickstart ~]$



Static:
 partition (y=1999, m=12, d=30) -- partition values given explicitly
dynamic:
 partition (y,m,d) -- partition values taken from the data

select * from sales where dt = '2017-08-07' -- it will scan all the records.. bad performance

hive> select * from spart where m = 1 and d = 1;
OK
2011-01-01 45000 2011 1 1
2011-01-01 45000 2011 1 1
2011-01-01 45000 2011 1 1
2012-01-01 47000 2012 1 1
2011-13-01 45504 2012 1 1
2012-01-01 47000 2012 1 1
2011-13-01 45504 2012 1 1
2012-01-01 47000 2012 1 1
2011-13-01 45504 2012 1 1
2013-01-01 55000 2013 1 1
2012-13-01 47504 2013 1 1
2013-01-01 55000 2013 1 1
2012-13-01 47504 2013 1 1
2013-01-01 55000 2013 1 1
2012-13-01 47504 2013 1 1
2013-13-01 55504 2014 1 1
2013-13-01 55504 2014 1 1
2013-13-01 55504 2014 1 1
Time taken: 0.051 seconds, Fetched: 18 row(s)
hive> select * from spart where y = 2011 and m = 1 and d = 1;
OK
2011-01-01 45000 2011 1 1
2011-01-01 45000 2011 1 1
2011-01-01 45000 2011 1 1
Time taken: 0.052 seconds, Fetched: 3 row(s)
hive>


xml data processing using Hive:
sample xml data:

 <rec>
  <name>Ravi</name>
  <age>25</age>
 </rec>
 <rec>
  <name>Rani</name>
  <sex>f</sex>
 </rec>

if the above file is loaded into a hive table, there will be 8 records (rows) -- one per line.

we need to transform each vertical xml record into a horizontal xml record
[using MapReduce / Spark].

After transformation:
---------------------
<rec><name>Ravi</name><age>25</age></rec>
<rec><name>Rani</name><sex>f</sex></rec>
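
For small files, the same flattening can also be done locally with a shell one-liner instead of MapReduce/Spark (a rough sketch; vertical.xml is a hypothetical file containing the multi-line records shown above):

# join all lines into one, then start a new line after every closing </rec> tag
[cloudera@quickstart ~]$ tr -d '\n' < vertical.xml | sed 's#</rec>#</rec>\n#g' > myxml.xml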



$ cat > myxml.xml
<rec><name>Ravi</name><age>25</age></rec>
<rec><name>Rani</name><sex>f</sex></rec>


[cloudera@quickstart ~]$ cat myxml.xml
<rec><name>Ravi</name><age>25</age></rec>
<rec><name>Rani</name><sex>f</sex></rec>



Now the records are converted from vertical into horizontal form.

xml is semi-structured

hive> create database xmls;
OK
Time taken: 0.211 seconds


hive> use xmls;
OK
Time taken: 0.008 seconds


hive> create table raw (line string);
OK
Time taken: 0.061 seconds


hive> load data local inpath 'myxml.xml' into table raw;
Loading data to table xmls.raw
Table xmls.raw stats: [numFiles=1, totalSize=83]
OK
Time taken: 0.123 seconds


hive> select * from raw;
OK
<rec><name>Ravi</name><age>25</age></rec>
<rec><name>Rani</name><sex>f</sex></rec>
Time taken: 0.028 seconds, Fetched: 2 row(s)

hive> select xpath_string(line,'rec/name') from raw;
OK
Ravi
Rani
Time taken: 0.035 seconds, Fetched: 2 row(s)


hive> select xpath_int(line,'rec/age') from raw;
OK
25
0
Time taken: 0.035 seconds, Fetched: 2 row(s)

if a string field is missing, you will get an empty string;
if a numeric field is missing, you will get 0.

hive> create table info(name string, age int, sex string);

hive> insert into table info select xpath_string(line,'rec/name'), xpath_int(line,'rec/age'), xpath_string(line,'rec/sex') from raw;

hive> select * from info;
OK
Ravi 25 
Rani 0 f
Time taken: 0.035 seconds, Fetched: 2 row(s)


cat > xml2
<rec><name><fname>Siri</fname><lname>Veni</lname></name><age>23</age><contact><email><personal>siri@gmail.com</personal><official>siri@ibm.com</official></email><phone><mobile>9886177375</mobile><office>9886177375</office><residence>9886177375</residence></phone></contact><city>Hyderabad</city></rec>


hive> create table xraw (line string);
OK
Time taken: 0.03 seconds


hive> load data local inpath 'xml2' into table xraw;
Loading data to table xmls.xraw
Table xmls.xraw stats: [numFiles=1, totalSize=302]
OK
Time taken: 0.126 seconds


hive> select * from xraw;
OK
<rec><name><fname>Siri</fname><lname>Veni</lname></name><age>23</age><contact><email><personal>siri@gmail.com</personal><official>siri@ibm.com</official></email><phone><mobile>9886177375</mobile><office>9886177375</office><residence>9886177375</residence></phone></contact><city>Hyderabad</city></rec>
Time taken: 0.031 seconds, Fetched: 1 row(s)


hive> select xpath_string(line,'rec/name/fname'), xpath_string(line,'rec/name/lname'),xpath_int(line,'rec/age'), xpath_string(line,'rec/contact/email/personal'),xpath_string(line,'rec/contact/email/official'),xpath_string(line,'rec/contact/phone/mobile'),xpath_string(line,'rec/contact/phone/office'),xpath_string(line,'rec/contact/phone/residence'),xpath_string(line,'rec/city') from xraw;
OK
Siri Veni 23 siri@gmail.com siri@ibm.com 9886177375 9886177375 9886177375 Hyderabad
Time taken: 0.058 seconds, Fetched: 1 row(s)



hive> create table xinfo(fname string, lname string, age int, personal_email string, official_email string, mobile string, office_phone string, residence_phone string, city string);
OK
Time taken: 0.033 seconds
hive>


hive> insert into table xinfo select xpath_string(line,'rec/name/fname'), xpath_string(line,'rec/name/lname'),xpath_int(line,'rec/age'), xpath_string(line,'rec/contact/email/personal'),xpath_string(line,'rec/contact/email/official'),xpath_string(line,'rec/contact/phone/mobile'),xpath_string(line,'rec/contact/phone/office'),xpath_string(line,'rec/contact/phone/residence'),xpath_string(line,'rec/city') from xraw;


hive> select * from xraw;
OK
<rec><name><fname>Siri</fname><lname>Veni</lname></name><age>23</age><contact><email><personal>siri@gmail.com</personal><official>siri@ibm.com</official></email><phone><mobile>9886177375</mobile><office>9886177375</office><residence>9886177375</residence></phone></contact><city>Hyderabad</city></rec>
Time taken: 0.602 seconds, Fetched: 1 row(s)



hive> select * from xinfo;
OK
Siri Veni 23 siri@gmail.com siri@ibm.com 9886177375 9886177375 9886177375 Hyderabad
Time taken: 0.038 seconds, Fetched: 1 row(s)
hive>



cat > xml3.xml
<rec><name>Ravi</name><qual>Btech</qual><qual>MTech</qual></rec>
<rec><name>Giri</name><qual>Btech</qual><qual>MTech</qual><qual>PhD</qual></rec>
<rec><name>Veni</name><qual>Bsc</qual><qual>Msc</qual><qual>MTech</qual></rec>

hive> use xmls;
OK
Time taken: 0.011 seconds



hive> create table xmlraw(line string);
OK
Time taken: 0.121 seconds



hive> load data local inpath 'xml3.xml' into table xmlraw;
Loading data to table xmls.xmlraw
Table xmls.xmlraw stats: [numFiles=1, totalSize=226]
OK
Time taken: 0.432 seconds



hive> select * from xmlraw;
OK
<rec><name>Ravi</name><qual>Btech</qual><qual>MTech</qual></rec>
<rec><name>Giri</name><qual>Btech</qual><qual>MTech</qual><qual>PhD</qual></rec>
<rec><name>Veni</name><qual>Bsc</qual><qual>Msc</qual><qual>MTech</qual></rec>
Time taken: 0.061 seconds, Fetched: 3 row(s)
hive>


hive> select xpath_string(line,'rec/name') from xmlraw;
OK
Ravi
Giri
Veni
Time taken: 0.085 seconds, Fetched: 3 row(s)

// Here it fetches only the 1st qualification.
hive> select xpath_string(line,'rec/qual') from xmlraw;
OK
Btech
Btech
Bsc
Time taken: 0.055 seconds, Fetched: 3 row(s)


hive> select xpath(line,'rec/qual/text()') from xmlraw;
OK
["Btech","MTech"]
["Btech","MTech","PhD"]
["Bsc","Msc","MTech"]
Time taken: 0.044 seconds, Fetched: 3 row(s)


hive> create table raw2(name string, qual array<string>);

hive> insert overwrite table raw2 select xpath_string(line,'rec/name'), xpath(line,'rec/qual/text()') from xmlraw;


hive> select * from raw2;
OK
Ravi ["Btech","MTech"]
Giri ["Btech","MTech","PhD"]
Veni ["Bsc","Msc","MTech"]


hive> select name,size(qual) from raw2;
OK
Ravi 2
Giri 3
Veni 3
Time taken: 0.039 seconds, Fetched: 3 row(s)


hive> select * from raw2 where array_contains(qual,'Bsc');
OK
Veni ["Bsc","Msc","MTech"]
Time taken: 0.047 seconds, Fetched: 1 row(s)


hive> select * from raw2;
OK
Ravi ["Btech","MTech"]
Giri ["Btech","MTech","PhD"]
Veni ["Bsc","Msc","MTech"]
Time taken: 0.067 seconds, Fetched: 3 row(s)


hive> select * from raw2 where array_contains(qual,'Btech');
OK
Ravi ["Btech","MTech"]
Giri ["Btech","MTech","PhD"]
Time taken: 0.048 seconds, Fetched: 2 row(s)



hive> select explode(qual) as q from raw2;
OK
Btech
MTech
Btech
MTech
PhD
Bsc
Msc
MTech
Time taken: 0.04 seconds, Fetched: 8 row(s)


hive> select name, myq from raw2 lateral view explode(qual) q as myq;
OK
Ravi Btech
Ravi MTech
Giri Btech
Giri MTech
Giri PhD
Veni Bsc
Veni Msc
Veni MTech
Time taken: 0.045 seconds, Fetched: 8 row(s)



cat xml4.xml
<tr><cid>101</cid><pr>1000</pr><pr>2000</pr><pr>4000</pr></tr>
<tr><cid>102</cid><pr>3000</pr><pr>5000</pr></tr>
<tr><cid>101</cid><pr>5000</pr></tr>


hive> create table sraw(line string);
OK
Time taken: 0.052 seconds


hive> load data local inpath 'xml4.xml' into table sraw;
Loading data to table xmls.sraw
Table xmls.sraw stats: [numFiles=1, totalSize=150]
OK
Time taken: 0.106 seconds


hive> select * from sraw;
OK
<tr><cid>101</cid><pr>1000</pr><pr>2000</pr><pr>4000</pr></tr>
<tr><cid>102</cid><pr>3000</pr><pr>5000</pr></tr>
<tr><cid>101</cid><pr>5000</pr></tr>
Time taken: 0.041 seconds, Fetched: 3 row(s)


hive> select xpath_int(line,'tr/cid') from sraw;
OK
101
102
101
Time taken: 0.044 seconds, Fetched: 3 row(s)


hive> select xpath(line,'tr/pr/text()') from sraw;
OK
["1000","2000","4000"]
["3000","5000"]
["5000"]
Time taken: 0.045 seconds, Fetched: 3 row(s)


hive> create table sraw2(cid int, pr array<string>);
OK
Time taken: 0.035 seconds

hive> insert into table sraw2 select xpath_int(line,'tr/cid'),xpath(line,'tr/pr/text()') from sraw;


hive> select * from sraw2;
OK
101 ["1000","2000","4000"]
102 ["3000","5000"]
101 ["5000"]
Time taken: 0.04 seconds, Fetched: 3 row(s)


hive> select * from sraw;
OK
<tr><cid>101</cid><pr>1000</pr><pr>2000</pr><pr>4000</pr></tr>
<tr><cid>102</cid><pr>3000</pr><pr>5000</pr></tr>
<tr><cid>101</cid><pr>5000</pr></tr>
Time taken: 0.041 seconds, Fetched: 3 row(s)


hive> select * from sraw2;
OK
101 ["1000","2000","4000"]
102 ["3000","5000"]
101 ["5000"]
Time taken: 0.033 seconds, Fetched: 3 row(s)


hive> select explode(pr) from sraw2;
OK
1000
2000
4000
3000
5000
5000
Time taken: 0.041 seconds, Fetched: 6 row(s)



hive> create table sales(cid int, pr int) row format delimited fields terminated by ',';
OK
Time taken: 0.052 seconds


hive> insert into table sales select cid, mypr from sraw2 lateral view explode(pr) p as mypr;


hive> select * from sales;
OK
101 1000
101 2000
101 4000
102 3000
102 5000
101 5000
Time taken: 0.037 seconds, Fetched: 6 row(s)


hive> create table results (cid int, tot int);
OK
Time taken: 0.039 seconds


hive> insert into table results select cid, sum(pr) from sales group by cid;


hive> select * from results;
OK
101 12000
102 8000
Time taken: 0.045 seconds, Fetched: 2 row(s)



cat xml5.xml
<tr><cid>101</cid><pr id="p1">1000</pr><pr>2000</pr><pr id="p7">4000</pr></tr>
<tr><cid>102</cid><pr id="p7">3000</pr><pr id="p2">5000</pr></tr>


hive> create table xxraw(line string);
OK
Time taken: 0.047 seconds


hive> load data local inpath 'xml5.xml' into table xxraw;
Loading data to table xmls.xxraw
Table xmls.xxraw stats: [numFiles=1, totalSize=145]
OK
Time taken: 0.115 seconds


hive> select * from xxraw;
OK
<tr><cid>101</cid><pr id="p1">1000</pr><pr>2000</pr><pr id="p7">4000</pr></tr>
<tr><cid>102</cid><pr id="p7">3000</pr><pr id="p2">5000</pr></tr>
Time taken: 0.035 seconds, Fetched: 2 row(s)



hive> select xpath_int(line,'tr/cid') from xxraw;
OK
101
102
Time taken: 0.044 seconds, Fetched: 2 row(s)


hive> select xpath(line,'tr/cid/text()') from xxraw;
OK
["101"]
["102"]
Time taken: 0.056 seconds, Fetched: 2 row(s)


hive> select xpath(line,'tr/pr/text()') from xxraw;
OK
["1000","2000","4000"]
["3000","5000"]
Time taken: 0.046 seconds, Fetched: 2 row(s)


hive> select xpath(line,'tr/pr/@id') from xxraw;
OK
["p1","p7"]
["p7","p2"]
Time taken: 0.04 seconds, Fetched: 2 row(s)


hive> select xpath_int(line,'tr/cid'), xpath(line,'tr/pr/@id'),xpath(line,'tr/pr/text()') from xxraw;
OK
101 ["p1","p7"] ["1000","2000","4000"]
102 ["p7","p2"] ["3000","5000"]
Time taken: 0.087 seconds, Fetched: 2 row(s)


JSON integration with Hive:
---------------------------
cat json1.json
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyderabad"}
{"name":"Mani","age":24,"city":"Delhi"}


hive> create database jsons;
OK
Time taken: 0.068 seconds


hive> use jsons;
OK
Time taken: 0.007 seconds


hive> create table raw(line string);
OK
Time taken: 0.07 seconds


hive> load data local inpath 'json1.json' into table raw;
Loading data to table jsons.raw
Table jsons.raw stats: [numFiles=1, totalSize=100]
OK
Time taken: 0.121 seconds


hive> select * from raw;
OK
{"name":"Ravi","age":25}
{"name":"Rani","city":"Hyderabad"}
{"name":"Mani","age":24,"city":"Delhi"}
Time taken: 0.041 seconds, Fetched: 3 row(s)


hive> select get_json_object(line,'$.name') from raw;
OK
Ravi
Rani
Mani
Time taken: 0.052 seconds, Fetched: 3 row(s)


hive> select get_json_object(line,'$.age') from raw;
OK
25
NULL
24
Time taken: 0.052 seconds, Fetched: 3 row(s)


hive> select get_json_object(line,'$.city') from raw;
OK
NULL
Hyderabad
Delhi
Time taken: 0.045 seconds, Fetched: 3 row(s)


hive> select get_json_object(line,'$.name'),get_json_object(line,'$.age'),get_json_object(line,'$.city') from raw;
OK
Ravi 25 NULL
Rani NULL Hyderabad
Mani 24 Delhi
Time taken: 0.036 seconds, Fetched: 3 row(s)


hive> select x.* from raw lateral view json_tuple(line,'name','age','city') x as n,a,c;
OK
Ravi 25 NULL
Rani NULL Hyderabad
Mani 24 Delhi


hive> create table info (name string, age int, city string);
OK
Time taken: 0.026 seconds


hive> insert overwrite table info select x.* from raw lateral view json_tuple(line,'name','age','city') x as n,a,c;


hive> select * from info;
OK
Ravi 25 NULL
Rani NULL Hyderabad
Mani 24 Delhi
Time taken: 0.03 seconds, Fetched: 3 row(s)


get_json_object() -- udf
json_tuple() -- udtf

if a field is missing from the JSON record, NULL is returned for it



cat json2.json
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"Hyd"},"city":"Delhi"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"Hyd"},"city":"Hyd"}


hive> create table jraw(line string);
OK
Time taken: 0.071 seconds


hive> load data local inpath 'json2.json' into table jraw;
Loading data to table jsons.jraw
Table jsons.jraw stats: [numFiles=1, totalSize=175]
OK
Time taken: 0.117 seconds


hive> select * from jraw;
OK
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"Hyd"},"city":"Delhi"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"Hyd"},"city":"Hyd"}
Time taken: 0.031 seconds, Fetched: 2 row(s)


hive> create table raw2(name string, age int, wife string, city string);
OK
Time taken: 0.026 seconds


hive> insert into table raw2 select x.* from jraw lateral view json_tuple(line,'name','age','wife','city') x as n,a,w,c;

hive> select * from raw2;
OK
Ravi 25 {"name":"Rani","age":24,"city":"Hyd"} Delhi
Kiran 30 {"name":"Veni","qual":"BTech","city":"Hyd"} Hyd
Time taken: 0.03 seconds, Fetched: 2 row(s)


hive> select * from jraw;
OK
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"Hyd"},"city":"Delhi"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"Hyd"},"city":"Hyd"}
Time taken: 0.029 seconds, Fetched: 2 row(s)


hive> select * from raw2;
OK
Ravi 25 {"name":"Rani","age":24,"city":"Hyd"} Delhi
Kiran 30 {"name":"Veni","qual":"BTech","city":"Hyd"} Hyd
Time taken: 0.027 seconds, Fetched: 2 row(s)


hive> create table jinfo(hname string, hage int, hcity string, wname string, wage int, wcity string, wqual string) row format delimited fields terminated by ',';
OK
Time taken: 0.047 seconds



hive> insert into table jinfo select name, age,city,get_json_object(wife,'$.name'),get_json_object(wife,'$.age'),get_json_object(wife,'$.city'),get_json_object(wife,'$.qual') from raw2;


hive> select * from jinfo;
OK
Ravi 25 Delhi Rani 24 Hyd NULL
Kiran 30 Hyd Veni NULL Hyd BTech
Time taken: 0.027 seconds, Fetched: 2 row(s)



hive> select * from jraw;
OK
{"name":"Ravi","age":25,"wife":{"name":"Rani","age":24,"city":"Hyd"},"city":"Delhi"}
{"name":"Kiran","age":30,"wife":{"name":"Veni","qual":"BTech","city":"Hyd"},"city":"Hyd"}
Time taken: 0.035 seconds, Fetched: 2 row(s)


hive> select * from raw2;
OK
Ravi 25 {"name":"Rani","age":24,"city":"Hyd"} Delhi
Kiran 30 {"name":"Veni","qual":"BTech","city":"Hyd"} Hyd
Time taken: 0.043 seconds, Fetched: 2 row(s)


hive> select * from jinfo;
OK
Ravi 25 Delhi Rani 24 Hyd NULL
Kiran 30 Hyd Veni NULL Hyd BTech
Time taken: 0.027 seconds, Fetched: 2 row(s)



cat json3.json
{"name":"Ravi","qual":["btech","mtech","phd"]}
{"name":"Mani","qual":["bsc","mba"]}
{"name":"Rani","qual":["bsc","msc","mtech"]}



hive> create table jsraw(line string);
OK
Time taken: 0.053 seconds


hive> load data local inpath 'json3.json' into table jsraw;
Loading data to table jsons.jsraw
Table jsons.jsraw stats: [numFiles=1, totalSize=129]
OK
Time taken: 0.108 seconds


hive> select * from jsraw;
OK
{"name":"Ravi","qual":["btech","mtech","phd"]}
{"name":"Mani","qual":["bsc","mba"]}
{"name":"Rani","qual":["bsc","msc","mtech"]}
Time taken: 0.032 seconds, Fetched: 3 row(s)



hive> create table jsraw2(name string, qual string);

hive> insert into table jsraw2 select x.* from jsraw lateral view json_tuple(line,'name','qual') x as n,q;



hive> select * from jsraw2;
OK
Ravi ["btech","mtech","phd"]
Mani ["bsc","mba"]
Rani ["bsc","msc","mtech"]
Time taken: 0.027 seconds, Fetched: 3 row(s)



hive> select split(qual,',') from jsraw2;
OK
["[\"btech\"","\"mtech\"","\"phd\"]"]
["[\"bsc\"","\"mba\"]"]
["[\"bsc\"","\"msc\"","\"mtech\"]"]
Time taken: 0.035 seconds, Fetched: 3 row(s)


hive> create table raw3 (name string, qual array<string>);
OK
Time taken: 0.03 seconds


hive> insert into table raw3 select name, split(qual,',') from jsraw2;



hive> select * from raw3;
OK
Ravi ["[\"btech\"","\"mtech\"","\"phd\"]"]
Mani ["[\"bsc\"","\"mba\"]"]
Rani ["[\"bsc\"","\"msc\"","\"mtech\"]"]
Time taken: 0.033 seconds, Fetched: 3 row(s)



hive> select explode(qual) from raw3;
OK
["btech"
"mtech"
"phd"]
["bsc"
"mba"]
["bsc"
"msc"
"mtech"]
Time taken: 0.05 seconds, Fetched: 8 row(s)


hive> create table raw4(name string, qual string);
OK
Time taken: 0.03 seconds


hive> insert overwrite table raw4 select name, myq from raw3 lateral view explode(qual) q as myq;


hive> select * from raw4;
OK
Ravi ["btech"
Ravi "mtech"
Ravi "phd"]
Mani ["bsc"
Mani "mba"]
Rani ["bsc"
Rani "msc"
Rani "mtech"]
Time taken: 0.029 seconds, Fetched: 8 row(s)



hive> select split(qual,'"') from raw4;
OK
["[","btech",""]
["","mtech",""]
["","phd","]"]
["[","bsc",""]
["","mba","]"]
["[","bsc",""]
["","msc",""]
["","mtech","]"]
Time taken: 0.032 seconds, Fetched: 8 row(s)


hive> select split(qual,'"')[1] from raw4;
OK
btech
mtech
phd
bsc
mba
bsc
msc
mtech
Time taken: 0.039 seconds, Fetched: 8 row(s)


hive> create table jsinfo like raw4;
OK
Time taken: 0.054 seconds

hive> insert into table jsinfo select name,split(qual,'"')[1] from raw4;


hive> select * from jsinfo;
OK
Ravi btech
Ravi mtech
Ravi phd
Mani bsc
Mani mba
Rani bsc
Rani msc
Rani mtech
Time taken: 0.036 seconds, Fetched: 8 row(s)



hive> select * from jsraw;
OK
{"name":"Ravi","qual":["btech","mtech","phd"]}
{"name":"Mani","qual":["bsc","mba"]}
{"name":"Rani","qual":["bsc","msc","mtech"]}
Time taken: 0.069 seconds, Fetched: 3 row(s)



hive> select * from jsraw2;
OK
Ravi ["btech","mtech","phd"]
Mani ["bsc","mba"]
Rani ["bsc","msc","mtech"]
Time taken: 0.031 seconds, Fetched: 3 row(s)



hive> select * from raw3;
OK
Ravi ["[\"btech\"","\"mtech\"","\"phd\"]"]
Mani ["[\"bsc\"","\"mba\"]"]
Rani ["[\"bsc\"","\"msc\"","\"mtech\"]"]
Time taken: 0.031 seconds, Fetched: 3 row(s)


hive> select * from raw4;
OK
Ravi ["btech"
Ravi "mtech"
Ravi "phd"]
Mani ["bsc"
Mani "mba"]
Rani ["bsc"
Rani "msc"
Rani "mtech"]
Time taken: 0.032 seconds, Fetched: 8 row(s)


hive> select * from jsinfo;
OK
Ravi btech
Ravi mtech
Ravi phd
Mani bsc
Mani mba
Rani bsc
Rani msc
Rani mtech
Time taken: 0.05 seconds, Fetched: 8 row(s)



URL parser:
------------
[cloudera@quickstart ~]$ cat urls.txt
http://training.com/bigdata/hadoop?id=101&name=Giri&age=23&city=hyd
http://training.com/bigdata/spark?id=101&name=Diri&sex=f&city=del
http://training.com/bigdata/spark?id=102&name=Xiri&age=33&sex=f
http://training.com/bigdata/spark?id=104&name=Miri&age=23&sex=m


hive> create database urls;
OK
Time taken: 0.044 seconds



hive> use urls;
OK
Time taken: 0.008 seconds



hive> create table raw(line string);
OK
Time taken: 0.034 seconds


hive> load data local inpath 'urls.txt' into table raw;
Loading data to table urls.raw
Table urls.raw stats: [numFiles=1, totalSize=262]
OK
Time taken: 0.101 seconds


hive> select * from raw;
OK
http://training.com/bigdata/hadoop?id=101&name=Giri&age=23&city=hyd
http://training.com/bigdata/spark?id=101&name=Diri&sex=f&city=del
http://training.com/bigdata/spark?id=102&name=Xiri&age=33&sex=f
http://training.com/bigdata/spark?id=104&name=Miri&age=23&sex=m
Time taken: 0.026 seconds, Fetched: 4 row(s)


hive> select parse_url(line,'HOST') from raw;
OK
training.com
training.com
training.com
training.com
Time taken: 0.032 seconds, Fetched: 4 row(s)



hive> select parse_url(line,'PATH') from raw;
OK
/bigdata/hadoop
/bigdata/spark
/bigdata/spark
/bigdata/spark
Time taken: 0.032 seconds, Fetched: 4 row(s)




hive> select parse_url(line,'QUERY') from raw;
OK
id=101&name=Giri&age=23&city=hyd
id=101&name=Diri&sex=f&city=del
id=102&name=Xiri&age=33&sex=f
id=104&name=Miri&age=23&sex=m
Time taken: 0.035 seconds, Fetched: 4 row(s)


hive> select parse_url(line,'HOST'), parse_url(line,'PATH'), parse_url(line,'QUERY') from raw;
OK
training.com /bigdata/hadoop id=101&name=Giri&age=23&city=hyd
training.com /bigdata/spark id=101&name=Diri&sex=f&city=del
training.com /bigdata/spark id=102&name=Xiri&age=33&sex=f
training.com /bigdata/spark id=104&name=Miri&age=23&sex=m
Time taken: 0.062 seconds, Fetched: 4 row(s)


// using lateral view
hive> select x.* from raw lateral view parse_url_tuple(line,'HOST','PATH','QUERY') x as h,p,q;
OK
training.com /bigdata/hadoop id=101&name=Giri&age=23&city=hyd
training.com /bigdata/spark id=101&name=Diri&sex=f&city=del
training.com /bigdata/spark id=102&name=Xiri&age=33&sex=f
training.com /bigdata/spark id=104&name=Miri&age=23&sex=m
Time taken: 0.048 seconds, Fetched: 4 row(s)


hive> create table raw2(host string, path string, query string);

hive> insert into table raw2 select x.* from raw lateral view parse_url_tuple(line,'HOST','PATH','QUERY') x as h,p,q;


hive> select * from raw2;
OK
training.com /bigdata/hadoop id=101&name=Giri&age=23&city=hyd
training.com /bigdata/spark id=101&name=Diri&sex=f&city=del
training.com /bigdata/spark id=102&name=Xiri&age=33&sex=f
training.com /bigdata/spark id=104&name=Miri&age=23&sex=m
Time taken: 0.029 seconds, Fetched: 4 row(s)

hive> select host,split(path,'/'),str_to_map(query,'&','=') from raw2;
OK
training.com ["","bigdata","hadoop"] {"id":"101","name":"Giri","age":"23","city":"hyd"}
training.com ["","bigdata","spark"] {"id":"101","name":"Diri","sex":"f","city":"del"}
training.com ["","bigdata","spark"] {"id":"102","name":"Xiri","age":"33","sex":"f"}
training.com ["","bigdata","spark"] {"id":"104","name":"Miri","age":"23","sex":"m"}
Time taken: 0.031 seconds, Fetched: 4 row(s)


hive> create table raw3(host string, path array<string>, qmap map<string,string>);


hive> insert into table raw3 select host,split(path,'/'),str_to_map(query,'&','=') from raw2;


hive> select * from raw3;
OK
training.com ["","bigdata","hadoop"] {"id":"101","name":"Giri","age":"23","city":"hyd"}
training.com ["","bigdata","spark"] {"id":"101","name":"Diri","sex":"f","city":"del"}
training.com ["","bigdata","spark"] {"id":"102","name":"Xiri","age":"33","sex":"f"}
training.com ["","bigdata","spark"] {"id":"104","name":"Miri","age":"23","sex":"m"}
Time taken: 0.029 seconds, Fetched: 4 row(s)



hive> create table info (host string, category string, course string, id int, name string, age int, sex string, city string);
OK
Time taken: 0.05 seconds


hive> describe info;
OK
host                 string                                
category             string                                
course               string                                
id                   int                                   
name                 string                                
age                  int                                   
sex                  string                                
city                 string                                
Time taken: 0.049 seconds, Fetched: 8 row(s)



hive> insert into table info select host, path[1],path[2],qmap['id'],qmap['name'],qmap['age'],qmap['sex'],qmap['city'] from raw3;



hive> select * from info;
OK
training.com bigdata hadoop 101 Giri 23 NULL hyd
training.com bigdata spark 101 Diri NULL f del
training.com bigdata spark 102 Xiri 33 f NULL
training.com bigdata spark 104 Miri 23 m NULL
Time taken: 0.032 seconds, Fetched: 4 row(s)



hive> select sex,count(*) as cnt from info group by sex order by cnt desc ;

f 2
m 1
NULL 1

hive> create database mydb;
OK
Time taken: 0.068 seconds
hive> use mydb;
OK
Time taken: 0.007 seconds
hive> create table mytab(a int, b int, c int);
OK
Time taken: 0.026 seconds
hive> describe mytab;
OK
a                    int                                   
b                    int                                   
c                    int                                   
Time taken: 0.073 seconds, Fetched: 3 row(s)
hive> describe formatted mytab;
OK
# col_name             data_type            comment           
   
a                    int                                   
b                    int                                   
c                    int                                   
   
# Detailed Table Information   
Database:            mydb                 
Owner:               cloudera             
CreateTime:          Tue Nov 20 21:57:37 PST 2018 
LastAccessTime:      UNKNOWN              
Protect Mode:        None                 
Retention:           0                    
Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/mydb.db/mytab 
Table Type:          MANAGED_TABLE        
Table Parameters:   
 transient_lastDdlTime 1542779857       
   
# Storage Information   
SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe 
InputFormat:         org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat 
Compressed:          No                   
Num Buckets:         -1                   
Bucket Columns:      []                   
Sort Columns:        []                   
Storage Desc Params:   
 serialization.format 1                 
Time taken: 0.133 seconds, Fetched: 28 row(s)



hive> describe extended mytab;
OK
a                    int                                   
b                    int                                   
c                    int                                   
   
Detailed Table Information Table(tableName:mytab, dbName:mydb, owner:cloudera, createTime:1542779857, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:a, type:int, comment:null), FieldSchema(name:b, type:int, comment:null), FieldSchema(name:c, type:int, comment:null)], location:hdfs://quickstart.cloudera:8020/user/hive/warehouse/mydb.db/mytab, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1542779857}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE) 
Time taken: 0.366 seconds, Fetched: 5 row(s)


hdfs dfs -ls /user/hive/warehouse/mydb.db/
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2018-11-20 21:57 /user/hive/warehouse/mydb.db/mytab


gedit file1.txt:
---------------
1,2,3
100,200,300
10,20,30
1000,2000,300
111,222,333
1111,2222,3333
1,2,3
11,22,33


hive> load data local inpath 'file1.txt' into table mytab;
// by default a hive table expects '\001' (Ctrl+A) as the field delimiter; if we load comma-separated data, the result will be:
hive> select * from mytab;
OK
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
Time taken: 0.416 seconds, Fetched: 8 row(s)


hive> drop table mytab;


hive> create table mytab(a int, b int, c int) row format delimited fields terminated by ',';
OK
Time taken: 0.157 seconds


hive> load data local inpath 'file1.txt' into table mytab;
Loading data to table mydb.mytab
Table mydb.mytab stats: [numFiles=1, totalSize=83]
OK
Time taken: 0.138 seconds


hive> select * from mytab;
OK
1 2 3
100 200 300
10 20 30
1000 2000 300
111 222 333
1111 2222 3333
1 2 3
11 22 33
Time taken: 0.059 seconds, Fetched: 8 row(s)


cat > file2.txt
2,3,4
20,30,40
200,300,400
2000,3000,4000


hive> use mydb;
OK
Time taken: 0.076 seconds


hive> load data local inpath 'file2.txt' into table mytab;
Loading data to table mydb.mytab
Table mydb.mytab stats: [numFiles=2, totalSize=125]
OK
Time taken: 0.556 seconds



hive> select * from mytab;
OK
1 2 3
100 200 300
10 20 30
1000 2000 300
111 222 333
1111 2222 3333
1 2 3
11 22 33
2 3 4
20 30 40
200 300 400
2000 3000 4000
Time taken: 0.498 seconds, Fetched: 12 row(s)


// homogeneous: all elements of the collection have the same type
[cloudera@quickstart ~]$ cat > file4.txt
Ravi,25,BTech#MTech
Rani,26,Msc#MBA#PhD


hive> create table tabx(name string, age int, qual array<string>) row format delimited fields terminated by ',' collection items terminated by '#';
OK
Time taken: 0.089 seconds



hive> load data local inpath 'file4.txt' into table tabx;
Loading data to table mydb.tabx
Table mydb.tabx stats: [numFiles=1, totalSize=40]
OK
Time taken: 0.161 seconds


hive> select * from tabx;
OK
Ravi 25 ["BTech","MTech"]
Rani 26 ["Msc","MBA","PhD"]
Time taken: 0.068 seconds, Fetched: 2 row(s)
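
Array elements can be addressed by position; a minimal sketch against the tabx table above (qual[0] returns the first qualification, size() the element count):
hive> select name, qual[0], size(qual) from tabx;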

// heterogeneous: the purpose of each element is different
 cat > file5.txt
Rani,26,BTech#Nagarjuna University#2012#72
Ravi,29,Bsc#Osmania University#2011#69

hive> create table taby(name string, age int, qual struct<qual:string, university:string, year:int, percentage:int>) row format delimited fields terminated by ',' collection items terminated by '#';
OK
Time taken: 0.033 seconds



hive> load data local inpath 'file5.txt' into table taby;
Loading data to table mydb.taby
Table mydb.taby stats: [numFiles=1, totalSize=82]
OK
Time taken: 0.113 seconds


hive> select * from taby;
OK
Rani 26 {"qual":"BTech","university":"Nagarjuna University","year":2012,"percentage":72}
Ravi 29 {"qual":"Bsc","university":"Osmania University","year":2011,"percentage":69}
Time taken: 0.049 seconds, Fetched: 2 row(s)


hive> select name,age,qual.qual,qual.university, qual.year,qual.percentage from taby;
OK
Rani 26 BTech Nagarjuna University 2012 72
Ravi 29 Bsc Osmania University 2011 69
Time taken: 0.055 seconds, Fetched: 2 row(s)



cat file6.txt
Ravi,BTech$90#MTech$70,35
Mani,Bsc$80#Msc$90#MBA$60,25


hive> create table tabz(name string, qual map<string,int>, age int) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated by '$';
OK
Time taken: 0.034 seconds


hive> load data local inpath 'file6.txt' into table tabz;
Loading data to table mydb.tabz
Table mydb.tabz stats: [numFiles=1, totalSize=55]
OK
Time taken: 0.121 seconds



hive> select * from tabz;
OK
Ravi {"BTech":90,"MTech":70} 35
Mani {"Bsc":80,"Msc":90,"MBA":60} 25
Time taken: 0.033 seconds, Fetched: 2 row(s)
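
Individual map values can be looked up by key; a minimal sketch against tabz above (a missing key simply returns NULL):
hive> select name, qual['BTech'], map_keys(qual), map_values(qual) from tabz;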


// STORED AS takes a bare file-format keyword (no quotes):
create table tabu(a int, b int, c int) stored as textfile;
create table tabu(a int, b int, c int) stored as sequencefile;
create table tabu(a int, b int, c int) stored as rcfile;
create table tabu(a int, b int, c int) stored as orc;
create table tabu(a int, b int, c int) stored as parquet;
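
A columnar table is usually populated by inserting from an existing text table; a minimal sketch, assuming the mytab table above (mytab_orc is just an illustrative name):
hive> create table mytab_orc(a int, b int, c int) stored as orc;
hive> insert overwrite table mytab_orc select a, b, c from mytab;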

hive> create index idx1 on table emp(id) as 'compact' with deferred rebuild;
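
With deferred rebuild, the index holds no data until it is rebuilt; a minimal sketch, assuming the emp table and idx1 above:
hive> alter index idx1 on emp rebuild;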


hive> create view myview1 as select name,age,income from info1;

hive> create view myview2 as select * from info1 where age > 25;

hive> create table tab1 (a int, b int, c int);
hive> insert into tab1 (a,b,c) values(1,2,3);



hive> select * from tab1;
OK
1 2 3



hive> alter table tab1 add columns (d int, e int);
OK
Time taken: 0.052 seconds
hive> select * from tab1;
OK
1 2 3 NULL NULL
Time taken: 0.043 seconds, Fetched: 1 row(s)



hive> insert overwrite table tab1 select a,b,c,a+b+c, (a+b+c)/3 from tab1;


hive> select * from tab1;
OK
1 2 3 6 2
Time taken: 0.035 seconds, Fetched: 1 row(s)


Drop statements:
----------------
 drop table mytab;
 drop index idx1;
 drop view myview1;
 drop database mydb;
hive> drop database mydb;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database mydb is not empty. One or more tables exist.)


// Before dropping a database, delete all its contents...
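
Alternatively, CASCADE drops the database together with all the tables it contains (use with care); a minimal sketch:
hive> drop database mydb cascade;
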
cat info.txt
-------------
101,aaa,30000,m,11
102,bbb,40000,m,12
103,ccc,90000,f,13
104,ddd,10000,m,12
105,eee,20000,f,14
106,fff,20000,m,14

hive> create table emp(id int, name string, sal int, sex string, dno int) row format delimited fields terminated by ',';

hive> load data local inpath 'info.txt' into table emp;


hive> select * from emp;
OK
101 aaa 30000 m 11
102 bbb 40000 m 12
103 ccc 90000 f 13
104 ddd 10000 m 12
105 eee 20000 f 14
106 fff 20000 m 14
Time taken: 0.046 seconds, Fetched: 6 row(s)



hive> select * from emp;
OK
101 aaa 30000 m 11
102 bbb 40000 m 12
103 ccc 90000 f 13
104 ddd 10000 m 12
105 eee 20000 f 14
106 fff 20000 m 14
Time taken: 0.033 seconds, Fetched: 6 row(s)



hive> select name,sal from emp;
OK
aaa 30000
bbb 40000
ccc 90000
ddd 10000
eee 20000
fff 20000
Time taken: 0.043 seconds, Fetched: 6 row(s)


hive> select * from emp where sex='m';
OK
101 aaa 30000 m 11
102 bbb 40000 m 12
104 ddd 10000 m 12
106 fff 20000 m 14
Time taken: 0.065 seconds, Fetched: 4 row(s)


hive> select * from emp where dno=11 or dno=12;
OK
101 aaa 30000 m 11
102 bbb 40000 m 12
104 ddd 10000 m 12
Time taken: 0.045 seconds, Fetched: 3 row(s)


hive> select * from emp where dno in (11,12);
OK
101 aaa 30000 m 11
102 bbb 40000 m 12
104 ddd 10000 m 12
Time taken: 0.04 seconds, Fetched: 3 row(s)



hive> select * from emp where dno != 11 and dno!= 13;
OK
102 bbb 40000 m 12
104 ddd 10000 m 12
105 eee 20000 f 14
106 fff 20000 m 14
Time taken: 0.075 seconds, Fetched: 4 row(s)



hive> select * from emp where dno not in (11,13);
OK
102 bbb 40000 m 12
104 ddd 10000 m 12
105 eee 20000 f 14
106 fff 20000 m 14
Time taken: 0.05 seconds, Fetched: 4 row(s)


Aggregation functions:
----------------------
 sum(), avg(), max(), min(), count(), corr(), stddev(), covar_pop()
 udf  - user-defined function (operates row by row)
 udaf - user-defined aggregate function (many rows -> one value)
 udtf - user-defined table-generating function (one row -> many rows; see the explode() sketch below)
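
explode() is a built-in UDTF; a minimal sketch that flattens the qual array of the tabx table created earlier (qtab and q are just illustrative aliases):
hive> select name, q from tabx lateral view explode(qual) qtab as q;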


hive> select sum(sal) from emp;
210000


hive> select sum(sal), avg(sal), max(sal), min(sal), count(*) from emp;

210000 35000.0 90000 10000 6


Group By:
 to get aggregations separately for each data group.
 single group by sex:
  f   ?
  m  ?
  
 multi group by dno, sex:
  11 f  ?
  11  m  ?
  12  f  ?
  12 m  ?
hive> select sex,sum(sal) from emp group by sex;
f 110000
m 100000


hive> select sum(sal), avg(sal), max(sal), min(sal), count(*) from emp group by sex;
110000 55000.0 90000 20000 2
100000 25000.0 40000 10000 4


hive> select dno,sex, sum(sal),avg(sal), count(*) from emp group by dno,sex;
11 m 30000 30000.0 1
12 m 50000 25000.0 2
13 f 90000 90000.0 1
14 f 20000 20000.0 1
14 m 20000 20000.0 1



hive> select dno, sum(sal) from emp where dno in (11,12,13) group by dno;
11 30000
12 50000
13 90000


hive> select dno, sum(sal) from emp  group by dno having dno in (11,12,13);
11 30000
12 50000
13 90000


// Use HAVING when the condition refers to the grouping column or an aggregate result.
// Conditions on ordinary row values belong in WHERE, which filters rows before grouping,
// so less data reaches the reducers.


In a HAVING clause, only grouping columns and aggregate functions are valid.

Invalid query:
--------------
 select dno, sum(sal) from info group by dno having city in ('Hyd','Del','Pune')
 // It's invalid because city is neither a grouping column nor an aggregate.
Valid Query:
------------
 select dno, sum(sal) from info where city in ('Hyd','Del','Pune') group by dno;

WHERE and HAVING can also be combined: filter rows first, then filter groups:

select dno, sum(sal) from info
 where city in ('Hyd','Del','Pune')
 group by dno having dno in (11,12,13);

cat atm.txt
 101,1000
 102,2000
 101,2000
 101,3000
 102,8000
 103,1000
 104,9000


hive> create table atm(acno int, amt int) row format delimited fields terminated by ',';
OK
Time taken: 0.129 seconds


hive> load data local inpath 'atm.txt' into table atm;
Loading data to table mydb.atm
Table mydb.atm stats: [numFiles=1, totalSize=63]
OK
Time taken: 0.162 seconds


hive> select * from atm;
OK
101 1000
102 2000
101 2000
101 3000
102 8000
103 1000
104 9000
Time taken: 0.026 seconds, Fetched: 7 row(s)


// get acnos, who did more than 1 transaction.
hive> select acno,count(*) from atm group by acno having count(*) > 1;
101 3
102 2


hive> select acno from atm group by acno having count(*) > 1;
101
102


hive> select acno, sum(amt) from atm group by acno;
101 6000
102 10000
103 1000
104 9000

hive> select acno, sum(amt) as tot from atm group by acno having tot > 5000;
101 6000
102 10000
104 9000

Eliminating Duplicate Rows:
---------------------------
cat samp.txt;
Ravi,30
Mani,20
Ravi,30
Ravi,30
Mani,20
Giri,50
Giri,50
Giri,50

hive> create table samp(name string, age int) row format delimited fields terminated by ',';
OK
Time taken: 0.04 seconds


hive> load data local inpath 'samp.txt' into table samp;
Loading data to table mydb.samp
Table mydb.samp stats: [numFiles=1, totalSize=64]
OK
Time taken: 0.177 seconds


hive> select * from samp;
OK
Ravi 30
Mani 20
Ravi 30
Ravi 30
Mani 20
Giri 50
Giri 50
Giri 50
Time taken: 0.026 seconds, Fetched: 8 row(s)


hive> select distinct(name), age from samp;
Giri 50
Mani 20
Ravi 30


hive> load data local inpath 'samp2.txt' overwrite into table samp;


hive> select name,age,count(*) from samp group by name,age;
Giri 50 3
Mani 20 2
Ravi 30 3



hive> select name,age from samp group by name, age;
Giri 50
Mani 20
Ravi 30
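
To remove the duplicates permanently, the grouped result can be written back over the same table (a sketch; the same overwrite-from-self pattern is used again later in these notes):
hive> insert overwrite table samp select name, age from samp group by name, age;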


cat tab1.txt
101,1000
102,null
103,2000
104,null


hive> create table tab1 (id int, sal int) row format delimited fields terminated by ',';
OK
Time taken: 0.025 seconds


hive> load data local inpath 'tab1.txt' into table tab1;
Loading data to table mydb.tab1
Table mydb.tab1 stats: [numFiles=1, totalSize=36]
OK
Time taken: 0.118 seconds


hive> select * from tab1;
OK
101 1000
102 NULL
103 2000
104 NULL
Time taken: 0.052 seconds, Fetched: 4 row(s)

// including null values
hive> select count(*) from tab1;
4

// excluding null values
hive> select count(sal) from tab1;
2


hive> select count(*) from tab1 where sal is null;
2
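
To substitute a default value for the NULLs, nvl() or coalesce() can be used; a minimal sketch against tab1 above:
hive> select id, nvl(sal, 0) from tab1;
hive> select id, coalesce(sal, 0) from tab1;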


Hive Unions:
------------
To merge datasets (tables)

In SQL,
 2 types of Unions
 union - Doesn't allow duplicates
 union all - allows duplicate rows.
In HQL,
 we have only "union all"
emp1
emp2
emp3

case#1:
 schema of both tables should be same.
create table newTab like emp1;

insert into [overwrite] table newTab
  select * from emp1
   union all
  select * from emp2
   union all
  select * from emp3
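
Since HQL here offers only union all, distinct UNION semantics can be emulated by deduplicating the merged result (a sketch, reusing emp1 and emp2 above):
insert overwrite table newTab
  select distinct * from
   (select * from emp1
    union all
    select * from emp2) u;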
  

case#2:
 schema of tables are different
tab1 -> name,city
tab2 -> city,name

 select name,city from tab1
 union all
 select name,city from tab2
// the column order must be the same in every SELECT of the union


case#3:
 tables have different fields:
  a field present in Table#1 may be missing in Table#2,
  and vice versa.
   
tabX -> name,city,sex (age is missing here)
tabY -> name,city,age (sex is missing here)

create table tabZ (name string, age int, sex string, city string);

insert into table tabZ
 select name,null as age, sex, city from tabX
 union all
 select name,age,null as sex, city from tabY

case#4:
 file1 -> 101,aaaa,10000  (comma separated)
 ....
 file2 -> 201,bbbb,30000 (comma separated)
 ....
 file3 -> 301 40000 aaaaa (tab separated)
for file#1, file#2 -> structure is same (comma separated)
for file#3 -> structure is different (tab separated)
 structure means field names and order of them.

way#1:
 create table tab1(id int, name string, sal int) row format delimited fields terminated by ',';
 create table tab2(id int, sal int, name string) row format delimited fields terminated by '\t';
 load data local inpath 'file1.txt' into table tab1;
 load data local inpath 'file2.txt' into table tab1;
 load data local inpath 'file3.txt' into table tab2;
tab1 -> id,name,sal
tab2 -> id,sal,name

same column with different column order


create table info like tab1;
 insert into table info
  select id,name,sal from tab1
  union all
  select id,name,sal from tab2
  
  
file1.txt,file2.txt content pushed into tab1
file3.txt content pushed into tab2

finally we merged tab1 and tab2 together to make info.


tab1 --> 1 Lakh rows
tab2 --> 10000 rows


Join in Hive:
-------------

Inner joins
Outer Joins

Left outer
Right outer
Full Outer

Inner join:
 select l.x, r.x from tab1 l join tab2 r on (l.x = r.x);

Left outer:
 select l.x, r.x from tab1 l left outer join tab2 r on (l.x = r.x);

Right outer:
 select l.x, r.x from tab1 l right outer join tab2 r on (l.x = r.x);

Full outer:
 select l.x, r.x from tab1 l full outer join tab2 r on (l.x = r.x); 


cat emp
 101,aaaa,40000,m,11
 102,bbbb,50000,f,12
 103,cccc,90000,m,12
 104,dddd,10000,f,13
 105,eeee,20000,m,11
 106,ffff,20000,m,14

cat dept
 11,marketing,hyd
 12,hr,del
 13,finance,hyd
 20,admin,del
 21,production,hyd
Joins generally perform better than sub-queries,
but with very large tables (say 1 crore / 10 million rows) a plain join becomes expensive.


hive> create database joins;
OK
Time taken: 0.483 seconds


hive> use joins;
OK
Time taken: 0.026 seconds


hive> create table emp(id int, name string, sal int, sex string, dno int) row format delimited fields terminated by ',';


hive> load data local inpath 'emp' into table emp;


hive> select * from emp;
OK
101 aaaa 40000 m 11
102 bbbb 50000 f 12
103 cccc 90000 m 12
104 dddd 10000 f 13
105 eeee 20000 m 11
106 ffff 20000 m 14
107 mmmm 60000 m 15
109 nnnn 50000 f 11
Time taken: 0.435 seconds, Fetched: 8 row(s)


hive> create table dept (dno int, dname string, dloc string) row format delimited fields terminated by ',';
OK
Time taken: 0.1 seconds


hive> load data local inpath 'dept' into table dept;
Loading data to table joins.dept
Table joins.dept stats: [numFiles=1, totalSize=73]
OK
Time taken: 0.159 seconds


hive> select * from dept;
OK
11 marketing hyd
12 hr del
13 finance hyd
20 admin del
21 production hyd
Time taken: 0.059 seconds, Fetched: 5 row(s)


hive> select * from emp;
OK
101 aaaa 40000 m 11
102 bbbb 50000 f 12
103 cccc 90000 m 12
104 dddd 10000 f 13
105 eeee 20000 m 11
106 ffff 20000 m 14
107 mmmm 60000 m 15
109 nnnn 50000 f 11
Time taken: 0.091 seconds, Fetched: 8 row(s)


hive> select * from dept;
OK
11 marketing hyd
12 hr del
13 finance hyd
20 admin del
21 production hyd
Time taken: 0.039 seconds, Fetched: 5 row(s)



hive> create table edinfo(id int, name string, sal int, sex string, dname string, dloc string, dno1 int, dno2 int) row format delimited fields terminated by ',';
OK
Time taken: 0.063 seconds


hive> insert overwrite table edinfo select id,name,sal,sex,dname,dloc,l.dno,r.dno from emp l full outer join dept r on (l.dno = r.dno);


hive> select * from edinfo;
OK
109 nnnn 50000 f marketing hyd 11 11
105 eeee 20000 m marketing hyd 11 11
101 aaaa 40000 m marketing hyd 11 11
103 cccc 90000 m hr del 12 12
102 bbbb 50000 f hr del 12 12
104 dddd 10000 f finance hyd 13 13
106 ffff 20000 m NULL NULL 14 NULL
107 mmmm 60000 m NULL NULL 15 NULL
NULL NULL NULL NULL admin del NULL 20
NULL NULL NULL NULL production hyd NULL 21



hdfs dfs -cat /user/hive/warehouse/joins.db/edinfo/000000_0
109,nnnn,50000,f,marketing,hyd,11,11
105,eeee,20000,m,marketing,hyd,11,11
101,aaaa,40000,m,marketing,hyd,11,11
103,cccc,90000,m,hr,del,12,12
102,bbbb,50000,f,hr,del,12,12
104,dddd,10000,f,finance,hyd,13,13
106,ffff,20000,m,\N,\N,14,\N
107,mmmm,60000,m,\N,\N,15,\N
\N,\N,\N,\N,admin,del,\N,20
\N,\N,\N,\N,production,hyd,\N,21


hive> select dloc, sum(sal) from edinfo group by dloc;
NULL 80000
del 140000
hyd 120000


[cloudera@quickstart ~]$ cat emp
101,aaaa,40000,m,11
102,bbbb,50000,f,12
103,cccc,90000,m,12
104,dddd,10000,f,13
105,eeee,20000,m,11
106,ffff,20000,m,14
107,mmmm,60000,m,15
109,nnnn,50000,f,11



[cloudera@quickstart ~]$ cat dept
11,marketing,hyd
12,hr,del
13,finance,hyd
20,admin,del
21,production,hyd


hive> select * from emp;
OK
101 aaaa 40000 m 11
102 bbbb 50000 f 12
103 cccc 90000 m 12
104 dddd 10000 f 13
105 eeee 20000 m 11
106 ffff 20000 m 14
107 mmmm 60000 m 15
109 nnnn 50000 f 11
Time taken: 0.041 seconds, Fetched: 8 row(s)



hive> select * from dept;
OK
11 marketing hyd
12 hr del
13 finance hyd
20 admin del
21 production hyd
Time taken: 0.036 seconds, Fetched: 5 row(s)


hive> create table swengineers (id int, name string, sal int, sex string, pid int) row format delimited fields terminated by ',';
OK
Time taken: 0.155 seconds



hive> load data local inpath 'emp' into table swengineers;
Loading data to table joins.swengineers
Table joins.swengineers stats: [numFiles=1, totalSize=160]
OK
Time taken: 0.103 seconds


hive> select * from swengineers;
OK
101 aaaa 40000 m 11
102 bbbb 50000 f 12
103 cccc 90000 m 12
104 dddd 10000 f 13
105 eeee 20000 m 11
106 ffff 20000 m 14
107 mmmm 60000 m 15
109 nnnn 50000 f 11
Time taken: 0.028 seconds, Fetched: 8 row(s)


hive> create table projects (pid int, pname string, loc string) row format delimited fields terminated by ',';
OK
Time taken: 0.032 seconds


hive> load data local inpath 'dept' into table projects;
Loading data to table joins.projects
Table joins.projects stats: [numFiles=1, totalSize=73]
OK
Time taken: 0.11 seconds


hive> select * from projects;
OK
11 marketing hyd
12 hr del
13 finance hyd
20 admin del
21 production hyd
Time taken: 0.03 seconds, Fetched: 5 row(s)


hive> select * from projects;
OK
11 marketing hyd
12 hr del
13 finance hyd
20 admin del
21 production hyd
Time taken: 0.03 seconds, Fetched: 5 row(s)


hive> select * from swengineers;
OK
101 aaaa 40000 m 11
102 bbbb 50000 f 12
103 cccc 90000 m 12
104 dddd 10000 f 13
105 eeee 20000 m 11
106 ffff 20000 m 14
107 mmmm 60000 m 15
109 nnnn 50000 f 11
Time taken: 0.032 seconds, Fetched: 8 row(s)


hive> select * from projects;
OK
11 marketing hyd
12 hr del
13 finance hyd
20 admin del
21 production hyd
Time taken: 0.077 seconds, Fetched: 5 row(s)


11,12,13 - engineers are working on these projects
14,15 - engineers recruited but not yet assigned to projects
20,21 - projects started but no engineers recruited

 working team -- inner join
  11,12,13
 bench team
  14,15
 bench project
  20,21

full outer join - i don't want to miss any info

hive> create table prengineers (pid1 int, pid2 int, sal int);

hive> insert overwrite table prengineers select l.pid, r.pid, sal from swengineers l full outer join projects r on (l.pid = r.pid);


hive> select * from prengineers;
OK
11 11 50000
11 11 20000
11 11 40000
12 12 90000
12 12 50000
13 13 10000
14 NULL 20000
15 NULL 60000
NULL 20 NULL
NULL 21 NULL
Time taken: 0.041 seconds, Fetched: 10 row(s)

first 5 records are working engineers rows
next 2 bench team
next 2 bench project


hive> create table transformed (stat string, sal int);


hive> insert overwrite table transformed select if(pid2 is null, 'BenchTeam', if(pid1 is null,'BenchProject','Working')), sal from prengineers;



hive> select * from transformed;
OK
Working 50000
Working 20000
Working 40000
Working 90000
Working 50000
Working 10000
BenchTeam 20000
BenchTeam 60000
BenchProject NULL
BenchProject NULL



hive> insert overwrite table transformed select stat, if(sal is null, 0,sal) from transformed;


hive> select * from transformed;
OK
Working 50000
Working 20000
Working 40000
Working 90000
Working 50000
Working 10000
BenchTeam 20000
BenchTeam 60000
BenchProject 0
BenchProject 0
Time taken: 0.03 seconds, Fetched: 10 row(s)


hive> create table prsummary(stat string, tot int) row format delimited fields terminated by ',';



hive> insert overwrite table prsummary select stat, sum(sal) from transformed group by stat;

hive> select * from prsummary;
OK
BenchProject 0
BenchTeam 80000
Working 260000
Time taken: 0.03 seconds, Fetched: 3 row(s)


hive> select * from swengineers;
OK
101 aaaa 40000 m 11
102 bbbb 50000 f 12
103 cccc 90000 m 12
104 dddd 10000 f 13
105 eeee 20000 m 11
106 ffff 20000 m 14
107 mmmm 60000 m 15
109 nnnn 50000 f 11
Time taken: 0.042 seconds, Fetched: 8 row(s)


hive> select * from projects;
OK
11 marketing hyd
12 hr del
13 finance hyd
20 admin del
21 production hyd
Time taken: 0.029 seconds, Fetched: 5 row(s)

hive> select * from prengineers;
OK
11 11 50000
11 11 20000
11 11 40000
12 12 90000
12 12 50000
13 13 10000
14 NULL 20000
15 NULL 60000
NULL 20 NULL
NULL 21 NULL
Time taken: 0.031 seconds, Fetched: 10 row(s)


hive> select * from transformed;
OK
Working 50000
Working 20000
Working 40000
Working 90000
Working 50000
Working 10000
BenchTeam 20000
BenchTeam 60000
BenchProject 0
BenchProject 0
Time taken: 0.033 seconds, Fetched: 10 row(s)


hive> select * from prsummary;
OK
BenchProject 0
BenchTeam 80000
Working 260000
Time taken: 0.042 seconds, Fetched: 3 row(s)


we don't want to miss anything - so we used full outer join

OLTP - full outer join is not recommended
Batch Processing, analytics - full outer is recommended
Cartesian product is not recommended in OLTP
Cartesian product is widely used in Batch Processing.


-----------------------------------------------------------------

Hive

To logon to mysql via Command Line Interface:
 mysql -u root -pcloudera

 To List Databases:

 mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cm                 |
| firehose           |
| hue                |
| metastore          |
| mysql              |
| nav                |
| navms              |
| oozie              |
| retail_db          |
| rman               |
| sentry             |
+--------------------+
12 rows in set (0.11 sec)



Inside the Hive.

How to create a Database
 hive> create database sample_programs;
 OK
 Time taken: 3.196 seconds

To list databases within Hive:
 hive> show databases;
 OK
 default
 sample_programs
Hive databases and tables are stored in the following hdfs location:
 hdfs dfs -ls /user/hive/warehouse
  Found 1 items
  drwxrwxrwx   - cloudera supergroup          0 2018-07-25 03:27 /user/hive/warehouse/sample_programs.db
  
open current DB:
use sample_programs;

To create a table in Hive within sample_programs db:
create table employees (std int, fname string, lname string, age int, mob string, city string) row format delimited fields terminated by ',';


To copy employees.txt file into /user/hive:
 hdfs dfs -put /home/cloudera/employees.txt /user/hive/employees.txt

To view the content of employees.txt in command line interface:
 hdfs dfs -cat /user/hive/employees.txt
  001,Rajiv,Reddy,21,988383,Hyderabad
  002,Siddharth,Malhothra,22,9932323,Kolkatta
  003,Rajesh,Rakul,22,999333,Kolkatta
  004,Preeti,Singh,21,9234243,Hyderabad
  005,Trupthi,Mohanthy,23,988833,Bhuwaneshwar
  006,Archana,Mishra,24,988323,Chennai
  007,Komal,Nayak,23,98888,Chennai
  008,Bharathi,Nambiar,22,9888888,Hyderabad
  009,Bharathi,Nambiar,21,9888888,Chennai
  005,Trupthi,Mohanthy,21,988833,Bhuwaneshwar

To load data into existing hive table:
 hive> load data inpath '/user/hive/employees.txt' into table employees;
   Loading data to table sample_programs.employees
   Table sample_programs.employees stats: [numFiles=1, totalSize=394]
   OK

To display the content of a table
 hive> select * from employees;
 OK
  1 Rajiv Reddy 21 988383 Hyderabad
  2 Siddharth Malhothra 22 9932323 Kolkatta
  3 Rajesh Rakul 22 999333 Kolkatta
  4 Preeti Singh 21 9234243 Hyderabad
  5 Trupthi Mohanthy 23 988833 Bhuwaneshwar
  6 Archana Mishra 24 988323 Chennai
  7 Komal Nayak 23 98888 Chennai
  8 Bharathi Nambiar 22 9888888 Hyderabad
  9 Bharathi Nambiar 21 9888888 Chennai
  5 Trupthi Mohanthy 21 988833 Bhuwaneshwar
  Time taken: 0.603 seconds, Fetched: 10 row(s)

Now check /user/hive/warehouse:

hdfs dfs -ls /user/hive/warehouse/sample_programs.db/employees
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        394 2018-07-25 03:39 /user/hive/warehouse/sample_programs.db/employees/employees.txt


hdfs dfs -cat /user/hive/warehouse/sample_programs.db/employees/employees.txt
 001,Rajiv,Reddy,21,988383,Hyderabad
 002,Siddharth,Malhothra,22,9932323,Kolkatta
 003,Rajesh,Rakul,22,999333,Kolkatta
 004,Preeti,Singh,21,9234243,Hyderabad
 005,Trupthi,Mohanthy,23,988833,Bhuwaneshwar
 006,Archana,Mishra,24,988323,Chennai
 007,Komal,Nayak,23,98888,Chennai
 008,Bharathi,Nambiar,22,9888888,Hyderabad
 009,Bharathi,Nambiar,21,9888888,Chennai
 005,Trupthi,Mohanthy,21,988833,Bhuwaneshwar
Copy women.txt from local linux to hdfs:
 hdfs dfs -put /home/cloudera/women.txt /user/hive/women.txt

Create a folder in hive named as women_employees:
 hdfs dfs -mkdir /user/hive/women_employees
Move a file between hdfs locations:
 hdfs dfs -mv /user/hive/women.txt /user/hive/women_employees/women.txt

To view the content of women.txt which is stored in hdfs
 hdfs dfs -cat /user/hive/women_employees/women.txt
  001,Nithya,Duraisamy,31,Manager,9586452156,Hyderabad
  002,Nandhini,Babu,28,Assistant Manager,Delhi
  003,Madhuri,Nathan,51,VP,999333323,Hyderabad
  004,Kavitha,Manoharan,45,AVP,992342344,Hyderabad
  005,Vijaya,Kandhasamy,45,AVP,23452344,Noida
  006,Aarthi,Raj,28,Assistant Manager,99234234,Chennai
  007,Lavanya,Sankar,23,Senior Engineer,9923444,Chennai
  008,Meena,Baskar,51,VP,2342344444,Hyderabad
  009,Gayathri,Raghu,22,Engineer,323423444,Chennai
  010,Thenmozhi,Rajan,45,AVP,234234234,Noida


Create external table:

 create external table women_employees(sid int, fname string, lname string, age int, mob string, city string) row format delimited fields terminated by ',' location  '/user/hive/women_employees';
// Here we need to specify hdfs folder name not a file name.

Change current database:
 hive> use sample_programs;
 OK
 Time taken: 0.02 seconds

Display tables in sample_programs database:
 hive> show tables;
 OK
 employees
 women_employees
 Time taken: 0.026 seconds, Fetched: 2 row(s)

Display the content of a table:
 hive> select * from women_employees;
  OK
  1 Nithya Duraisamy 31 Manager 9586452156
  2 Nandhini Babu 28 Assistant Manager Delhi
  3 Madhuri Nathan 51 VP 999333323
  4 Kavitha Manoharan 45 AVP 992342344
  5 Vijaya Kandhasamy 45 AVP 23452344
  6 Aarthi Raj 28 Assistant Manager 99234234
  7 Lavanya Sankar 23 Senior Engineer 9923444
  8 Meena Baskar 51 VP 2342344444
  9 Gayathri Raghu 22 Engineer 323423444
  10 Thenmozhi Rajan 45 AVP 234234234
  NULL NULL NULL NULL NULL NULL
  Time taken: 0.097 seconds, Fetched: 11 row(s)
  hive>


Load data into table:
 load data inpath '/user/hive/women_employees/women.txt' into table women_employees;
  Loading data to table sample_programs.women_employees
  Table sample_programs.women_employees stats: [numFiles=1, numRows=0, totalSize=480, rawDataSize=0]
  OK
  Time taken: 0.772 seconds

External table info is not there in warehouse location

 hdfs dfs -ls /user/hive/warehouse/sample_programs.db
 Found 1 items
 drwxrwxrwx   - cloudera supergroup          0 2018-07-25 03:42 /user/hive/warehouse/sample_programs.db/employees

 It has internal table (employees) information only
creating partition in hive table:
 create table emp(fname string, mob string) PARTITIONED BY (city string);

set the following property in hive:
 set hive.exec.dynamic.partition.mode=nonstrict;
Take rows from employees table and put them into emp PARTITIONED table.
  insert overwrite table emp PARTITION(city) SELECT fname,mob,city from employees;
  Loading partition {city=Hyderabad}
 Loading partition {city=Bhuwaneshwar}
 Loading partition {city=Chennai}
 Loading partition {city=Kolkatta}
  Time taken for adding to write entity : 2
 Partition sample_programs.emp{city=Bhuwaneshwar} stats: [numFiles=1, numRows=2, totalSize=30, rawDataSize=28]
 Partition sample_programs.emp{city=Chennai} stats: [numFiles=1, numRows=3, totalSize=44, rawDataSize=41]
 Partition sample_programs.emp{city=Hyderabad} stats: [numFiles=1, numRows=3, totalSize=45, rawDataSize=42]
 Partition sample_programs.emp{city=Kolkatta} stats: [numFiles=1, numRows=2, totalSize=32, rawDataSize=30]
 MapReduce Jobs Launched:

See the partition folders in hdfs:
  hdfs dfs -ls /user/hive/warehouse/sample_programs.db/emp/
  Found 4 items
  drwxrwxrwx   - cloudera supergroup          0 2018-07-25 04:44 /user/hive/warehouse/sample_programs.db/emp/city=Bhuwaneshwar
  drwxrwxrwx   - cloudera supergroup          0 2018-07-25 04:44 /user/hive/warehouse/sample_programs.db/emp/city=Chennai
  drwxrwxrwx   - cloudera supergroup          0 2018-07-25 04:44 /user/hive/warehouse/sample_programs.db/emp/city=Hyderabad
  drwxrwxrwx   - cloudera supergroup          0 2018-07-25 04:44 /user/hive/warehouse/sample_programs.db/emp/city=Kolkatta

hive> select * from emp;
OK
 Trupthi 988833 Bhuwaneshwar
 Trupthi 988833 Bhuwaneshwar
 Archana 988323 Chennai
 Komal 98888 Chennai
 Bharathi 9888888 Chennai
 Rajiv 988383 Hyderabad
 Preeti 9234243 Hyderabad
 Bharathi 9888888 Hyderabad
 Siddharth 9932323 Kolkatta
 Rajesh 999333 Kolkatta

Create a bucket in hive:
 create table ep(fname string,mob string, city string) CLUSTERED BY(city) into 3 buckets row format delimited fields terminated by ',';
Load data into ep:
 from emp insert overwrite table ep SELECT fname,mob,city;

Display the content from ep table:
 select * from ep;
 OK
  Trupthi 988833 Bhuwaneshwar
  Trupthi 988833 Bhuwaneshwar
  Archana 988323 Chennai
  Komal 98888 Chennai
  Bharathi 9888888 Chennai
  Rajiv 988383 Hyderabad
  Preeti 9234243 Hyderabad
  Bharathi 9888888 Hyderabad
  Siddharth 9932323 Kolkatta
  Rajesh 999333 Kolkatta
  Time taken: 0.07 seconds, Fetched: 10 row(s)
  
To view bucket files :
 hdfs dfs -ls /user/hive/warehouse/sample_programs.db/ep
 Found 3 items
 -rwxrwxrwx   1 cloudera supergroup        249 2018-07-25 04:54 /user/hive/warehouse/sample_programs.db/ep/000000_0
 -rwxrwxrwx   1 cloudera supergroup        249 2018-07-25 04:54 /user/hive/warehouse/sample_programs.db/ep/000001_0
 -rwxrwxrwx   1 cloudera supergroup        249 2018-07-25 04:54 /user/hive/warehouse/sample_programs.db/ep/000002_0
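
Note: on older Hive releases the insert honors the bucket definition only when bucketing enforcement is switched on beforehand (newer versions enforce it automatically); a minimal sketch:
 hive> set hive.enforce.bucketing = true;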

Create a view in hive:
 hive> create view age as select * from employees where age > 22;
OK
Time taken: 0.193 seconds

Query data against view:
 hive> select * from age;
 OK
 5 Trupthi Mohanthy 23 988833 Bhuwaneshwar
 6 Archana Mishra 24 988323 Chennai
 7 Komal Nayak 23 98888 Chennai


2 types of indexes in hive
 compact index, bit map index
Make bit map index:
 hive> create index mobile on table employees(mob) as 'bitmap' with deferred rebuild;

Make compact index:
 hive> create index city on table employees(city) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;
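
Both indexes above were created with deferred rebuild, so they stay empty until rebuilt; a minimal sketch:
 hive> alter index mobile on employees rebuild;
 hive> alter index city on employees rebuild;
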
To view list of indexes created earlier:
 hive>show formatted index on employees;

idx_name             tab_name             col_names            idx_tab_name         idx_type             comment           

mobile               employees            mob                  sample_programs__employees_mobile__ bitmap               
city                 employees            city                 sample_programs__employees_city__ compact 


How to drop an existing index:
 hive> drop index mobile on employees;

hive> show formatted index on employees;  // Here mobile index is not there.

idx_name             tab_name             col_names            idx_tab_name         idx_type             comment           
         
         
city                 employees            city                 sample_programs__employees_city__compact              
Time taken: 0.083 seconds, Fetched: 4 row(s)
hive>

Sort order:
 select * from employees order by age;
  5 Trupthi Mohanthy 21 988833 Bhuwaneshwar
  9 Bharathi Nambiar 21 9888888 Chennai
  4 Preeti Singh 21 9234243 Hyderabad
  1 Rajiv Reddy 21 988383 Hyderabad
  8 Bharathi Nambiar 22 9888888 Hyderabad
  3 Rajesh Rakul 22 999333 Kolkatta
  2 Siddharth Malhothra 22 9932323 Kolkatta
  7 Komal Nayak 23 98888 Chennai
  5 Trupthi Mohanthy 23 988833 Bhuwaneshwar
  6 Archana Mishra 24 988323 Chennai
  
  
Grouping operation:
 select age,count(*) from employees group by age;
  21 4
  22 3
  23 2
  24 1

Sort order:
select * from employees order by fname desc;
  5 Trupthi Mohanthy 21 988833 Bhuwaneshwar
  5 Trupthi Mohanthy 23 988833 Bhuwaneshwar
  2 Siddharth Malhothra 22 9932323 Kolkatta
  1 Rajiv Reddy 21 988383 Hyderabad
  3 Rajesh Rakul 22 999333 Kolkatta
  4 Preeti Singh 21 9234243 Hyderabad
  7 Komal Nayak 23 98888 Chennai
  9 Bharathi Nambiar 21 9888888 Chennai
  8 Bharathi Nambiar 22 9888888 Hyderabad
  6 Archana Mishra 24 988323 Chennai

hive> select * from employees sort by fname desc;
  5 Trupthi Mohanthy 21 988833 Bhuwaneshwar
  5 Trupthi Mohanthy 23 988833 Bhuwaneshwar
  2 Siddharth Malhothra 22 9932323 Kolkatta
  1 Rajiv Reddy 21 988383 Hyderabad
  3 Rajesh Rakul 22 999333 Kolkatta
  4 Preeti Singh 21 9234243 Hyderabad
  7 Komal Nayak 23 98888 Chennai
  9 Bharathi Nambiar 21 9888888 Chennai
  8 Bharathi Nambiar 22 9888888 Hyderabad
  6 Archana Mishra 24 988323 Chennai
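
order by sorts globally through a single reducer, while sort by only sorts within each reducer; it is usually paired with distribute by to keep related rows on the same reducer. A minimal sketch:
 hive> select * from employees distribute by city sort by fname desc;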

To view the structure of a table:
 describe employees;
  std                  int                                   
  fname                string                                
  lname                string                                
  age                  int                                   
  mob                  string                                
  city                 string   

  

Hive Aru
-----------

To logon to Hive:
 $ hive
To clear the screen:
 hive > !clear;
To display all databases:
 hive > show databases;
To create a database:
 hive > create database batch_may;
To create a table:
 hive > create table student (id int, name string, location string) row format delimited fields terminated by ',' stored as textFile;

To view the schema of a table:
 hive > desc formatted  student;

create a folder in hdfs named USECASE 
 $ hdfs dfs -mkdir /USECASE

create student.txt file in local:
  $ cat > student.txt
 001,Rajiv,Hyd
 002,Siddhu,Kol
 003,Rajesh,Del,
 004,Dinesh,Chen,
 005,Kumar,Mum
 006,Ganesh,Aur
 007,Ram,Blr
 ^C

copy student.txt file from local to hdfs

  $ hdfs dfs -put student.txt /USECASE/student.txt

Display the uploaded file content in hdfs:
 hdfs dfs -cat /USECASE/student.txt
 001,Rajiv,Hyd
 002,Siddhu,Kol
 003,Rajesh,Del,
 004,Dinesh,Chen,
 005,Kumar,Mum
 006,Ganesh,Aur
 007,Ram,Blr

To load data to student table of batch_may db of Hive:
 hive> load data inpath '/USECASE/student.txt' into table student;

Location of data text file in hdfs :
 /user/hive/warehouse/student 


RDBMS:
 Schema on Write
Hive:
 Schema on Read
 No update,Transactions, Indexes

HDFS environment setup:

 Making folder and subfolder and copy local linux file into hdfs:
 $ hdfs dfs -mkdir /USECASE
 $ hdfs dfs -mkdir /USECASE/student_ext
 $ hdfs dfs -put student.txt /USECASE/student_ext/

Upload file from local to hdfs :
  $ hdfs dfs -put student.txt /USECASE/student_ext.txt

Make External table in hive:
 create external table student_ext (id int, name string, location string) row format delimited fields terminated by ',' stored as textfile location '/USECASE/student_ext';

  select * from student_ext;
  OK
  1 Rajiv Hyd
  2 Siddhu Kol
  3 Rajesh Del
  4 Dinesh Chen
  5 Kumar Mum
  6 Ganesh Aur
  7 Ram Blr


  
Create one more file in local linux :
 cat > student1.txt
  008,Arivu,Mdr
  009,Raji,Pltr
  010,Selvi,Chn
  
Create one more file in local linux :
 cat > student2.txt
 011,Arun,Lnd
 012,Meena,Karai


Copy both files into hdfs from local linux;
 [cloudera@quickstart ~]$ hdfs dfs -put student1.txt /USECASE/student_ext/
 [cloudera@quickstart ~]$ hdfs dfs -put student2.txt /USECASE/student_ext/

Now run the same select query in hive again:
 select * from student_ext;
 OK
 1 Rajiv Hyd
 2 Siddhu Kol
 3 Rajesh Del
 4 Dinesh Chen
 5 Kumar Mum
 6 Ganesh Aur
 7 Ram Blr    // 7 rows from student.txt
 8 Arivu Mdr
 9 Raji Pltr
 10 Selvi Chn   // 3 rows from student1.txt
 11 Arun Lnd
 12 Meena Karai  // 2 rows from student2.txt

Internal table's details:
 describe formatted student;
  Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/student 
  Table Type:          MANAGED_TABLE        
  
External table's details:
 describe formatted student_ext;
 Location:            hdfs://quickstart.cloudera:8020/USECASE/student_ext 
 Table Type:          EXTERNAL_TABLE       

Here we create a local text file and load it into Hive from the local filesystem;
after the load, Hive keeps the content in its warehouse directory (an internal/managed table).
Create a local file in linux:
   cat > mylocal.txt
   Arun,Sanjana,Dubshmash
   Trichy,Ramesh,TikTok
   Chitra,Kajal,Musically
   Randy,Kichdy,Youtube
   Amith,Agarwal,Blogger
  
 hive>!clear;
 Create a regular table in hive:
  hive> create table myTableloc (FirstName string, LastName String,SocialMedia string) row format delimited fields terminated by ',' stored as textfile;
  
 Load local linux mylocal.txt file into hive :
  hive> load data local inpath '/home/cloudera/mylocal.txt' into table myTableloc;

 hive> select * from myTableloc;
  OK
  Arun Sanjana Dubshmash
  Trichy Ramesh TikTok
  Chitra Kajal Musically
  Randy Kichdy Youtube
  Amith Agarwal Blogger

  
 hive> describe formatted myTableloc;
  Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytableloc 


 Take data from existing table and make new table:
  hive> create table  myTableloc1 as select * from myTableloc;
  
  hive> select * from myTableloc1;
   OK
   Arun Sanjana Dubshmash
   Trichy Ramesh TikTok
   Chitra Kajal Musically
   Randy Kichdy Youtube
   Amith Agarwal Blogger
   
 Change the name of the table in hive:
  hive> alter table mytableloc1 rename to my1;

  Rename a table:
  hive> show tables;
   OK
   mytableloc
    mytableloc1  // we are going to rename this table
   student
   student_ext
   Time taken: 0.022 seconds, Fetched: 4 row(s)

  hive> alter table mytableloc1 rename to my1;
   OK
   Time taken: 0.262 seconds

  hive> show tables;
   OK
   my1  // new name applied
   mytableloc
   student
   student_ext
   
 How to rename a column?
  hive> alter table mytableloc change socialmedia media string;
  
  describe mytableloc;   // before the rename
   OK
   firstname            string                                
   lastname             string                                
   socialmedia          string                                
   Time taken: 0.081 seconds, Fetched: 3 row(s)

   hive> alter table mytableloc change socialmedia media string;
    OK
    Time taken: 0.215 seconds
  hive>

  hive> describe mytableloc;
   OK
   firstname            string                                
   lastname             string                                
   media                string             // after renaming socialmedia column into media


 How to change data type of existing column?
  hive> describe student;
   id                   int    // current data type is int                                 
   name                 string                                
   location             string                                


  hive> alter table student change id student_id string;

  hive> describe student;
   student_id           string // now data type changed into string                        
   name                 string                                
   location             string                                


 Drop a table in Hive:
  if we drop internal table, both schema and data stored in warehouse folder will be deleted permanently
  
  if we drop external table, only schema will be deleted. and hdfs file(s) wont be affected
  
  hive> show tables;
   OK
   my1    // Going to drop it
   mytableloc
   student
   student_ext
  hive> describe my1;
   Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/my1 
  
  In hdfs, File is there
   
   $ hdfs dfs -ls /user/hive/warehouse/my1
    Found 1 items
    -rwxrwxrwx   1 cloudera supergroup        110 2018-08-29 03:02 /user/hive/warehouse/my1/000000_0
  
  Going to drop internal table :
   hive> drop table my1;

  After dropping the table, its physical file is removed:
    hdfs dfs -ls /user/hive/warehouse/my1
     ls: `/user/hive/warehouse/my1': No such file or directory

  hive> show tables;
   OK
   mytableloc
   student
   student_ext  // going to drop this external table next
   // my1 is missing here as we deleted it.
   
   
   
 We are going to delete external table:
  hive> describe formatted student_ext;
   Location:            hdfs://quickstart.cloudera:8020/USECASE/student_ext 

  Display the folder of hdfs :
   $ hdfs dfs -ls /USECASE/student_ext
    Found 3 items
    -rw-r--r--   1 cloudera supergroup        103 2018-08-28 22:10 /USECASE/student_ext/student.txt
    -rw-r--r--   1 cloudera supergroup         42 2018-08-29 02:27 /USECASE/student_ext/student1.txt
    -rw-r--r--   1 cloudera supergroup         29 2018-08-29 02:30 /USECASE/student_ext/student2.txt

  To drop external table:
   hive> drop table student_ext;

   hive> show tables;
    OK
    mytableloc
    student
    // student_ext table is dropped and it is not in the list after dropping
    
   Even though we dropped the external table in Hive, its folder and corresponding files still reside in hdfs:
    hdfs dfs -ls /USECASE/student_ext
    hive> truncate table student_ext
      cannot truncate non managed table  // external table cannot be truncated
      
    hive> truncate table student
      internal table can be truncated.

    
  ****We can truncate only an internal (managed) table, not an external table
  
  Create Table AS
  CTAS:
   hive> create table student as select * from default.student;
   hive> create table myTableloc as select * from default.mytableloc;

  Map Join:
   Keep the small table in memory and perform the join there for better performance.
   
   If we join Student and Department, the Department table is small (maybe 20 rows), while the Student table may have
   a very large number of rows, say 10 Mn.
   
   Instead of hitting disk I/O for the Department table again and again, we can tell Hive to keep the Department table in memory;
   
   that saves seek time and disk I/O against the Department table.
   
  hive> select /*+ MAPJOIN(Department)*/ * from Student inner join Department on Department.ID = Student.DepartmentID
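
  Newer Hive versions can convert such joins to map joins automatically when the smaller table is below a size threshold; a minimal sketch of the relevant settings (the threshold value shown is only illustrative):
   hive> set hive.auto.convert.join = true;
   hive> set hive.mapjoin.smalltable.filesize = 25000000;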
  
  Namespaces  - Databases
  Schemas     - Tables
  Directories - Partitions
  Files       - Buckets or clusters
     
  
   
  Creating a new table with partition enabled by City:
   
  hive> create table student_part(id string, name string) partitioned by (city string) row format delimited fields terminated by ',' stored as textFile;

  describe formatted student_part:
  
  Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/batch_may.db/student_part  // before renaming

 Change the name of the table:
  hive> alter table student_part rename to person_part;

   describe person_part;  // describe which includes partition info too
   OK
   id                   string                                
   name                 string                                
   city                 string                                
     
   # Partition Information   
   # col_name             data_type            comment           
     
   city                 string       
  
 describe formatted person_part;
  
  Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/batch_may.db/person_part 

 [cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/batch_may.db/
 drwxrwxrwx   - cloudera supergroup          0 2018-08-29 18:34 /user/hive/warehouse/batch_may.db/mytableloc
 drwxrwxrwx   - cloudera supergroup          0 2018-08-30 01:17 /user/hive/warehouse/batch_may.db/person_part  // after renaming
 drwxrwxrwx   - cloudera supergroup          0 2018-08-29 18:32 /user/hive/warehouse/batch_may.db/student

 [cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/batch_may.db/person_part/

 // currently no records found so folder is empty
 set the following in hive:
  hive> set hive.exec.dynamic.partition=true;
  hive> set hive.exec.dynamic.partition.mode = nonstrict;


  
step #1:
 create a normal table in hive:
  hive> create table student(id int, name string, location string);

 insert some rows in student table of batch_may db;
  hive> insert into student(id,name,location) values (101,'sara','MDR');
  hive> insert into student(id,name,location) values (102,'saran','CHN');
  hive> insert into student(id,name,location) values (103,'Maran','TRC');
  hive> insert into student(id,name,location) values (103,'Maran','TRC');

 step #2:
 create a partioned table:
  hive> create table student_part(name string,location string) partitioned by (id int) row format delimited fields terminated by ',' stored as textFile;
 step #3:
 Feed data from student table to student_part table:
  hive> insert into table student_part partition(id) select name,location,id from student;
  
   Loading partition {id=102}
   Loading partition {id=103}
   Loading partition {id=101}

  hive> describe formatted student_part;

  Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/batch_may.db/student_part 

see the contents of that folder in hdfs:
  hdfs dfs -ls /user/hive/warehouse/batch_may.db/student_part/
  drwxrwxrwx   - cloudera supergroup          0 2018-09-02 01:01 /user/hive/warehouse/batch_may.db/student_part/id=101
  drwxrwxrwx   - cloudera supergroup          0 2018-09-02 01:01 /user/hive/warehouse/batch_may.db/student_part/id=102
  drwxrwxrwx   - cloudera supergroup          0 2018-09-02 01:01 /user/hive/warehouse/batch_may.db/student_part/id=103


 hdfs dfs -ls /user/hive/warehouse/batch_may.db/student_part/id=101
 Found 1 items
 -rwxrwxrwx   1 cloudera supergroup          9 2018-09-02 01:01 /user/hive/warehouse/batch_may.db/student_part/id=101/000000_0

 see the content of that file :
  hdfs dfs -cat /user/hive/warehouse/batch_may.db/student_part/id=101/000000_0
   sara,MDR

    
 static partition:
 ----------------
 create table student_part_stat (id int, name string) partitioned by (location string) row format delimited fields terminated by ',' stored as textFile;
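
 With a static partition, the partition value is given explicitly in the statement instead of being derived from the data; a minimal sketch, assuming the student table above:
  hive> insert into table student_part_stat partition (location='MDR')
         select id, name from student where location='MDR';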

 folder cleaning:
  [cloudera@quickstart ~]$ hdfs dfs -rm -r /USECASE
  Deleted /USECASE
   
   
 Create a folder structure in hdfs:
 ---------------------------
 [cloudera@quickstart ~]$ hdfs dfs -mkdir /USECASE
 [cloudera@quickstart ~]$ hdfs dfs -mkdir /USECASE/test
 [cloudera@quickstart ~]$ hdfs dfs -mkdir /USECASE/test/student;
 create .txt files in local linux
 --------------------------------
  mkdir sara;
  $ cd sara;
   ls
  pwd
  /home/cloudera/sara
create multiple files in local linux:
--------------------------------------
[cloudera@quickstart sara]$ cat > a001.txt
 001,sara,MDR
 002,Lara,WestIndies
 003,Arjuna,Srilanka
 ^C
[cloudera@quickstart sara]$ cat > a002.txt
 004,kalai,ottangadu
 005,Muthu,Aathalur
 006,Neels,Perai
 ^C
[cloudera@quickstart sara]$ cat >a003.txt
 007,Anbu,Melai
 008,Naga,Singa
 009,Veera,Singa
 ^C

 ls
 a001.txt  a002.txt  a003.txt

Transfer local files into hdfs:
---------------------------------
 [cloudera@quickstart sara]$ hdfs dfs -put *.* /USECASE/test/student/
Display the hdfs folder contents:
-----------------------------------
 [cloudera@quickstart sara]$ hdfs dfs -ls /USECASE/test/student/
Found 3 items
-rw-r--r--   1 cloudera supergroup         53 2018-09-02 01:49 /USECASE/test/student/a001.txt
-rw-r--r--   1 cloudera supergroup         55 2018-09-02 01:49 /USECASE/test/student/a002.txt
-rw-r--r--   1 cloudera supergroup         46 2018-09-02 01:49 /USECASE/test/student/a003.txt


