Alter Table
Rename Table
ALTER TABLE table_name RENAME TO new_table_name;
Alter Table Properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ... )
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
ALTER TABLE table_name [PARTITION partition_spec] SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties];
ALTER TABLE table_name [PARTITION partition_spec] SET SERDEPROPERTIES serde_properties;
serde_properties:
: (property_name = property_value, property_name = property_value, ... )
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim' = ',');
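For instance, a minimal sketch (the table name page_view and the property values are illustrative; LazySimpleSerDe ships with Hive):
ALTER TABLE page_view SET TBLPROPERTIES ('comment' = 'page view logs', 'last_modified_by' = 'etl');
ALTER TABLE page_view SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('field.delim' = '\t');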
Alter Table Storage Properties
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]
INTO num_buckets BUCKETS;
These statements change the table's physical storage properties. Note that they alter only the table metadata; existing data is not reorganized or reformatted.
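A minimal sketch (table and column names are illustrative); only writes issued after the change use the new bucketing:
ALTER TABLE page_view CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS;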
Alter Table Skewed
ALTER TABLE table_name SKEWED BY (col_name1, col_name2, ...)
ON ((col_name1_value, col_name2_value, ...) [, (col_name1_value, col_name2_value, ...), ...])
[STORED AS DIRECTORIES];
ALTER TABLE table_name NOT SKEWED;
ALTER TABLE table_name NOT STORED AS DIRECTORIES;
ALTER TABLE table_name SET SKEWED LOCATION (col_name1="location1" [, col_name2="location2", ...] );
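For example, a sketch with an illustrative table and skew values:
ALTER TABLE page_view SKEWED BY (userid) ON ('1', '2') STORED AS DIRECTORIES;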
Add Partitions
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
ALTER TABLE table_name ADD PARTITION (partCol = 'value2') location 'loc2';
...
ALTER TABLE table_name ADD PARTITION (partCol = 'valueN') location 'locN';
-- Move partition from table_name_1 to table_name_2
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec) WITH TABLE table_name_1;
-- multiple partitions
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec, partition_spec2, ...) WITH TABLE table_name_1;
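A concrete sketch, assuming t1 and t2 have identical schemas and the partition currently lives in t1:
-- Moves partition ds='2008-08-08' (data and metadata) from t1 into t2
ALTER TABLE t2 EXCHANGE PARTITION (ds='2008-08-08') WITH TABLE t1;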
Discover Partitions
Automatically discovers and synchronizes partition metadata in the Hive Metastore.
When External Partitioned Tables are created, "discover.partitions"="true" table property gets automatically added. For managed partitioned tables, "discover.partitions" table property can be manually added. When Hive Metastore Service (HMS) is started in remote service mode, a background thread (PartitionManagementTask) gets scheduled periodically every 300s (configurable via metastore.partition.management.task.frequency config) that looks for tables with "discover.partitions" table property set to true and performs msck repair in sync mode. If the table is a transactional table, then Exclusive Lock is obtained for that table before performing msck repair. With this table property, "MSCK REPAIR TABLE table_name SYNC PARTITIONS" is no longer required to be run manually.
"discover.partitions"="true" and "partition.retention.period"="7d" then only the partitions created in last 7 days are retained.
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
MSCK can run the repair in batches; the batch size is controlled by the configuration property hive.msck.repair.batch.size.
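For example, to synchronize the metastore with partitions added or removed directly on the filesystem (table name is illustrative):
MSCK REPAIR TABLE page_view SYNC PARTITIONS;  -- adds new partitions and drops missing ones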
ALTER TABLE table_name RECOVER PARTITIONS;  -- (Amazon EMR's equivalent of MSCK REPAIR TABLE)
ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');
ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;
ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format;
ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "new location";
ALTER TABLE table_name TOUCH [PARTITION partition_spec];
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE NO_DROP [CASCADE];
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE OFFLINE;
CREATE TABLE test_change (a int, b int, c int);
-- First change column a's name to a1.
ALTER TABLE test_change CHANGE a a1 INT;
-- Next change column a1's name to a2, its data type to string, and put it after column b.
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
-- The new table's structure is: b int, a2 string, c int.
-- Then change column c's name to c1, and put it as the first column.
ALTER TABLE test_change CHANGE c c1 INT FIRST;
-- The new table's structure is: c1 int, b int, a2 string.
-- Add a comment to column a1
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=11) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=12) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
-- hive.exec.dynamic.partition needs to be set to true to enable dynamic partitioning with ALTER PARTITION
SET hive.exec.dynamic.partition = true;
-- This will alter all existing partitions in the table with ds='2008-04-08' -- be sure you know what you are doing!
ALTER TABLE foo PARTITION (ds='2008-04-08', hr) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
-- This will alter all existing partitions in the table -- be sure you know what you are doing!
ALTER TABLE foo PARTITION (ds, hr) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
Create View
CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...;
CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')
COMMENT 'Referrers to The Onion website'
AS SELECT DISTINCT referrer_url
FROM page_view
WHERE page_url='http://www.theonion.com';
DROP VIEW [IF EXISTS] [db_name.]view_name;
ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ...)
ALTER VIEW [db_name.]view_name AS select_statement;
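For example, redefining the onion_referrers view from the CREATE VIEW example above:
ALTER VIEW onion_referrers AS SELECT DISTINCT referrer_url FROM page_view;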
ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;
CREATE TEMPORARY FUNCTION function_name AS class_name;
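A sketch; the jar path and UDF class name are hypothetical, and the class must be on Hive's classpath:
ADD JAR /path/to/my_udfs.jar;  -- hypothetical jar containing the UDF
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
SELECT my_lower(referrer_url) FROM page_view LIMIT 10;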
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
SHOW VIEWS;                               -- show all views in the current database
SHOW VIEWS 'test_*';                      -- show all views that start with "test_"
SHOW VIEWS '*view2';                      -- show all views that end in "view2"
SHOW VIEWS LIKE 'test_view1|test_view2';  -- show views named either "test_view1" or "test_view2"
SHOW VIEWS FROM test1;                    -- show views from database test1
SHOW VIEWS IN test1;                      -- show views from database test1 (FROM and IN are same)
SHOW VIEWS IN test1 "test_*";             -- show views from database test1 that start with "test_"
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03');           -- (Note: Hive 0.6 and later)
SHOW PARTITIONS table_name PARTITION(hr='12');                   -- (Note: Hive 0.6 and later)
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03', hr='12');  -- (Note: Hive 0.6 and later)
SHOW CREATE TABLE ([db_name.]table_name|view_name);
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(ds STRING)
CLUSTERED BY(user_id) INTO 256 BUCKETS;

set hive.enforce.bucketing = true;  -- (Note: Not needed in Hive 2.x onward)

FROM user_info
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT userid, firstname, lastname WHERE ds='2009-02-25';
Table statistics can be collected manually by: ANALYZE TABLE ... COMPUTE STATISTICS
Column statistics can be collected manually by: ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS FOR COLUMNS;
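Once computed, the statistics land in the metastore and can be inspected; a hedged sketch:
ANALYZE TABLE Table1 COMPUTE STATISTICS;  -- basic table stats (row count, size, ...)
DESCRIBE FORMATTED Table1;                -- the stats appear under Table Parameters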
Create/build, show, and drop index:
CREATE INDEX table01_index ON TABLE table01 (column2) AS 'COMPACT';
SHOW INDEX ON table01;
DROP INDEX table01_index ON table01;
Create then build, show formatted (with column names), and drop index:
CREATE INDEX table02_index ON TABLE table02 (column3) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX table02_index ON table02 REBUILD;
SHOW FORMATTED INDEX ON table02;
DROP INDEX table02_index ON table02;
Create bitmap index, build, show, and drop:
CREATE INDEX table03_index ON TABLE table03 (column4) AS 'BITMAP' WITH DEFERRED REBUILD;
ALTER INDEX table03_index ON table03 REBUILD;
SHOW FORMATTED INDEX ON table03;
DROP INDEX table03_index ON table03;
Create index in a new table:
CREATE INDEX table04_index ON TABLE table04 (column5) AS 'COMPACT' WITH DEFERRED REBUILD IN TABLE table04_index_table;
Create index stored as RCFile:
CREATE INDEX table05_index ON TABLE table05 (column6) AS 'COMPACT' STORED AS RCFILE;
Create index stored as text file:
CREATE INDEX table06_index ON TABLE table06 (column7) AS 'COMPACT' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Create index with index properties:
CREATE INDEX table07_index ON TABLE table07 (column8) AS 'COMPACT' IDXPROPERTIES ("prop1"="value1", "prop2"="value2");
Create index with table properties:
CREATE INDEX table08_index ON TABLE table08 (column9) AS 'COMPACT' TBLPROPERTIES ("prop3"="value3", "prop4"="value4");
Drop index if exists:
DROP INDEX IF EXISTS table09_index ON table09;
Rebuild index on a partition:
ALTER INDEX table10_index ON table10 PARTITION (columnX='valueQ', columnY='valueR') REBUILD;
hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;
Archive
Once the configuration values are set, a partition can be archived with the command:
ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
For example:
ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')
Once the command is issued, a MapReduce job will perform the archiving. Unlike Hive queries, there is no output on the CLI to indicate progress.
Unarchive
The partition can be reverted back to its original files with the unarchive command:
ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (Hive 3.0 or later)
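For example (the local file path is hypothetical):
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');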
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
Configuration property | Default | Note
---|---|---
hive.error.on.empty.partition | false | Whether to throw an exception if dynamic partition insert generates empty results
hive.exec.dynamic.partition | true | Needs to be set to true to enable dynamic partition inserts
hive.exec.dynamic.partition.mode | strict | In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions; in nonstrict mode all partitions are allowed to be dynamic
hive.exec.max.created.files | 100000 | Maximum number of HDFS files created by all mappers/reducers in a MapReduce job
hive.exec.max.dynamic.partitions | 1000 | Maximum number of dynamic partitions allowed to be created in total
hive.exec.max.dynamic.partitions.pernode | 100 | Maximum number of dynamic partitions allowed to be created in each mapper/reducer node
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
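For example, a sketch that dumps query results into a hypothetical local directory as comma-delimited text:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM page_view;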
Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

where values_row is:
( value [, value ...] )
where a value is either null or any valid SQL literal
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;

INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
INSERT INTO TABLE pageviews PARTITION (datestamp)
VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');
INSERT INTO TABLE pageviews
VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');
Standard Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
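For example, against the students table created earlier (a sketch assuming students was created as a transactional/ACID table):
UPDATE students SET gpa = 2.50 WHERE name = 'barney rubble';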
Standard Syntax:
DELETE FROM tablename [WHERE expression]
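Similarly (also requires a transactional table):
DELETE FROM students WHERE gpa < 1.50;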
Standard Syntax:
MERGE INTO <target table> AS T USING <source expression/table> AS S
ON <boolean expression1>
WHEN MATCHED [AND <boolean expression2>] THEN UPDATE SET <set clause list>
WHEN MATCHED [AND <boolean expression3>] THEN DELETE
WHEN NOT MATCHED [AND <boolean expression4>] THEN INSERT VALUES <value list>
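A sketch with hypothetical target and staging tables (the target must be a transactional table):
-- customer and customer_stage, with columns id/name/state, are illustrative
MERGE INTO customer AS T
USING customer_stage AS S
ON T.id = S.id
WHEN MATCHED THEN UPDATE SET name = S.name, state = S.state
WHEN NOT MATCHED THEN INSERT VALUES (S.id, S.name, S.state);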
Export Syntax
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])] TO 'export_target_path' [ FOR replication('eventid') ]
Import Syntax
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]] FROM 'source_path' [LOCATION 'import_target_path']
Examples
Simple export and import:
export table department to 'hdfs_exports_location/department';
import from 'hdfs_exports_location/department';
Rename table on import:
export table department to 'hdfs_exports_location/department';
import table imported_dept from 'hdfs_exports_location/department';
Export partition and import:
export table employee partition (emp_country="in", emp_state="ka") to 'hdfs_exports_location/employee';
import from 'hdfs_exports_location/employee';
Export table and import partition:
export table employee to 'hdfs_exports_location/employee';
import table employee partition (emp_country="us", emp_state="tn") from 'hdfs_exports_location/employee';
Specify the import location:
export table department to 'hdfs_exports_location/department';
import table department from 'hdfs_exports_location/department'
location 'import_target_location/department';
Import as an external table:
export table department to 'hdfs_exports_location/department';
import external table department from 'hdfs_exports_location/department';
Select Syntax
[WITH CommonTableExpression (, CommonTableExpression)*] (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
REGEX column specifications (like the backquoted expression in the last example below) require the configuration property hive.support.quoted.identifiers to be set to none.
SELECT * FROM t1
SELECT * FROM sales WHERE amount > 10 AND region = "US"
hive> SELECT col1, col2 FROM t1
1 3
1 3
1 4
2 5
hive> SELECT DISTINCT col1, col2 FROM t1
1 3
1 4
2 5
hive> SELECT DISTINCT col1 FROM t1
1
2
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10
SELECT * FROM customers LIMIT 5
SELECT * FROM customers ORDER BY create_date LIMIT 5
SELECT * FROM customers ORDER BY create_date LIMIT 2,5
SELECT `(ds|hr)?+.+` FROM sales
groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
SELECT COUNT(*) FROM table2;
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
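One workaround sketch: compute each distinct count in its own subquery and join the results on the grouping key:
SELECT u.gender, u.distinct_users, i.distinct_ips
FROM (SELECT gender, count(DISTINCT userid) AS distinct_users FROM pv_users GROUP BY gender) u
JOIN (SELECT gender, count(DISTINCT ip) AS distinct_ips FROM pv_users GROUP BY gender) i
ON (u.gender = i.gender);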
CREATE TABLE t1(a INTEGER, b INTEGER);

SELECT a, sum(b) FROM t1 GROUP BY a;
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;
Syntax of Order By
The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST) -- (Note: Available in Hive 2.1.0 and later)
orderBy: ORDER BY colName colOrder? colNullOrder? (',' colName colOrder? colNullOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy
There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.
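For example, a query that satisfies strict mode:
SELECT key, value FROM src ORDER BY key ASC LIMIT 100;  -- the LIMIT clause is required under hive.mapred.mode=strict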
Syntax of Sort By
The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
SELECT key, value FROM src SORT BY key ASC, value DESC
The query had 2 reducers, and the output of each is:
0 5
0 3
3 6
9 1
0 4
0 3
1 1
2 5
SELECT col1, col2 FROM t1 CLUSTER BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
For example, we are Distributing By x on the following 5 rows to 2 reducers:
x1
x2
x4
x3
x1
Reducer 1 got
x1
x2
x1
Reducer 2 got
x4
x3
Collection Functions
The following built-in collection functions are supported in Hive:
Return Type | Name(Signature) | Description
---|---|---
int | size(Map<K.V>) | Returns the number of elements in the map type.
int | size(Array<T>) | Returns the number of elements in the array type.
array<K> | map_keys(Map<K.V>) | Returns an unordered array containing the keys of the input map.
array<V> | map_values(Map<K.V>) | Returns an unordered array containing the values of the input map.
boolean | array_contains(Array<T>, value) | Returns TRUE if the array contains value.
array<t> | sort_array(Array<T>) | Sorts the input array in ascending order according to the natural ordering of the array elements and returns it.
Type Conversion Functions
The following type conversion functions are supported in Hive:
Return Type | Name(Signature) | Description
---|---|---
binary | binary(string|binary) | Casts the parameter into a binary.
<type> | cast(expr as <type>) | Converts the results of the expression expr to <type>. For example, cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed.
Date Functions
The following built-in date functions are supported in Hive:
Return Type | Name(Signature) | Description
---|---|---
string | from_unixtime(bigint unixtime[, string format]) | Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00". |
bigint | unix_timestamp() | Gets current Unix timestamp in seconds. This function is not deterministic and its value is not fixed for the scope of a query execution; it therefore prevents proper optimization of queries. It has been deprecated since 2.0 in favour of the CURRENT_TIMESTAMP constant. |
bigint | unix_timestamp(string date) | Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801 |
bigint | unix_timestamp(string date, string pattern) | Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400. |
pre 2.1.0: string 2.1.0 on: date | to_date(string timestamp) | Returns the date part of a timestamp string (pre-Hive 2.1.0): to_date("1970-01-01 00:00:00") = "1970-01-01". As of Hive 2.1.0, returns a date object. Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created. |
int | year(string date) | Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970. |
int | quarter(date/timestamp/string) | Returns the quarter of the year for a date, timestamp, or string in the range 1 to 4 (as of Hive 1.3.0). Example: quarter('2015-04-08') = 2. |
int | month(string date) | Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11. |
int | day(string date) dayofmonth(date) | Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1. |
int | hour(string date) | Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12. |
int | minute(string date) | Returns the minute of the timestamp. |
int | second(string date) | Returns the second of the timestamp. |
int | weekofyear(string date) | Returns the week number of a timestamp string: weekofyear("1970-11-01 00:00:00") = 44, weekofyear("1970-11-01") = 44. |
int | extract(field FROM source) | Retrieve fields such as days or hours from source (as of Hive 2.2.0). Source must be a date, timestamp, interval or a string that can be converted into either a date or timestamp. Supported fields include: day, dayofweek, hour, minute, month, quarter, second, week and year. Example: extract(month from '2016-10-20') results in 10. |
int | datediff(string enddate, string startdate) | Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2. |
pre 2.1.0: string 2.1.0 on: date | date_add(date/timestamp/string startdate, tinyint/smallint/int days) | Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'. Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created. |
pre 2.1.0: string 2.1.0 on: date | date_sub(date/timestamp/string startdate, tinyint/smallint/int days) | Subtracts a number of days to startdate: date_sub('2008-12-31', 1) = '2008-12-30'. Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created. |
timestamp | from_utc_timestamp({any primitive type} ts, string timezone) | Converts a timestamp* in UTC to a given timezone (as of Hive 0.8.0). * timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint, float/double and decimal. Fractional values are considered as seconds. Integer values are considered as milliseconds. For example, from_utc_timestamp(2592000.0,'PST'), from_utc_timestamp(2592000000,'PST') and from_utc_timestamp(timestamp '1970-01-30 16:00:00','PST') all return the timestamp 1970-01-30 08:00:00. |
timestamp | to_utc_timestamp({any primitive type} ts, string timezone) | Converts a timestamp* in a given timezone to UTC (as of Hive 0.8.0). * timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint, float/double and decimal. Fractional values are considered as seconds. Integer values are considered as milliseconds. For example, to_utc_timestamp(2592000.0,'PST'), to_utc_timestamp(2592000000,'PST') and to_utc_timestamp(timestamp '1970-01-30 16:00:00','PST') all return the timestamp 1970-01-31 00:00:00. |
date | current_date | Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value. |
timestamp | current_timestamp | Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. |
string | add_months(string start_date, int num_months, output_date_format) | Returns the date that is num_months after start_date (as of Hive 1.1.0). start_date is a string, date or timestamp. num_months is an integer. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date. The default output format is 'yyyy-MM-dd'. Before Hive 4.0.0, the time part of the date is ignored. As of Hive 4.0.0, add_months supports an optional argument output_date_format, which accepts a String that represents a valid date format for the output, allowing the time format to be retained. For example: add_months('2009-08-31', 1) returns '2009-09-30'. add_months('2017-12-31 14:15:16', 2, 'YYYY-MM-dd HH:mm:ss') returns '2018-02-28 14:15:16'. |
string | last_day(string date) | Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored. |
string | next_day(string start_date, string day_of_week) | Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20. |
string | trunc(string date, string format) | Returns date truncated to the unit specified by the format (as of Hive 1.2.0). Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. Example: trunc('2015-03-17', 'MM') = 2015-03-01. |
double | months_between(date1, date2) | Returns number of months between dates date1 and date2 (as of Hive 1.2.0). If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in time components date1 and date2. date1 and date2 type can be date, timestamp or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') = 3.94959677 |
string | date_format(date/timestamp/string ts, string fmt) | Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'. date_format can be used to implement other UDFs, e.g. dayname(date) is date_format(date, 'EEEE') and dayofyear(date) is date_format(date, 'D'). |