Alter Table
Rename Table
ALTER TABLE table_name RENAME TO new_table_name;
Alter Table Properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties:
: (property_name = property_value, property_name = property_value, ... )
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
ALTER TABLE table_name [PARTITION partition_spec] SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties];
ALTER TABLE table_name [PARTITION partition_spec] SET SERDEPROPERTIES serde_properties;
serde_properties:
: (property_name = property_value, property_name = property_value, ... )
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim' = ',');
Alter Table Storage Properties
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]
INTO num_buckets BUCKETS;
These statements change the table's physical storage properties.
Alter Table Skewed
ALTER TABLE table_name SKEWED BY (col_name1, col_name2, ...)
ON ([(col_name1_value, col_name2_value, ...) [, (col_name1_value, col_name2_value), ...]
[STORED AS DIRECTORIES];
ALTER TABLE table_name NOT SKEWED;
ALTER TABLE table_name NOT STORED AS DIRECTORIES;
ALTER TABLE table_name SET SKEWED LOCATION (col_name1="location1" [, col_name2="location2", ...] );
Add Partitions
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
ALTER TABLE table_name ADD PARTITION (partCol = 'value2') location 'loc2';
...
ALTER TABLE table_name ADD PARTITION (partCol = 'valueN') location 'locN';
-- Move partition from table_name_1 to table_name_2
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec) WITH TABLE table_name_1;
-- multiple partitions
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec, partition_spec2, ...) WITH TABLE table_name_1;
Discover Partitions
Automatically discovers and synchronizes the metadata of the partition in Hive Metastore.
When External Partitioned Tables are created, "discover.partitions"="true" table property gets automatically added. For managed partitioned tables, "discover.partitions" table property can be manually added. When Hive Metastore Service (HMS) is started in remote service mode, a background thread (PartitionManagementTask) gets scheduled periodically every 300s (configurable via metastore.partition.management.task.frequency config) that looks for tables with "discover.partitions" table property set to true and performs msck repair in sync mode. If the table is a transactional table, then Exclusive Lock is obtained for that table before performing msck repair. With this table property, "MSCK REPAIR TABLE table_name SYNC PARTITIONS" is no longer required to be run manually.
"discover.partitions"="true" and "partition.retention.period"="7d" then only the partitions created in last 7 days are retained.
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
hive.msck.repair.batch.size
ALTER TABLE table_name RECOVER PARTITIONS;
ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');
ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;
ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format;
ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "new location";
ALTER TABLE table_name TOUCH [PARTITION partition_spec];
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE NO_DROP [CASCADE];
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE OFFLINE;
CREATE TABLE test_change (a int, b int, c int);
// First change column a's name to a1.
ALTER TABLE test_change CHANGE a a1 INT;
// Next change column a1's name to a2, its data type to string, and put it after column b.
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
// The new table's structure is: b int, a2 string, c int.
// Then change column c's name to c1, and put it as the first column.
ALTER TABLE test_change CHANGE c c1 INT FIRST;
// The new table's structure is: c1 int, b int, a2 string.
// Add a comment to column a1
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=11) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=12) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
// hive.exec.dynamic.partition needs to be set to true to enable dynamic partitioning with ALTER PARTITION
SET hive.exec.dynamic.partition = true;
// This will alter all existing partitions in the table with ds='2008-04-08' -- be sure you know what you are doing!
ALTER TABLE foo PARTITION (ds='2008-04-08', hr) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
// This will alter all existing partitions in the table -- be sure you know what you are doing!
ALTER TABLE foo PARTITION (ds, hr) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
INSERT
INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...)
select_statement FROM from_statement;
Configuration property
Default
Note
hive.error.on.empty.partition
false
Whether to throw an exception if
dynamic partition insert generates empty results
hive.exec.dynamic.partition
true
Needs to be set to true to enable dynamic partition inserts
hive.exec.dynamic.partition.mode
strict
In strict mode, the user must specify at least one static
partition in case the user accidentally overwrites all partitions, in nonstrict mode all partitions are allowed to be dynamic
hive.exec.max.created.files
100000
Maximum number of HDFS files
created by all mappers/reducers in a MapReduce job
hive.exec.max.dynamic.partitions
1000
Maximum number of dynamic
partitions allowed to be created in total
hive.exec.max.dynamic.partitions.pernode
100
Maximum number of dynamic
partitions allowed to be created in each mapper/reducer node
export table department to 'hdfs_exports_location/department';
import from 'hdfs_exports_location/department';
Rename
table on import:
export table department to 'hdfs_exports_location/department';
import table imported_dept from 'hdfs_exports_location/department';
Export
partition and import:
export table employee partition (emp_country="in", emp_state="ka") to 'hdfs_exports_location/employee';
import from 'hdfs_exports_location/employee';
Export
table and import partition:
export table employee to 'hdfs_exports_location/employee';
import table employee partition (emp_country="us", emp_state="tn") from 'hdfs_exports_location/employee';
Specify
the import location:
export table department to 'hdfs_exports_location/department';
import table department from 'hdfs_exports_location/department'
location 'import_target_location/department';
Import as
an external table:
export table department to 'hdfs_exports_location/department';
import external table department from 'hdfs_exports_location/department';
Select Syntax
[WITH CommonTableExpression (, CommonTableExpression)*] (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
hive.support.quoted.identifiers to none
SELECT * FROM t1
SELECT * FROM sales WHERE amount > 10 AND region = "US"
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10
SELECT * FROM customers LIMIT 5
SELECT * FROM customers ORDER BY create_date LIMIT 5
SELECT * FROM customers ORDER BY create_date LIMIT 2,5
SELECT `(ds|hr)?+.+` FROM sales
groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
SELECT COUNT(*) FROM table2;
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
vCREATE TABLE t1(a INTEGER, b INTGER);
SELECT
a,
sum(b)
FROM
t1
GROUP BY
a;
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;
Syntax of Order By
The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST) -- (Note: Available in Hive 2.1.0 and later)
orderBy: ORDER BY colName colOrder? colNullOrder? (',' colName colOrder? colNullOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy
There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.
Syntax of Sort By
The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
SELECT key, value FROM src SORT BY key ASC, value DESC
The query had 2 reducers, and the output of each is:
0 5
0 3
3 6
9 1
0 4
0 3
1 1
2 5
SELECT col1, col2 FROM t1 CLUSTER BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
For example, we are Distributing By x on the following 5 rows to 2 reducer:
x1
x2
x4
x3
x1
Reducer 1 got
x1
x2
x1
Reducer 2 got
x4
x3
Collection Functions
The following built-in collection functions are supported in Hive:
Return Type
Name(Signature)
Description
int
size(Map)
Returns the number of elements in the map type.
int
size(Array)
Returns the number of elements in the array type.
array
map_keys(Map)
Returns an unordered array containing the keys of the input map.
array
map_values(Map)
Returns an unordered array containing the values of the input map.
boolean
array_contains(Array, value)
Returns TRUE if the array contains value.
array
sort_array(Array)
Type Conversion Functions
The following type conversion functions are supported in Hive:
Return Type
Name(Signature)
Description
binary
binary(string|binary)
Casts the parameter into a binary.
Expected "=" to follow "type"
cast(expr as )
Converts the results of the expression expr to . For example, cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed. If cast(expr as boolean) Hive returns true for a non-empty string.
Date Functions
The following built-in date functions are supported in Hive:
Return Type
Name(Signature)
Description
Return Type
Name(Signature)
Description
string
from_unixtime(bigint unixtime[, string format])
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
bigint
unix_timestamp()
Gets current Unix timestamp in seconds. This function is not deterministic and its value is not fixed for the scope of a query execution, therefore prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant.
bigint
unix_timestamp(string date)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
Returns the date part of a timestamp string (pre-Hive 2.1.0): to_date("1970-01-01 00:00:00") = "1970-01-01". As of Hive 2.1.0, returns a date object.
Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created.
int
year(string date)
Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int
quarter(date/timestamp/string)
Returns the quarter of the year for a date, timestamp, or string in the range 1 to 4 (as of Hive 1.3.0). Example: quarter('2015-04-08') = 2.
int
month(string date)
Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int
day(string date) dayofmonth(date)
Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
int
hour(string date)
Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12.
int
minute(string date)
Returns the minute of the timestamp.
int
second(string date)
Returns the second of the timestamp.
int
weekofyear(string date)
Returns the week number of a timestamp string: weekofyear("1970-11-01 00:00:00") = 44, weekofyear("1970-11-01") = 44.
int
extract(field FROM source)
Retrieve fields such as days or hours from source (as of Hive 2.2.0). Source must be a date, timestamp, interval or a string that can be converted into either a date or timestamp. Supported fields include: day, dayofweek, hour, minute, month, quarter, second, week and year.
Examples:
select extract(month from "2016-10-20") results in 10.
select extract(hour from "2016-10-20 05:06:07") results in 5.
select extract(dayofweek from "2016-10-20 05:06:07") results in 5.
select extract(month from interval '1-3' year to month) results in 3.
select extract(minute from interval '3 12:20:30' day to second) results in 20.
int
datediff(string enddate, string startdate)
Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2.
Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'.
Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created.
Subtracts a number of days to startdate: date_sub('2008-12-31', 1) = '2008-12-30'.
Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created.
Converts a timestamp* in UTC to a given timezone (as of Hive 0.8.0).
* timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint, float/double and decimal. Fractional values are considered as seconds. Integer values are considered as milliseconds. For example, from_utc_timestamp(2592000.0,'PST'), from_utc_timestamp(2592000000,'PST') and from_utc_timestamp(timestamp '1970-01-30 16:00:00','PST') all return the timestamp 1970-01-30 08:00:00.
Converts a timestamp* in a given timezone to UTC (as of Hive 0.8.0).
* timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint, float/double and decimal.
Fractional values are considered as seconds. Integer values are considered as milliseconds. For example, to_utc_timestamp(2592000.0,'PST'), to_utc_timestamp(2592000000,'PST') and to_utc_timestamp(timestamp '1970-01-30 16:00:00','PST') all return the timestamp 1970-01-31 00:00:00.
date
current_date
Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value.
timestamp
current_timestamp
Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value.
string
add_months(string start_date, int num_months, output_date_format)
Returns the date that is num_months after start_date (as of Hive 1.1.0). start_date is a string, date or timestamp. num_months is an integer. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date. The default output format is 'yyyy-MM-dd'.
Before Hive 4.0.0, the time part of the date is ignored.
As of Hive 4.0.0, add_months supports an optional argument output_date_format, which accepts a String that represents a valid date format for the output. This allows to retain the time format in the output.
For example :
add_months('2009-08-31', 1) returns '2009-09-30'.
add_months('2017-12-31 14:15:16', 2, 'YYYY-MM-dd HH:mm:ss') returns '2018-02-28 14:15:16'.
string
last_day(string date)
Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored.
string
next_day(string start_date, string day_of_week)
Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20.
string
trunc(string date, string format)
Returns date truncated to the unit specified by the format (as of Hive 1.2.0). Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. Example: trunc('2015-03-17', 'MM') = 2015-03-01.
double
months_between(date1, date2)
Returns number of months between dates date1 and date2 (as of Hive 1.2.0). If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in time components date1 and date2. date1 and date2 type can be date, timestamp or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') = 3.94959677
string
date_format(date/timestamp/string ts, string fmt)
Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'.
date_format can be used to implement other UDFs, e.g.:
No comments:
Post a Comment