Bigdata_Spark_Consultant: Hive Expert part2

Alter Table
Rename Table
ALTER TABLE table_name RENAME TO new_table_name;
Alter Table Properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties;

table_properties:
: (property_name = property_value, property_name = property_value, ... )
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
ALTER TABLE table_name [PARTITION partition_spec] SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties];

ALTER TABLE table_name [PARTITION partition_spec] SET SERDEPROPERTIES serde_properties;

serde_properties:
: (property_name = property_value, property_name = property_value, ... )
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim' = ',');
Alter Table Storage Properties
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]
INTO num_buckets BUCKETS;
These statements change the table's physical storage properties.
Alter Table Skewed
ALTER TABLE table_name SKEWED BY (col_name1, col_name2, ...)
ON ([(col_name1_value, col_name2_value, ...) [, (col_name1_value, col_name2_value), ...]
[STORED AS DIRECTORIES];
ALTER TABLE table_name NOT SKEWED;
ALTER TABLE table_name NOT STORED AS DIRECTORIES;
ALTER TABLE table_name SET SKEWED LOCATION (col_name1="location1" [, col_name2="location2", ...] );
Add Partitions
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];

partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
ALTER TABLE table_name ADD PARTITION (partCol = 'value2') location 'loc2';
...
ALTER TABLE table_name ADD PARTITION (partCol = 'valueN') location 'locN';
-- Move partition from table_name_1 to table_name_2
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec) WITH TABLE table_name_1;
-- multiple partitions
ALTER TABLE table_name_2 EXCHANGE PARTITION (partition_spec, partition_spec2, ...) WITH TABLE table_name_1;
Discover Partitions
Automatically discovers and synchronizes the metadata of the partition in Hive Metastore.
When External Partitioned Tables are created, "discover.partitions"="true" table property gets automatically added. For managed partitioned tables, "discover.partitions" table property can be manually added. When Hive Metastore Service (HMS) is started in remote service mode, a background thread (PartitionManagementTask) gets scheduled periodically every 300s (configurable via metastore.partition.management.task.frequency config) that looks for tables with "discover.partitions" table property set to true and performs msck repair in sync mode. If the table is a transactional table, then Exclusive Lock is obtained for that table before performing msck repair. With this table property, "MSCK REPAIR TABLE table_name SYNC PARTITIONS" is no longer required to be run manually.
"discover.partitions"="true" and "partition.retention.period"="7d" then only the partitions created in last 7 days are retained.
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
hive.msck.repair.batch.size
ALTER TABLE table_name RECOVER PARTITIONS;
ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='us');
ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;
ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format;
ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "new location";
ALTER TABLE table_name TOUCH [PARTITION partition_spec];
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE NO_DROP [CASCADE];

ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE OFFLINE;
CREATE TABLE test_change (a int, b int, c int);

// First change column a's name to a1.
ALTER TABLE test_change CHANGE a a1 INT;

// Next change column a1's name to a2, its data type to string, and put it after column b.
ALTER TABLE test_change CHANGE a1 a2 STRING AFTER b;
// The new table's structure is: b int, a2 string, c int.

// Then change column c's name to c1, and put it as the first column.
ALTER TABLE test_change CHANGE c c1 INT FIRST;
// The new table's structure is: c1 int, b int, a2 string.

// Add a comment to column a1
ALTER TABLE test_change CHANGE a1 a1 INT COMMENT 'this is column a1';
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=11) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
ALTER TABLE foo PARTITION (ds='2008-04-08', hr=12) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);
// hive.exec.dynamic.partition needs to be set to true to enable dynamic partitioning with ALTER PARTITION
SET hive.exec.dynamic.partition = true;

// This will alter all existing partitions in the table with ds='2008-04-08' -- be sure you know what you are doing!
ALTER TABLE foo PARTITION (ds='2008-04-08', hr) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);

// This will alter all existing partitions in the table -- be sure you know what you are doing!
ALTER TABLE foo PARTITION (ds, hr) CHANGE COLUMN dec_column_name dec_column_name DECIMAL(38,18);

Create View

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]

[COMMENT view_comment]

[TBLPROPERTIES (property_name = property_value, ...)]

AS SELECT ...;

CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')

COMMENT 'Referrers to The Onion website'

SELECT DISTINCT referrer_url

FROM page_view

WHERE page_url='http://www.theonion.com';

DROP VIEW [IF EXISTS] [db_name.]view_name;

ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;

table_properties:

: (property_name = property_value, property_name = property_value, ...)

ALTER VIEW [db_name.]view_name AS select_statement;

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;

CREATE TEMPORARY FUNCTION function_name AS class_name;

SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];

SHOW VIEWS; -- show all views in the current database

SHOW VIEWS 'test_*'; -- show all views that start with "test_"

SHOW VIEWS '*view2'; -- show all views that end in "view2"

SHOW VIEWS LIKE 'test_view1|test_view2'; -- show views named either "test_view1" or "test_view2"

SHOW VIEWS FROM test1; -- show views from database test1

SHOW VIEWS IN test1; -- show views from database test1 (FROM and IN are same)

SHOW VIEWS IN test1 "test_*"; -- show views from database test2 that start with "test_"

SHOW PARTITIONS table_name PARTITION(ds='2010-03-03'); -- (Note: Hive 0.6 and later)

SHOW PARTITIONS table_name PARTITION(hr='12'); -- (Note: Hive 0.6 and later)

SHOW PARTITIONS table_name PARTITION(ds='2010-03-03', hr='12'); -- (Note: Hive 0.6 and later)

SHOW CREATE TABLE ([db_name.]table_name|view_name);

CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)

COMMENT 'A bucketed copy of user_info'

PARTITIONED BY(ds STRING)

CLUSTERED BY(user_id) INTO 256 BUCKETS;

set hive.enforce.bucketing = true; -- (Note: Not needed in Hive 2.x onward)

FROM user_id

INSERT OVERWRITE TABLE user_info_bucketed

PARTITION (ds='2009-02-25')

SELECT userid, firstname, lastname WHERE ds='2009-02-25';

Can be collected manually by: ANALYZE TABLE ... COMPUTE STATISTICS

Can be collected manually by: ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09'

,
hr) COMPUTE STATISTICS FOR COLUMNS;

Create/build, show, and drop index:

CREATE INDEX table01_index ON TABLE table01 (column2) AS 'COMPACT';

SHOW INDEX ON table01;

DROP INDEX table01_index ON table01;

Create then build, show formatted (with column names), and drop index:

CREATE INDEX table02_index ON TABLE table02 (column3) AS 'COMPACT' WITH DEFERRED REBUILD;

ALTER INDEX table02_index ON table2 REBUILD;

SHOW FORMATTED INDEX ON table02;

DROP INDEX table02_index ON table02;

Create bitmap index, build, show, and drop:

CREATE INDEX table03_index ON TABLE table03 (column4) AS 'BITMAP' WITH DEFERRED REBUILD;

ALTER INDEX table03_index ON table03 REBUILD;

SHOW FORMATTED INDEX ON table03;

DROP INDEX table03_index ON table03;

Create index in a new table:

CREATE INDEX table04_index ON TABLE table04 (column5) AS 'COMPACT' WITH DEFERRED REBUILD IN TABLE table04_index_table;

Create index stored as RCFile:

CREATE INDEX table05_index ON TABLE table05 (column6) AS 'COMPACT' STORED AS RCFILE;

Create index stored as text file:

CREATE INDEX table06_index ON TABLE table06 (column7) AS 'COMPACT' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

Create index with index properties:

CREATE INDEX table07_index ON TABLE table07 (column8) AS 'COMPACT' IDXPROPERTIES ("prop1"="value1", "prop2"="value2");

Create index with table properties:

CREATE INDEX table08_index ON TABLE table08 (column9) AS 'COMPACT' TBLPROPERTIES ("prop3"="value3", "prop4"="value4");

Drop index if exists:

DROP INDEX IF EXISTS table09_index ON table09;

Rebuild index on a partition:

ALTER INDEX table10_index ON table10 PARTITION (columnX='valueQ', columnY='valueR') REBUILD;

hive> set hive.archive.enabled=true;

hive> set hive.archive.har.parentdir.settable=true;

hive> set har.partfile.size=1099511627776;

Archive

Once the configuration values are set, a partition can be archived with the command:

ALTER TABLE table_name ARCHIVE PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)

For example:

ALTER TABLE srcpart ARCHIVE PARTITION(ds='2008-04-08', hr='12')

Once the command is issued, a mapreduce job will perform the archiving. Unlike Hive queries, there is no output on the CLI to indicate process.

Unarchive

The partition can be reverted back to its original files with the unarchive command:

ALTER TABLE srcpart UNARCHIVE PARTITION(ds='2008-04-08', hr='12')

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)

Standard syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

Hive extension (multiple inserts):

FROM from_statement

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1

[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]

[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

FROM from_statement

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1

[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]

[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

Hive extension (dynamic partition inserts):

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

Configuration property	Default	Note
hive.error.on.empty.partition	false	Whether to throw an exception if dynamic partition insert generates empty results
hive.exec.dynamic.partition	true	Needs to be set to true to enable dynamic partition inserts
hive.exec.dynamic.partition.mode	strict	In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions, in nonstrict mode all partitions are allowed to be dynamic
hive.exec.max.created.files	100000	Maximum number of HDFS files created by all mappers/reducers in a MapReduce job
hive.exec.max.dynamic.partitions	1000	Maximum number of dynamic partitions allowed to be created in total
hive.exec.max.dynamic.partitions.pernode	100	Maximum number of dynamic partitions allowed to be created in each mapper/reducer node

FROM page_view_stg pvs

INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)

SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

Standard syntax:

INSERT OVERWRITE [LOCAL] DIRECTORY directory1

[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)

SELECT ... FROM ...

Hive extension (multiple inserts):

FROM from_statement

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1

[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

row_format

: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]

[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

Standard Syntax:

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

Where values_row is:

( value [, value ...] )

where a value is either null or any valid SQL literal

CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))

CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;

INSERT INTO TABLE students

VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);

CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)

PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;

INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')

VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);

INSERT INTO TABLE pageviews PARTITION (datestamp)

VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');

INSERT INTO TABLE pageviews

VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');

Standard Syntax:

UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

hive.optimize.sort.dynamic.partition=false

Standard Syntax:

DELETE FROM tablename [WHERE expression]

Standard Syntax:

MERGE INTO AS T USING AS S

ON <boolean expression1>

WHEN MATCHED [AND <boolean expression2>] THEN UPDATE SET

WHEN MATCHED [AND <boolean expression3>] THEN DELETE

WHEN NOT MATCHED [AND <boolean expression4>] THEN INSERT VALUES

Export Syntax

EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]

TO 'export_target_path' [ FOR replication('eventid') ]

Import Syntax

IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION
  (part_column=

"value"[, ...])]]

FROM 'source_path'

[LOCATION 'import_target_path']

Examples

Simple export and import:

export table department to 'hdfs_exports_location/department';

import from 'hdfs_exports_location/department';

Rename table on import:

export table department to 'hdfs_exports_location/department';

import table imported_dept from 'hdfs_exports_location/department';

Export partition and import:

export table employee partition (emp_country="in", emp_state="ka") to 'hdfs_exports_location/employee';

import from 'hdfs_exports_location/employee';

Export table and import partition:

export table employee to 'hdfs_exports_location/employee';

import table employee partition (emp_country="us", emp_state="tn") from 'hdfs_exports_location/employee';

Specify the import location:

export table department to 'hdfs_exports_location/department';

import table department from 'hdfs_exports_location/department'

       location 'import_target_location/department';

Import as an external table:

export table department to 'hdfs_exports_location/department';

import external table department from 'hdfs_exports_location/department';

Select Syntax
[WITH CommonTableExpression (, CommonTableExpression)*]    (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]

hive.support.quoted.identifiers to none

SELECT * FROM t1

SELECT * FROM sales WHERE amount > 10 AND region = "US"

hive> SELECT col1, col2 FROM t1
    1 3
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1, col2 FROM t1
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1 FROM t1
    1
    2

SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10

SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10

SELECT * FROM customers LIMIT 5

SELECT * FROM customers ORDER BY create_date LIMIT 5

SELECT * FROM customers ORDER BY create_date LIMIT 2,5

SELECT `(ds|hr)?+.+` FROM sales

groupByClause: GROUP BY groupByExpression (, groupByExpression)*

groupByExpression: expression

groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
SELECT COUNT(*) FROM table2;
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
vCREATE TABLE t1(a INTEGER, b INTGER);

SELECT
   a,
   sum(b)
FROM
   t1
GROUP BY
   a;

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;

set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;

Syntax of Order By
The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC )
colNullOrder: (NULLS FIRST | NULLS LAST)           -- (Note: Available in Hive 2.1.0 and later)
orderBy: ORDER BY colName colOrder? colNullOrder? (',' colName colOrder? colNullOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy
There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.
Syntax of Sort By
The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC )

sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.
SELECT key, value FROM src SORT BY key ASC, value DESC
The query had 2 reducers, and the output of each is:
0   5
0   3
3   6
9   1
0   4
0   3
1   1
2   5
SELECT col1, col2 FROM t1 CLUSTER BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1

SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
For example, we are Distributing By x on the following 5 rows to 2 reducer:
x1
x2
x4
x3
x1
Reducer 1 got
x1
x2
x1
Reducer 2 got
x4
x3

Collection Functions

The following built-in collection functions are supported in Hive:

Return Type	Name(Signature)	Description
int	size(Map)	Returns the number of elements in the map type.
int	size(Array)	Returns the number of elements in the array type.
array	map_keys(Map)	Returns an unordered array containing the keys of the input map.
array	map_values(Map)	Returns an unordered array containing the values of the input map.
boolean	array_contains(Array, value)	Returns TRUE if the array contains value.
array	sort_array(Array)

Type Conversion Functions

The following type conversion functions are supported in Hive:

Return Type	Name(Signature)	Description
binary	binary(string\|binary)	Casts the parameter into a binary.
Expected "=" to follow "type"	cast(expr as )	Converts the results of the expression expr to . For example, cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed. If cast(expr as boolean) Hive returns true for a non-empty string.

Date Functions

The following built-in date functions are supported in Hive:

Return Type	Name(Signature)	Description
Return Type	Name(Signature)	Description
string	from_unixtime(bigint unixtime[, string format])	Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
bigint	unix_timestamp()	Gets current Unix timestamp in seconds. This function is not deterministic and its value is not fixed for the scope of a query execution, therefore prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant.
bigint	unix_timestamp(string date)	Converts time string in format `yyyy-MM-dd HH:mm:ss` to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
bigint	unix_timestamp(string date, string pattern)	Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
pre 2.1.0: string 2.1.0 on: date	to_date(string timestamp)	Returns the date part of a timestamp string (pre-Hive 2.1.0): to_date("1970-01-01 00:00:00") = "1970-01-01". As of Hive 2.1.0, returns a date object. Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created.
int	year(string date)	Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int	quarter(date/timestamp/string)	Returns the quarter of the year for a date, timestamp, or string in the range 1 to 4 (as of Hive 1.3.0). Example: quarter('2015-04-08') = 2.
int	month(string date)	Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int	day(string date) dayofmonth(date)	Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
int	hour(string date)	Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12.
int	minute(string date)	Returns the minute of the timestamp.
int	second(string date)	Returns the second of the timestamp.
int	weekofyear(string date)	Returns the week number of a timestamp string: weekofyear("1970-11-01 00:00:00") = 44, weekofyear("1970-11-01") = 44.
int	extract(field FROM source)	Retrieve fields such as days or hours from source (as of Hive 2.2.0). Source must be a date, timestamp, interval or a string that can be converted into either a date or timestamp. Supported fields include: day, dayofweek, hour, minute, month, quarter, second, week and year. Examples: select extract(month from "2016-10-20") results in 10. select extract(hour from "2016-10-20 05:06:07") results in 5. select extract(dayofweek from "2016-10-20 05:06:07") results in 5. select extract(month from interval '1-3' year to month) results in 3. select extract(minute from interval '3 12:20:30' day to second) results in 20.
int	datediff(string enddate, string startdate)	Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2.
pre 2.1.0: string 2.1.0 on: date	date_add(date/timestamp/string startdate, tinyint/smallint/int days)	Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01'. Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created.
pre 2.1.0: string 2.1.0 on: date	date_sub(date/timestamp/string startdate, tinyint/smallint/int days)	Subtracts a number of days to startdate: date_sub('2008-12-31', 1) = '2008-12-30'. Prior to Hive 2.1.0 (HIVE-13248) the return type was a String because no Date type existed when the method was created.
timestamp	from_utc_timestamp({any primitive type} ts, string timezone)	Converts a timestamp* in UTC to a given timezone (as of Hive 0.8.0). * timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint, float/double and decimal. Fractional values are considered as seconds. Integer values are considered as milliseconds. For example, from_utc_timestamp(2592000.0,'PST'), from_utc_timestamp(2592000000,'PST') and from_utc_timestamp(timestamp '1970-01-30 16:00:00','PST') all return the timestamp 1970-01-30 08:00:00.
timestamp	to_utc_timestamp({any primitive type} ts, string timezone)	Converts a timestamp* in a given timezone to UTC (as of Hive 0.8.0). * timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint, float/double and decimal. Fractional values are considered as seconds. Integer values are considered as milliseconds. For example, to_utc_timestamp(2592000.0,'PST'), to_utc_timestamp(2592000000,'PST') and to_utc_timestamp(timestamp '1970-01-30 16:00:00','PST') all return the timestamp 1970-01-31 00:00:00.
date	current_date	Returns the current date at the start of query evaluation (as of Hive 1.2.0). All calls of current_date within the same query return the same value.
timestamp	current_timestamp	Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value.
string	add_months(string start_date, int num_months, output_date_format)	Returns the date that is num_months after start_date (as of Hive 1.1.0). start_date is a string, date or timestamp. num_months is an integer. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date. The default output format is 'yyyy-MM-dd'. Before Hive 4.0.0, the time part of the date is ignored. As of Hive 4.0.0, add_months supports an optional argument output_date_format, which accepts a String that represents a valid date format for the output. This allows to retain the time format in the output. For example : add_months('2009-08-31', 1) returns '2009-09-30'. add_months('2017-12-31 14:15:16', 2, 'YYYY-MM-dd HH:mm:ss') returns '2018-02-28 14:15:16'.
string	last_day(string date)	Returns the last day of the month which the date belongs to (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time part of date is ignored.
string	next_day(string start_date, string day_of_week)	Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20.
string	trunc(string date, string format)	Returns date truncated to the unit specified by the format (as of Hive 1.2.0). Supported formats: MONTH/MON/MM, YEAR/YYYY/YY. Example: trunc('2015-03-17', 'MM') = 2015-03-01.
double	months_between(date1, date2)	Returns number of months between dates date1 and date2 (as of Hive 1.2.0). If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in time components date1 and date2. date1 and date2 type can be date, timestamp or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') = 3.94959677
string	date_format(date/timestamp/string ts, string fmt)	Converts a date/timestamp/string to a value of string in the format specified by the date format fmt (as of Hive 1.2.0). Supported formats are Java SimpleDateFormat formats – https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html. The second argument fmt should be constant. Example: date_format('2015-04-08', 'y') = '2015'. date_format can be used to implement other UDFs, e.g.: dayname(date) is date_format(date, 'EEEE') dayofyear(date) is date_format(date, 'D')

Bigdata_Spark_Consultant

Friday, October 4, 2019

Hive Expert part2

Export Syntax

Import Syntax

Examples

Collection Functions

Type Conversion Functions

Date Functions

No comments:

Post a Comment

Python Challenges Program

Blog Archive

Labels