Monday, September 30, 2019

Hive Expert part 1



Commands are non-SQL statements such as setting a property or adding a resource. They can be used in HiveQL scripts or directly in the CLI or Beeline.
Command
Description
quit
exit
Use quit or exit to leave the interactive shell.
reset
Resets the configuration to the default values (as of Hive 0.10: see HIVE-3202). Any configuration parameters that were set using the set command or the -hiveconf parameter on the hive command line will be reset to their default values.
Note that this does not apply to configuration parameters that were set in the set command using the "hiveconf:" prefix for the key name (for historic reasons).
set <key>=<value>
Sets the value of a particular configuration variable (key).
Note: If you misspell the variable name, the CLI will not show an error.
set
Prints a list of configuration variables that are overridden by the user or Hive.
set -v
Prints all Hadoop and Hive configuration variables.
add FILE[S] <filepath> <filepath>*
add JAR[S] <filepath> <filepath>*
add ARCHIVE[S] <filepath> <filepath>*
Adds one or more files, jars, or archives to the list of resources in the distributed cache. See Hive Resources for more information.
add FILE[S] <ivyurl> <ivyurl>*
add JAR[S] <ivyurl> <ivyurl>*
add ARCHIVE[S] <ivyurl> <ivyurl>*
As of Hive 1.2.0, adds one or more files, jars or archives to the list of resources in the distributed cache using an Ivy URL of the form ivy://group:module:version?query_string. See Hive Resources for more information.
list FILE[S]
list JAR[S]
list ARCHIVE[S]
Lists the resources already added to the distributed cache. See Hive Resources for more information.
list FILE[S] <filepath>*
list JAR[S] <filepath>*
list ARCHIVE[S] <filepath>*
Checks whether the given resources are already added to the distributed cache or not. See Hive Resources for more information.
delete FILE[S] <filepath>*
delete JAR[S] <filepath>*
delete ARCHIVE[S] <filepath>*
Removes the resource(s) from the distributed cache.
delete FILE[S] <ivyurl> <ivyurl>*
delete JAR[S] <ivyurl> <ivyurl>*
delete ARCHIVE[S] <ivyurl> <ivyurl>*
As of Hive 1.2.0, removes the resource(s) which were added using the <ivyurl> from the distributed cache. See Hive Resources for more information.
! <command>
Executes a shell command from the Hive shell.
dfs <dfs command>
Executes a dfs command from the Hive shell.
<query string>
Executes a Hive query and prints results to standard output.
source FILE <filepath>
Executes a script file inside the CLI.
compile `<groovy string>` AS GROOVY NAMED <name>
This allows inline Groovy code to be compiled and be used as a UDF (as of Hive 0.13.0). For a usage example, see Nov. 2013 Hive Contributors Meetup Presentations – Using Dynamic Compilation with Hive.
Sample Usage:
  hive> set mapred.reduce.tasks=32;
  hive> set;
  hive> select a.* from tab1 a;
  hive> !ls;
  hive> dfs -ls;

Hive Command Line Options

To get help, run "hive -H" or "hive --help". Usage (as it is in Hive 0.9.0):
usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    Connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to Hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        Connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

Examples

See Variable Substitution for examples of using the hiveconf option.
  • Example of running a query from the command line
    $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
  • Example of setting Hive configuration variables
    $HIVE_HOME/bin/hive -e 'select a.col from tab1 a' --hiveconf hive.exec.scratchdir=/home/my/hive_scratch  --hiveconf mapred.reduce.tasks=32
  • Example of dumping data out from a query into a file using silent mode
    $HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt
  • Example of running a script non-interactively from local disk
    $HIVE_HOME/bin/hive -f /home/my/hive-script.sql
  • Example of running a script non-interactively from a Hadoop supported filesystem (starting in Hive 0.14)
    $HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
    $HIVE_HOME/bin/hive -f s3://mys3bucket/s3-script.sql 
  • Example of running an initialization script before entering interactive mode
    $HIVE_HOME/bin/hive -i /home/my/hive-init.sql
  • Example of sending log output to the console (by default Hive writes its log to /tmp/<user.name>/hive.log):
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console
hive.root.logger specifies the logging level as well as the log destination. Specifying console as the target sends the logs to standard error (instead of the log file).
Hive Batch Mode Commands
When $HIVE_HOME/bin/hive is run with the -e or -f option, it executes SQL commands in batch mode.
  • hive -e '<query-string>' executes the query string.
  • hive -f <filepath> executes one or more SQL queries from a file.
Version 0.14

As of Hive 0.14, <filepath> can be from one of the Hadoop supported filesystems (HDFS, S3, etc.) as well.
$HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
$HIVE_HOME/bin/hive -f s3://mys3bucket/s3-script.sql
  hive> add FILE /tmp/tt.py;
  hive> list FILES;
  /tmp/tt.py
  hive> SELECT TRANSFORM(a.networkid)
               USING 'python tt.py' AS nn
        FROM networks a WHERE a.ds = '2009-01-04' LIMIT 10;



In remote mode HiveServer2 only accepts valid Thrift calls – even in HTTP mode, the message body contains Thrift payloads.

% beeline -u jdbc:hive2://localhost:10000/default -n scott -w password_file
Hive version 0.11.0-SNAPSHOT by Apache

Connecting to jdbc:hive2://localhost:10000/default
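Beeline can also be started without arguments and a connection opened from within the shell using the !connect command (a minimal sketch; host, port, and credentials are placeholders):

% beeline
beeline> !connect jdbc:hive2://localhost:10000 scott tiger

If the username or password is omitted, Beeline prompts for it.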

Beeline Command Options

The Beeline CLI supports these command line options:
Option
Description
-u <database url>
The JDBC URL to connect to. Special characters in parameter values should be encoded with URL encoding if needed.
Usage: beeline -u db_URL 
-r
Reconnect to last used URL (if a user has previously used !connect to a URL and used !save to a beeline.properties file).
Usage: beeline -r
Version: 2.1.0 (HIVE-13670)
-n <username>
The username to connect as.
Usage: beeline -n valid_user
-p <password>
The password to connect as.
Usage: beeline -p valid_password
Optional password mode:
Starting with Hive 2.2.0 (HIVE-13589) the argument for the -p option is optional.
Usage : beeline -p [valid_password]
If the password is not provided after -p, Beeline will prompt for it while initiating the connection. When a password is provided, Beeline uses it to initiate the connection without prompting.
-d <driver class>
The driver class to use.
Usage: beeline -d driver_class
-e <query>
Query that should be executed. Double or single quotes enclose the query string. This option can be specified multiple times.
Usage: beeline -e "query_string"
Support to run multiple SQL statements separated by semicolons in a single query_string: 1.2.0 (HIVE-9877)
Bug fix (null pointer exception): 0.13.0 (HIVE-5765)
Bug fix (--headerInterval not honored): 0.14.0 (HIVE-7647)
Bug fix (running -e in background): 1.3.0 and 2.0.0 (HIVE-6758); workaround available for earlier versions
-f <script file>
Script file that should be executed.
Usage: beeline -f filepath
Version: 0.12.0 (HIVE-4268)
Note: If the script contains tabs, query compilation fails in version 0.12.0. This bug is fixed in version 0.13.0 (HIVE-6359).
Bug fix (running -f in background): 1.3.0 and 2.0.0 (HIVE-6758); workaround available for earlier versions
-i (or) --init <init file1> <init file2>
The init file(s) for initialization
Usage: beeline -i /tmp/initfile
Single file:
Version: 0.14.0 (HIVE-6561)
Multiple files:
Version: 2.1.0 (HIVE-11336)
-w (or) --password-file <password file>
The password file to read password from.
Version: 1.2.0 (HIVE-7175)
-a (or) --authType <auth type>
The authentication type passed to the jdbc as an auth property
Version: 0.13.0 (HIVE-5155)
--property-file <file>
File to read configuration properties from
Usage: beeline --property-file /tmp/a
Version: 2.2.0 (HIVE-13964)
--hiveconf property=value
Use value for the given configuration property. Properties that are listed in hive.conf.restricted.list cannot be reset with hiveconf (see Restricted List and Whitelist).
Usage: beeline --hiveconf prop1=value1
Version: 0.13.0 (HIVE-6173)
--hivevar name=value
Hive variable name and value. This is a Hive-specific setting in which variables can be set at the session level and referenced in Hive commands or queries.
Usage: beeline --hivevar var1=value1
--color=[true/false]
Control whether color is used for display. Default is false.
Usage: beeline --color=true
(Not supported for Separated-Value Output formats. See HIVE-9770)
--showHeader=[true/false]
Show column names in query results (true) or not (false). Default is true.
Usage: beeline --showHeader=false
--headerInterval=ROWS
The interval for redisplaying column headers, in number of rows, when outputformat is table. Default is 100.
Usage: beeline --headerInterval=50
(Not supported for Separated-Value Output formats. See HIVE-9770)
--fastConnect=[true/false]
When connecting, skip building a list of all tables and columns for tab-completion of HiveQL statements (true) or build the list (false). Default is true.
Usage: beeline --fastConnect=false
--autoCommit=[true/false]
Enable/disable automatic transaction commit. Default is false.
Usage: beeline --autoCommit=true
--verbose=[true/false]
Show verbose error messages and debug information (true) or do not show (false). Default is false.
Usage: beeline --verbose=true
--showWarnings=[true/false]
Display warnings that are reported on the connection after issuing any HiveQL commands. Default is false.
Usage: beeline --showWarnings=true
--showDbInPrompt=[true/false]
Display the current database name in prompt. Default is false.
Usage: beeline --showDbInPrompt=true
Version: 2.2.0 (HIVE-14123)
--showNestedErrs=[true/false]
Display nested errors. Default is false.
Usage: beeline --showNestedErrs=true
--numberFormat=[pattern]
Format numbers using a DecimalFormat pattern.
Usage: beeline --numberFormat="#,###,##0.00"
--force=[true/false]
Continue running script even after errors (true) or do not continue (false). Default is false.
Usage: beeline --force=true
--maxWidth=MAXWIDTH
The maximum width to display before truncating data, in characters, when outputformat is table. Default is to query the terminal for current width, then fall back to 80.
Usage: beeline --maxWidth=150
--maxColumnWidth=MAXCOLWIDTH
The maximum column width, in characters, when outputformat is table. Default is 50 in Hive version 2.2.0+ (see HIVE-14135) or 15 in earlier versions.
Usage: beeline --maxColumnWidth=25
--silent=[true/false]
Reduce the amount of informational messages displayed (true) or not (false). It also stops displaying the log messages for the query from HiveServer2 (Hive 0.14 and later) and the HiveQL commands (Hive 1.2.0 and later). Default is false.
Usage: beeline --silent=true
--autosave=[true/false]
Automatically save preferences (true) or do not autosave (false). Default is false.
Usage: beeline --autosave=true
--outputformat=[table/vertical/csv/tsv/dsv/csv2/tsv2]
Format mode for result display. Default is table. See Separated-Value Output Formats below for description of recommended sv options.
Usage: beeline --outputformat=tsv
Version: dsv/csv2/tsv2 added in 0.14.0 (HIVE-8615)
--truncateTable=[true/false]
If true, truncates table column in the console when it exceeds console length.
Version: 0.14.0 (HIVE-6928)
--delimiterForDSV= DELIMITER
The delimiter for delimiter-separated values output format. Default is '|' character.
Version: 0.14.0 (HIVE-7390)
--isolation=LEVEL
Set the transaction isolation level to TRANSACTION_READ_COMMITTED or TRANSACTION_SERIALIZABLE. See the "Field Detail" section in the Java Connection documentation.
Usage: beeline --isolation=TRANSACTION_SERIALIZABLE
--nullemptystring=[true/false]
Use historic behavior of printing null as empty string (true) or use current behavior of printing null as NULL (false). Default is false.
Usage: beeline --nullemptystring=false
Version: 0.13.0 (HIVE-4485)
--incremental=[true/false]
Defaults to true from Hive 2.3 onwards; before that it defaulted to false. When set to false, the entire result set is fetched and buffered before being displayed, yielding optimal display column sizing. When set to true, result rows are displayed immediately as they are fetched, yielding lower latency and memory usage at the price of extra display column padding. Setting --incremental=true is recommended if you encounter an OutOfMemory on the client side (due to the fetched result set size being large).
--incrementalBufferRows=NUMROWS
The number of rows to buffer when printing rows on stdout, defaults to 1000; only applicable if --incremental=true and --outputformat=table
Usage: beeline --incrementalBufferRows=1000
Version: 2.3.0 (HIVE-14170)
--maxHistoryRows=NUMROWS
The maximum number of rows to store in the Beeline history.
Version: 2.3.0 (HIVE-15166)
--delimiter=;
Set the delimiter for queries written in Beeline. Multi-char delimiters are allowed, but quotation marks, slashes, and -- are not allowed. Defaults to ;
Usage: beeline --delimiter=$$
Version: 3.0.0 (HIVE-10865)
--convertBinaryArrayToString=[true/false]
Display binary column data as string or as byte array. 
Usage: beeline --convertBinaryArrayToString=true
Version: 3.0.0 (HIVE-14786)
--help
Display a usage message.
Usage: beeline --help
nohup beeline --silent=true --showHeader=true --outputformat=dsv -f query.hql > /tmp/output.log 2> /tmp/error.log &
"1.0"?>
"text/xsl" href="configuration.xsl"?>
  beeline.hs2.connection.user
  hive
  beeline.hs2.connection.password
  hive


"1.0"?>

"text/xsl" href="configuration.xsl"?>



  beeline.hs2.connection.hosts

  localhost:10000
  beeline.hs2.connection.principal
  hive/dummy-hostname@domain.com


"1.0"?>

"text/xsl" href="configuration.xsl"?>



  beeline.hs2.connection.user

  hive

  beeline.hs2.connection.hiveconf
  hive.cli.print.current.db=true, hive.cli.print.header=true
  beeline.hs2.connection.hivevar
  testVarName1=value1, testVarName2=value2

"1.0"?>
"text/xsl" href="configuration.xsl"?>
  beeline.hs2.jdbc.url.tcpUrl
  jdbc:hive2://localhost:10000/default;user=hive;password=hive
  beeline.hs2.jdbc.url.httpUrl
  jdbc:hive2://localhost:10000/default;user=hive;password=hive;transportMode=http;httpPath=cliservice
  beeline.hs2.jdbc.url.default
  tcpUrl


"1.0"?>
"text/xsl" href="configuration.xsl"?>
  beeline.hs2.connection.user
  hive
  beeline.hs2.connection.password
  hive
Consider the following beeline-site.xml:
"1.0"?>
"text/xsl" href="configuration.xsl"?>
  beeline.hs2.jdbc.url.tcpUrl
  jdbc:hive2://localhost:10000/default
  beeline.hs2.jdbc.url.httpUrl
  jdbc:hive2://localhost:10000/default;transportMode=http;httpPath=cliservice
  beeline.hs2.jdbc.url.default
  tcpUrl


Using JDBC


You can use JDBC to access data stored in a relational database or other tabular format.

  1. Load the HiveServer2 JDBC driver. As of 1.2.0 applications no longer need to explicitly load JDBC drivers using Class.forName().

    For example:
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    
  2. Connect to the database by creating a Connection object with the JDBC driver.

    For example:
    Connection cnct = DriverManager.getConnection("jdbc:hive2://<host>:<port>", "<user>", "<password>");
    
    The default <port> is 10000. In non-secure configurations, specify a <user> for the query to run as. The <password> field value is ignored in non-secure mode.
    Connection cnct = DriverManager.getConnection("jdbc:hive2://<host>:<port>", "<user>", "");
    
    In Kerberos secure mode, the user information is based on the Kerberos credentials.
  3. Submit SQL to the database by creating a Statement object and using its executeQuery() method.

    For example:
    Statement stmt = cnct.createStatement();
    ResultSet rset = stmt.executeQuery("SELECT foo FROM bar");
    
  4. Process the result set, if necessary.

JDBC Client Sample Code

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveJdbcClient {
  private static String driverName = "org.apache.hive.jdbc.HiveDriver";
  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }
    // replace "hive" here with the name of the user the queries should run as
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.execute("drop table if exists " + tableName);
    stmt.execute("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    ResultSet res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
       // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }
    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    stmt.execute(sql);
    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }
    // regular hive query
    sql = "select count(1) from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}

Running the JDBC Sample Code

# Then on the command-line
$ javac HiveJdbcClient.java
# To run the program using remote hiveserver in non-kerberos mode, we need the following jars in the classpath
# from hive/build/dist/lib
#     hive-jdbc*.jar
#     hive-service*.jar
#     libfb303-0.9.0.jar
#        libthrift-0.9.0.jar
#     log4j-1.2.16.jar
#     slf4j-api-1.6.1.jar
#    slf4j-log4j12-1.6.1.jar
#     commons-logging-1.0.4.jar
#
#
# To run the program using kerberos secure mode, we need the following jars in the classpath
#     hive-exec*.jar
#     commons-configuration-1.6.jar (This is not needed with Hadoop 2.6.x and later).
#  and from hadoop
#     hadoop-core*.jar (use hadoop-common*.jar for Hadoop 2.x)
#
# To run the program in embedded mode, we need the following additional jars in the classpath
# from hive/build/dist/lib
#     hive-exec*.jar
#     hive-metastore*.jar
#     antlr-runtime-3.0.1.jar
#     derby.jar
#     jdo2-api-2.1.jar
#     jpox-core-1.2.2.jar
#     jpox-rdbms-1.2.2.jar
# and from hadoop/build
#     hadoop-core*.jar
# as well as hive/build/dist/conf, any HIVE_AUX_JARS_PATH set, 
# and hadoop jars necessary to run MR jobs (eg lzo codec)
$ java -cp $CLASSPATH HiveJdbcClient
Alternatively, you can run the following bash script, which will seed the data file and build your classpath before invoking the client. The script adds all the additional jars needed for using HiveServer2 in embedded mode as well.
#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive
echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt
HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-core*.jar)
CLASSPATH=.:$HIVE_HOME/conf:$(hadoop classpath)
for i in ${HIVE_HOME}/lib/*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done
java -cp $CLASSPATH HiveJdbcClient

$ a=b
$ hive -e " describe $a "
$ bin/hive --hiveconf a=b -e 'set a; set hiveconf:a; \
create table if not exists b (col int); describe ${hiveconf:a}'

Hive history file=/tmp/edward/hive_job_log_edward_201011240906_1463048967.txt
a=b
hiveconf:a=b
OK
Time taken: 5.913 seconds
OK
col int
Time taken: 0.754 seconds

There are three namespaces for variables: hiveconf, system, and env. A variable set in the hiveconf namespace, for example:

set x=myvalue

can later be referenced as ${hiveconf:x}.

set zzz=5;
--  sets zzz=5
set zzz;
set system:xxx=5;
set system:xxx;
-- sets a system property xxx to 5
set system:yyy=${system:xxx};
set system:yyy;
-- sets yyy with value of xxx
set go=${hiveconf:zzz};
set go;
-- sets go based on the value of zzz
set hive.variable.substitute=false;
set raw=${hiveconf:zzz};
set raw;
-- with substitution disabled, raw is set to the literal string ${hiveconf:zzz}
set hive.variable.substitute=true;
EXPLAIN SELECT * FROM src where key=${hiveconf:zzz};
SELECT * FROM src where key=${hiveconf:zzz};
--use a variable in a query
set a=1;
set b=a;
set c=${hiveconf:${hiveconf:b}};
set c;
--uses nested variables.
set jar=../lib/derby.jar;
add file ${hiveconf:jar};
list file;
delete file ${hiveconf:jar};
list file;
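The env namespace exposes the environment of the shell that started Hive; unlike hiveconf and system variables, env variables are read-only inside a session. A small sketch, assuming HOME is set in the calling shell:

set env:HOME;
-- prints the value of the HOME environment variable
SELECT '${env:HOME}' FROM src LIMIT 1;
-- the value is substituted into the query text before execution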

The HCatalog command line interface (CLI) can be invoked as HIVE_HOME=hive_home hcat_home/bin/hcat where hive_home is the directory where Hive has been installed and hcat_home is the directory where HCatalog has been installed.


HCatalog CLI

The HCatalog CLI supports these command line options:
Option
Usage
Description

(none)
hcat
Prints a usage message.
-D
hcat -Dkey=value ...
Passes the key-value pair to HCatalog as a Java System Property.
-e
hcat -e 'create table mytable(a int);' ...
Tells HCatalog to treat the following string as a DDL command and execute it.
-f
hcat -f myscript.hcatalog ...
Tells HCatalog that myscript.hcatalog is a file containing DDL commands to execute.
-g
hcat -g mygroup ...
Tells HCatalog that the table which needs to be created must have group "mygroup".
-p
hcat -p rwxr-xr-x ...
Tells HCatalog that the table which needs to be created must have permissions "rwxr-xr-x".
Note the following:
  • The -g and -p options are not mandatory.
  • Only one -e or -f option can be provided, not both.
  • The order of options is immaterial; you can specify the options in any order.
If no option is provided, then a usage message is printed:
Usage:  hcat  { -e "<query>" | -f <filepath> }  [-g <group>] [-p <perms>] [-D<name>=<value>]
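For example, creating a table with a group and permissions in a single invocation (a sketch combining the options described above):

hcat -e "create table mytable(a int);" -g mygroup -p rwxr-xr-x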
The HCatalog CLI supports Hive DDL except operations that require running a MapReduce job; the following commands are not supported:
  • ALTER INDEX ... REBUILD
  • CREATE TABLE ... AS SELECT
  • ALTER TABLE ... CONCATENATE
  • ALTER TABLE ARCHIVE/UNARCHIVE PARTITION
  • ANALYZE TABLE ... COMPUTE STATISTICS
  • IMPORT FROM ...
  • EXPORT TABLE



CREATE TABLE kst
  PARTITIONED BY (ds string)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='http://schema_provider/kst.avsc');

CREATE TABLE kst (
    string1 string,
    string2 string,
    int1 int,
    boolean1 boolean,
    long1 bigint,
    float1 float,
    double1 double,
    inner_record1 struct<int_in_inner_record1:int, string_in_inner_record1:string>,
    enum1 string,
    array1 array<string>,
    map1 map<string,string>,
    union1 uniontype<float, boolean, string>,
    fixed1 binary,
    null1 void,
    unionnullint int,
    bytes1 binary)
  PARTITIONED BY (ds string)
  STORED AS AVRO;

CREATE TABLE test_serializer(string1 STRING,
                             int1 INT,
                             tinyint1 TINYINT,
                             smallint1 SMALLINT,
                             bigint1 BIGINT,
                             boolean1 BOOLEAN,
                             float1 FLOAT,
                             double1 DOUBLE,
                             list1 ARRAY<STRING>,
                             map1 MAP<STRING,INT>,
                             struct1 STRUCT<sInt:INT,sBoolean:BOOLEAN,sString:STRING>,
                             union1 UNIONTYPE<FLOAT, BOOLEAN, STRING>,
                             enum1 STRING,
                             nullableint INT,
                             bytes1 BINARY,
                             fixed1 BINARY)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n'
 STORED AS TEXTFILE;


{
  "namespace": "com.linkedin.haivvreo",
  "name": "test_serializer",
  "type": "record",
  "fields": [
    { "name":"string1", "type":"string" },
    { "name":"int1", "type":"int" },
    { "name":"tinyint1", "type":"int" },
    { "name":"smallint1", "type":"int" },
    { "name":"bigint1", "type":"long" },
    { "name":"boolean1", "type":"boolean" },
    { "name":"float1", "type":"float" },
    { "name":"double1", "type":"double" },
    { "name":"list1", "type":{"type":"array", "items":"string"} },
    { "name":"map1", "type":{"type":"map", "values":"int"} },
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [
          { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },
    { "name":"union1", "type":["float", "boolean", "string"] },
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },
    { "name":"nullableint", "type":["int", "null"] },
    { "name":"bytes1", "type":"bytes" },
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }
  ] }


CREATE TABLE as_avro
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');
 
INSERT OVERWRITE TABLE as_avro SELECT * FROM test_serializer;


CREATE TABLE as_avro(string1 STRING,
                     int1 INT,
                     tinyint1 TINYINT,
                     smallint1 SMALLINT,
                     bigint1 BIGINT,
                     boolean1 BOOLEAN,
                     float1 FLOAT,
                     double1 DOUBLE,
                     list1 ARRAY<STRING>,
                     map1 MAP<STRING,INT>,
                     struct1 STRUCT<sInt:INT,sBoolean:BOOLEAN,sString:STRING>,
                     union1 UNIONTYPE<FLOAT, BOOLEAN, STRING>,
                     enum1 STRING,
                     nullableint INT,
                     bytes1 BINARY,
                     fixed1 BINARY)
STORED AS AVRO;
INSERT OVERWRITE TABLE as_avro SELECT * FROM test_serializer;

CREATE TABLE embedded
  COMMENT "just drop the schema right into the HQL"
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.literal'='{
      "namespace": "com.howdy",
      "name": "some_schema",
      "type": "record",
      "fields": [ { "name":"string1","type":"string"}]
    }');

The Avro schema URL can also be supplied at invocation time through a hiveconf variable:
hive --hiveconf schema="${SCHEMA}" -f your_script_file.sql
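Inside your_script_file.sql the variable can then be referenced through variable substitution. A sketch, assuming the hypothetical table name avro_with_variable_schema:

CREATE TABLE avro_with_variable_schema
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='${hiveconf:schema}');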

ORC Files

Using ORC files improves performance when Hive is reading, writing, and processing data. The ORC file format provides:
  • a single file as the output of each task, which reduces the NameNode's load
  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
  • light-weight indexes stored within the file
    • skip row groups that don't pass predicate filtering
    • seek to a given row
  • block-mode compression based on data type
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • concurrent reads of the same file using separate RecordReaders
  • ability to split files without scanning for markers
  • bound the amount of memory needed for reading or writing
  • metadata stored using Protocol Buffers, which allows addition and removal of fields


An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file a postscript holds compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates count, min, max, and sum.

(Diagram: ORC file structure, showing stripes of row data, the file footer, and the postscript.)


  • CREATE TABLE ... STORED AS ORC
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc
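For example, an ORC table with the compression codec chosen through table properties (a minimal sketch; the table name and columns are illustrative):

CREATE TABLE orc_example (id INT, name STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='SNAPPY');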


// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
    [--backup-path <new-path>] <location-of-orc-file-or-directory>
Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).
Specifying --rowindex with a comma separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).
Specifying -t in the command will print the timezone id of the writer.
Specifying -j in the command will print the ORC file metadata in JSON format. To pretty print the JSON metadata, add -p to the command.
Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.
Specifying --skip-dump along with --recover will perform recovery without dumping metadata.
Specifying --backup-path with a new-path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).
<location-of-orc-file> is the URI of the ORC file.
<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
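For example, dumping the metadata plus the row indexes of the first two columns (the warehouse path is illustrative):

hive --orcfiledump --rowindex 1,2 /user/hive/warehouse/orc_example/000000_0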

PARQUET:

CREATE TABLE parquet_test (
id int,
str string,
mp MAP<STRING,STRING>,
lst ARRAY<STRING>,
strct STRUCT<A:STRING,B:STRING>)
PARTITIONED BY (part string)
STORED AS PARQUET;

Compressed Data Storage
CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE TABLE raw_sequence (line STRING)
   STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)

INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;




core-site.xml

Add the following LZO codec classes to your core-site.xml:
  • com.hadoop.compression.lzo.LzoCodec
  • com.hadoop.compression.lzo.LzopCodec
For example:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Next, set the following properties so that Hive writes LZO-compressed output:

SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
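With the codecs configured, a table over LZO-compressed text can be declared with the LZO input/output formats (a sketch; the table name and location are illustrative — the class names match the LZO values listed in the storage-format table below):

CREATE EXTERNAL TABLE lzo_weblogs (line STRING)
  STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/data/weblogs_lzo';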

Overview


HiveQL DDL statements are documented here, including:

  • CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX
  • DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
  • TRUNCATE TABLE
  • ALTER DATABASE/SCHEMA, TABLE, VIEW
  • MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS)
  • SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS, PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE
  • DESCRIBE DATABASE/SCHEMA, table_name, view_name, materialized_view_name

PARTITION statements are usually options of TABLE statements, except for SHOW PARTITIONS.

Keywords, Non-reserved Keywords and Reserved Keywords


All Keywords

Version

Non-reserved Keywords

Reserved Keywords

Hive 1.2.0
ADD, ADMIN, AFTER, ANALYZE, ARCHIVE, ASC, BEFORE, BUCKET, BUCKETS, CASCADE, CHANGE, CLUSTER, CLUSTERED, CLUSTERSTATUS, COLLECTION, COLUMNS, COMMENT, COMPACT, COMPACTIONS, COMPUTE, CONCATENATE, CONTINUE, DATA, DATABASES, DATETIME, DAY, DBPROPERTIES, DEFERRED, DEFINED, DELIMITED, DEPENDENCY, DESC, DIRECTORIES, DIRECTORY, DISABLE, DISTRIBUTE, ELEM_TYPE, ENABLE, ESCAPED, EXCLUSIVE, EXPLAIN, EXPORT, FIELDS, FILE, FILEFORMAT, FIRST, FORMAT, FORMATTED, FUNCTIONS, HOLD_DDLTIME, HOUR, IDXPROPERTIES, IGNORE, INDEX, INDEXES, INPATH, INPUTDRIVER, INPUTFORMAT, ITEMS, JAR, KEYS, KEY_TYPE, LIMIT, LINES, LOAD, LOCATION, LOCK, LOCKS, LOGICAL, LONG, MAPJOIN, MATERIALIZED, METADATA, MINUS, MINUTE, MONTH, MSCK, NOSCAN, NO_DROP, OFFLINE, OPTION, OUTPUTDRIVER, OUTPUTFORMAT, OVERWRITE, OWNER, PARTITIONED, PARTITIONS, PLUS, PRETTY, PRINCIPALS, PROTECTION, PURGE, READ, READONLY, REBUILD, RECORDREADER, RECORDWRITER, REGEXP, RELOAD, RENAME, REPAIR, REPLACE, REPLICATION, RESTRICT, REWRITE, RLIKE, ROLE, ROLES, SCHEMA, SCHEMAS, SECOND, SEMI, SERDE, SERDEPROPERTIES, SERVER, SETS, SHARED, SHOW, SHOW_DATABASE, SKEWED, SORT, SORTED, SSL, STATISTICS, STORED, STREAMTABLE, STRING, STRUCT, TABLES, TBLPROPERTIES, TEMPORARY, TERMINATED, TINYINT, TOUCH, TRANSACTIONS, UNARCHIVE, UNDO, UNIONTYPE, UNLOCK, UNSET, UNSIGNED, URI, USE, UTC, UTCTIMESTAMP, VALUE_TYPE, VIEW, WHILE, YEAR
ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CASE, CAST, CHAR, COLUMN, CONF, CREATE, CROSS, CUBE, CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DECIMAL, DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED, EXTERNAL, FALSE, FETCH, FLOAT, FOLLOWING, FOR, FROM, FULL, FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE, NONE, NOT, NULL, OF, ON, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN, PARTITION, PERCENT, PRECEDING, PRESERVE, PROCEDURE, RANGE, READS, REDUCE, REVOKE, RIGHT, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT, TABLE, TABLESAMPLE, THEN, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE, UNBOUNDED, UNION, UNIQUEJOIN, UPDATE, USER, USING, UTC_TMESTAMP, VALUES, VARCHAR, WHEN, WHERE, WINDOW, WITH
Hive 2.0.0
removed: REGEXP, RLIKE
added: AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE
added: COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START
Hive 2.1.0
added: ABORT, KEY, LAST, NORELY, NOVALIDATE, NULLS, RELY, VALIDATE
added: CACHE, CONSTRAINT, FOREIGN, PRIMARY, REFERENCES
Hive 2.2.0
added: DETAIL, DOW, EXPRESSION, OPERATOR, QUARTER, SUMMARY, VECTORIZATION, WEEK, YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, SECONDS
added: DAYOFWEEK, EXTRACT, FLOOR, INTEGER, PRECISION, VIEWS
Hive 3.0.0
added: TIMESTAMPTZ, ZONE
added: TIME, NUMERIC, SYNC
Version information

REGEXP and RLIKE are non-reserved keywords prior to Hive 2.0.0 and reserved keywords starting in Hive 2.0.0 (HIVE-11703).

Create Database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

Drop Database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

Alter Database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)
  
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path; -- (Note: Hive 2.2.1, 2.4.0 and later)

Use Database

USE database_name;
USE DEFAULT;
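Putting the database statements together (a minimal sketch; the database name and properties are illustrative):

CREATE DATABASE IF NOT EXISTS sales_db
  COMMENT 'example database'
  WITH DBPROPERTIES ('creator'='etl');
USE sales_db;
ALTER DATABASE sales_db SET DBPROPERTIES ('creator'='analytics');
USE DEFAULT;
DROP DATABASE IF EXISTS sales_db CASCADE;  -- CASCADE also drops the tables it contains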

Create Table



CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
array_type
  : ARRAY < data_type >
map_type
  : MAP < primitive_type, data_type >
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
column_constraint_specification:
  : [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK  [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
default_value:
  : [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ] 
constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE 
    [, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]

  • TBLPROPERTIES ("comment"="table_comment")
  • TBLPROPERTIES ("hbase.table.name"="table_name") – see HBase Integration.
  • TBLPROPERTIES ("immutable"="true") or ("immutable"="false") in release 0.13.0+ (HIVE-6406) – see Inserting Data into Hive Tables from Queries.
  • TBLPROPERTIES ("orc.compress"="ZLIB") or ("orc.compress"="SNAPPY") or ("orc.compress"="NONE") and other ORC properties – see ORC Files.
  • TBLPROPERTIES ("transactional"="true") or ("transactional"="false") in release 0.14.0+, the default is "false" – see Hive Transactions.
  • TBLPROPERTIES ("NO_AUTO_COMPACTION"="true") or ("NO_AUTO_COMPACTION"="false"), the default is "false" – see Hive Transactions.
  • TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="mapper_memory") – see Hive Transactions.
  • TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.num.threshold"="threshold_num") – see Hive Transactions.
  • TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.pct.threshold"="threshold_pct") – see Hive Transactions.
  • TBLPROPERTIES ("auto.purge"="true") or ("auto.purge"="false") in release 1.2.0+ (HIVE-9118) – see Drop Table, Drop Partitions, Truncate Table, and Insert Overwrite.
  • TBLPROPERTIES ("EXTERNAL"="TRUE") in release 0.6.0+ (HIVE-1329) Change a managed table to an external table and vice versa for "FALSE".

JSON

ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
Stored as plain text file in JSON format.

The JsonSerDe for JSON files is available in Hive 0.12 and later.


In some distributions, a reference to hive-hcatalog-core.jar is required.
ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;

CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

The JsonSerDe was moved to Hive from HCatalog; before that it was in the hive-contrib project. It was added to the Hive distribution by HIVE-4895.
An Amazon SerDe is available at 
s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar for releases prior to 0.12.0.

Starting in Hive 3.0.0, JsonSerDe is added to Hive Serde as "org.apache.hadoop.hive.serde2.JsonSerDe" (HIVE-19211).
CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;
Or STORED AS JSONFILE is supported starting in Hive 4.0.0 (HIVE-19899), so you can create table as follows:
CREATE TABLE my_table(a string, b bigint, ...) STORED AS JSONFILE;
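With either JsonSerDe, each line of the underlying text file must hold one complete JSON object whose keys match the column names; illustrative data for my_table(a string, b bigint):

{"a":"hello","b":1}
{"a":"world","b":2}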
CSV/TSV

ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
Stored as plain text file in CSV / TSV format.
 
The CSVSerde is available in Hive 0.14 and greater.
The following example creates a TSV (Tab-separated) file.

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",
   "quoteChar"     = "'",
   "escapeChar"    = "\\"
)  
STORED AS TEXTFILE;
The default SerDe properties handle comma-separated (CSV) files:
 
DEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER  "
DEFAULT_SEPARATOR        ,
This SerDe works for most CSV data, but does not handle embedded newlines. To use the SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde.  
Documentation is based on original documentation at https://github.com/ogrodnek/csv-serde.

Limitations
This SerDe treats all columns as type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output shows string column types. The type information is retrieved from the SerDe.

To convert columns to the desired type in a table, you can create a view over the table that does the CAST to the desired type.
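For example, for the my_table definition above (a sketch; the view name and target type are illustrative):

CREATE VIEW my_table_typed AS
  SELECT a, CAST(b AS INT) AS b FROM my_table;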


The CSV SerDe is based on https://github.com/ogrodnek/csv-serde, and was added to the Hive distribution in HIVE-7777.
The CSVSerde has been built and tested against Hive 0.14 and later, and uses Open-CSV 2.3 which is bundled with the Hive distribution.

For general information about SerDes, see 
Hive SerDe in the Developer Guide. Also see SerDe for details about input and output processing.


Storage Format
Description
STORED AS TEXTFILE
Stored as plain text files. TEXTFILE is the default file format, unless the configuration parameter hive.default.fileformat has a different setting.
Use the DELIMITED clause to read delimited files.
Enable escaping for the delimiter characters by using the 'ESCAPED BY' clause (such as ESCAPED BY '\')
Escaping is needed if you want to work with data that can contain these delimiter characters.

A custom NULL format can also be specified using the 'NULL DEFINED AS' clause (default is '\N').
STORED AS SEQUENCEFILE
Stored as compressed Sequence File.
STORED AS ORC
Stored as ORC file format. Supports ACID Transactions & Cost-based Optimizer (CBO). Stores column-level metadata.
STORED AS PARQUET
Stored as Parquet format for the Parquet columnar storage format, in Hive 0.13.0 and later; use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... syntax in Hive 0.10, 0.11, or 0.12.
STORED AS AVRO
Stored as Avro format in Hive 0.14.0 and later (see Avro SerDe).
STORED AS RCFILE
Stored as Record Columnar File format.
STORED AS JSONFILE
Stored as Json file format in Hive 4.0.0 and later.
STORED BY
Stored by a non-native table format. To create or link to a non-native table, for example a table backed by HBase or Druid or Accumulo. See StorageHandlers for more information on this option.
INPUTFORMAT and OUTPUTFORMAT
In the file_format, specify the name of a corresponding InputFormat and OutputFormat class as a string literal.

For example, 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.

For LZO compression, the values to use are
'INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"'

(see LZO Compression).


Partitioned Tables
A partition column becomes a pseudocolumn you can query on; it is not stored with the table data. If the data itself contains a column you also want to partition on, you must give the table column a different name, otherwise Hive reports an error about the column being repeated in the partitioning columns. For example, suppose your data has the columns:
id     int,
date   date,
name   varchar
To partition on date, rename the data column (users should not query on it):
Example:
create table table_name (
  id                int,
  dtDontQuery       string,
  name              string
)
partitioned by (date string)

Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 STORED AS SEQUENCEFILE;
Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '<hdfs_location>';

CTAS has these restrictions:
  • The target table cannot be an external table.
  • The target table cannot be a list bucketing table.
Example:
CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile
   AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

CREATE TABLE empty_key_value_store
LIKE key_value_store [TBLPROPERTIES (property_name=property_value, ...)];

Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

CREATE TABLE list_bucket_single (key STRING, value STRING)
  SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES];
CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING)
  SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78)) [STORED AS DIRECTORIES];

Example:
CREATE TEMPORARY TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING);
A transactional table supports operations with ACID semantics. See Hive Transactions for more details about transactional tables.
Example:
CREATE TRANSACTIONAL TABLE transactional_table_test(key string, value string) PARTITIONED BY(ds string) STORED AS ORC;
create table pk(id1 integer, id2 integer,
  primary key(id1, id2) disable novalidate);

create table fk(id1 integer, id2 integer,
  constraint c1 foreign key(id1, id2) references pk(id2, id1) disable novalidate);
create table constraints1(id1 integer UNIQUE disable novalidate, id2 integer NOT NULL,
  usr string DEFAULT current_user(), price double CHECK (price > 0 AND price <= 1000));

create table constraints2(id1 integer, id2 integer,
  constraint c1_unique UNIQUE(id1) disable novalidate);

create table constraints3(id1 integer, id2 integer,
  constraint c1_check CHECK(id1 + id2 > 0));
Drop Table
DROP TABLE [IF EXISTS] table_name [PURGE];     -- (Note: PURGE available in Hive 0.14.0 and later)
TRUNCATE TABLE table_name [PARTITION partition_spec];

partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)
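For example, using the page_view table defined earlier (the partition values are illustrative):

TRUNCATE TABLE page_view PARTITION (dt='2008-08-08', country='US');
DROP TABLE IF EXISTS page_view PURGE;  -- PURGE skips the trash, so the data cannot be restored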




