Monday, September 30, 2019

Hive Expert part 1



Commands are non-SQL statements such as setting a property or adding a resource. They can be used in HiveQL scripts or directly in the CLI or Beeline.
Command
Description
quit
exit
Use quit or exit to leave the interactive shell.
reset
Resets the configuration to the default values (as of Hive 0.10: see HIVE-3202). Any configuration parameters that were set using the set command or the -hiveconf parameter on the hive command line will be reset to their default values.
Note that this does not apply to configuration parameters that were set in the set command using the "hiveconf:" prefix for the key name (for historic reasons).
set <key>=<value>
Sets the value of a particular configuration variable (key).
Note: If you misspell the variable name, the CLI will not show an error.
set
Prints a list of configuration variables that are overridden by the user or Hive.
set -v
Prints all Hadoop and Hive configuration variables.
add FILE[S] <filepath> <filepath>*
add JAR[S] <filepath> <filepath>*
add ARCHIVE[S] <filepath> <filepath>*
Adds one or more files, jars, or archives to the list of resources in the distributed cache. See Hive Resources for more information.
add FILE[S] <ivyurl> <ivyurl>*
add JAR[S] <ivyurl> <ivyurl>*
add ARCHIVE[S] <ivyurl> <ivyurl>*
As of Hive 1.2.0, adds one or more files, jars or archives to the list of resources in the distributed cache using an Ivy URL of the form ivy://group:module:version?query_string. See Hive Resources for more information.
list FILE[S]
list JAR[S]
list ARCHIVE[S]
Lists the resources already added to the distributed cache. See Hive Resources for more information.
list FILE[S] <filepath>*
list JAR[S] <filepath>*
list ARCHIVE[S] <filepath>*
Checks whether the given resources are already added to the distributed cache or not. See Hive Resources for more information.
delete FILE[S] <filepath>*
delete JAR[S] <filepath>*
delete ARCHIVE[S] <filepath>*
Removes the resource(s) from the distributed cache.
delete FILE[S] <ivyurl> <ivyurl>*
delete JAR[S] <ivyurl> <ivyurl>*
delete ARCHIVE[S] <ivyurl> <ivyurl>*
As of Hive 1.2.0, removes the resource(s) which were added using the <ivyurl> from the distributed cache. See Hive Resources for more information.
! <command>
Executes a shell command from the Hive shell.
dfs <dfs command>
Executes a dfs command from the Hive shell.
<query string>
Executes a Hive query and prints results to standard output.
source FILE <filepath>
Executes a script file inside the CLI.
compile `<groovy string>` AS GROOVY NAMED <name>
This allows inline Groovy code to be compiled and be used as a UDF (as of Hive 0.13.0). For a usage example, see Nov. 2013 Hive Contributors Meetup Presentations – Using Dynamic Compilation with Hive.
Sample Usage:
  hive> set mapred.reduce.tasks=32;
  hive> set;
  hive> select a.* from tab1 a;
  hive> !ls;
  hive> dfs -ls;

Hive Command Line Options

To get help, run "hive -H" or "hive --help". Usage (as it is in Hive 0.9.0):
usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    Connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to Hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        Connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

Examples

See Variable Substitution for examples of using the hiveconf option.
  • Example of running a query from the command line
    $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
  • Example of setting Hive configuration variables
    $HIVE_HOME/bin/hive -e 'select a.col from tab1 a' --hiveconf hive.exec.scratchdir=/home/my/hive_scratch  --hiveconf mapred.reduce.tasks=32
  • Example of dumping data out from a query into a file using silent mode
    $HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt
  • Example of running a script non-interactively from local disk
    $HIVE_HOME/bin/hive -f /home/my/hive-script.sql
  • Example of running a script non-interactively from a Hadoop supported filesystem (starting in Hive 0.14)
    $HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
    $HIVE_HOME/bin/hive -f s3://mys3bucket/s3-script.sql 
  • Example of running an initialization script before entering interactive mode
    $HIVE_HOME/bin/hive -i /home/my/hive-init.sql
  • Example of sending log output to the console (by default Hive writes its log to /tmp/<user.name>/hive.log):
$HIVE_HOME/bin/hive --hiveconf hive.root.logger=INFO,console
hive.root.logger specifies the logging level as well as the log destination. Specifying console as the target sends the logs to standard error (instead of the log file).
Hive Batch Mode Commands
When $HIVE_HOME/bin/hive is run with the -e or -f option, it executes SQL commands in batch mode.
  • hive -e '<query-string>' executes the query string.
  • hive -f <filepath> executes one or more SQL queries from a file.
Version 0.14

As of Hive 0.14, <filepath> can be from one of the Hadoop supported filesystems (HDFS, S3, etc.) as well.
$HIVE_HOME/bin/hive -f hdfs://<namenode>:<port>/hive-script.sql
$HIVE_HOME/bin/hive -f s3://mys3bucket/s3-script.sql
  hive> add FILE /tmp/tt.py;
  hive> list FILES;
  /tmp/tt.py
  hive> SELECT TRANSFORM(a.networkid)
               USING 'python tt.py' AS nn
        FROM networks a WHERE a.ds = '2009-01-04' LIMIT 10;



In remote mode HiveServer2 only accepts valid Thrift calls – even in HTTP mode, the message body contains Thrift payloads.

% beeline -u jdbc:hive2://localhost:10000/default -n scott -w password_file
Hive version 0.11.0-SNAPSHOT by Apache

Connecting to jdbc:hive2://localhost:10000/default
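Beeline can also be started without arguments and a connection opened from within the shell using the !connect command (a minimal sketch; host, port, and credentials are placeholders):

% beeline
beeline> !connect jdbc:hive2://localhost:10000 scott tiger

If the username or password is omitted, Beeline prompts for it.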

Beeline Command Options

The Beeline CLI supports these command line options:
Option
Description
-u <database url>
The JDBC URL to connect to. Special characters in parameter values should be encoded with URL encoding if needed.
Usage: beeline -u db_URL 
-r
Reconnect to last used URL (if a user has previously used !connect to a URL and used !save to a beeline.properties file).
Usage: beeline -r
Version: 2.1.0 (HIVE-13670)
-n <username>
The username to connect as.
Usage: beeline -n valid_user
-p <password>
The password to connect as.
Usage: beeline -p valid_password
Optional password mode:
Starting with Hive 2.2.0 (HIVE-13589) the argument for the -p option is optional.
Usage : beeline -p [valid_password]
If the password is not provided after -p, Beeline will prompt for it while initiating the connection. When a password is provided, Beeline uses it to initiate the connection without prompting.
-d <driver class>
The driver class to use.
Usage: beeline -d driver_class
-e <query>
Query that should be executed. Double or single quotes enclose the query string. This option can be specified multiple times.
Usage: beeline -e "query_string"
Support to run multiple SQL statements separated by semicolons in a single query_string: 1.2.0 (HIVE-9877)
Bug fix (null pointer exception): 0.13.0 (HIVE-5765)
Bug fix (--headerInterval not honored): 0.14.0 (HIVE-7647)
Bug fix (running -e in background): 1.3.0 and 2.0.0 (HIVE-6758); workaround available for earlier versions
-f <script file>
Script file that should be executed.
Usage: beeline -f filepath
Version: 0.12.0 (HIVE-4268)
Note: If the script contains tabs, query compilation fails in version 0.12.0. This bug is fixed in version 0.13.0 (HIVE-6359).
Bug fix (running -f in background): 1.3.0 and 2.0.0 (HIVE-6758); workaround available for earlier versions
-i (or) --init <init file1> <init file2>
The init file(s) for initialization
Usage: beeline -i /tmp/initfile
Single file:
Version: 0.14.0 (HIVE-6561)
Multiple files:
Version: 2.1.0 (HIVE-11336)
-w (or) --password-file <password file>
The password file to read password from.
Version: 1.2.0 (HIVE-7175)
-a (or) --authType <auth type>
The authentication type passed to the jdbc as an auth property
Version: 0.13.0 (HIVE-5155)
--property-file <file>
File to read configuration properties from
Usage: beeline --property-file /tmp/a
Version: 2.2.0 (HIVE-13964)
--hiveconf property=value
Use value for the given configuration property. Properties that are listed in hive.conf.restricted.list cannot be reset with hiveconf (see Restricted List and Whitelist).
Usage: beeline --hiveconf prop1=value1
Version: 0.13.0 (HIVE-6173)
--hivevar name=value
Hive variable name and value. This is a Hive-specific setting in which variables can be set at the session level and referenced in Hive commands or queries.
Usage: beeline --hivevar var1=value1
--color=[true/false]
Control whether color is used for display. Default is false.
Usage: beeline --color=true
(Not supported for Separated-Value Output formats. See HIVE-9770)
--showHeader=[true/false]
Show column names in query results (true) or not (false). Default is true.
Usage: beeline --showHeader=false
--headerInterval=ROWS
The interval for redisplaying column headers, in number of rows, when outputformat is table. Default is 100.
Usage: beeline --headerInterval=50
(Not supported for Separated-Value Output formats. See HIVE-9770)
--fastConnect=[true/false]
When connecting, skip building a list of all tables and columns for tab-completion of HiveQL statements (true) or build the list (false). Default is true.
Usage: beeline --fastConnect=false
--autoCommit=[true/false]
Enable/disable automatic transaction commit. Default is false.
Usage: beeline --autoCommit=true
--verbose=[true/false]
Show verbose error messages and debug information (true) or do not show (false). Default is false.
Usage: beeline --verbose=true
--showWarnings=[true/false]
Display warnings that are reported on the connection after issuing any HiveQL commands. Default is false.
Usage: beeline --showWarnings=true
--showDbInPrompt=[true/false]
Display the current database name in prompt. Default is false.
Usage: beeline --showDbInPrompt=true
Version: 2.2.0 (HIVE-14123)
--showNestedErrs=[true/false]
Display nested errors. Default is false.
Usage: beeline --showNestedErrs=true
--numberFormat=[pattern]
Format numbers using a DecimalFormat pattern.
Usage: beeline --numberFormat="#,###,##0.00"
--force=[true/false]
Continue running script even after errors (true) or do not continue (false). Default is false.
Usage: beeline --force=true
--maxWidth=MAXWIDTH
The maximum width to display before truncating data, in characters, when outputformat is table. Default is to query the terminal for current width, then fall back to 80.
Usage: beeline --maxWidth=150
--maxColumnWidth=MAXCOLWIDTH
The maximum column width, in characters, when outputformat is table. Default is 50 in Hive version 2.2.0+ (see HIVE-14135) or 15 in earlier versions.
Usage: beeline --maxColumnWidth=25
--silent=[true/false]
Reduce the amount of informational messages displayed (true) or not (false). It also stops displaying the log messages for the query from HiveServer2 (Hive 0.14 and later) and the HiveQL commands (Hive 1.2.0 and later). Default is false.
Usage: beeline --silent=true
--autosave=[true/false]
Automatically save preferences (true) or do not autosave (false). Default is false.
Usage: beeline --autosave=true
--outputformat=[table/vertical/csv/tsv/dsv/csv2/tsv2]
Format mode for result display. Default is table. See Separated-Value Output Formats below for description of recommended sv options.
Usage: beeline --outputformat=tsv
Version: dsv/csv2/tsv2 added in 0.14.0 (HIVE-8615)
--truncateTable=[true/false]
If true, truncates table column in the console when it exceeds console length.
Version: 0.14.0 (HIVE-6928)
--delimiterForDSV= DELIMITER
The delimiter for delimiter-separated values output format. Default is '|' character.
Version: 0.14.0 (HIVE-7390)
--isolation=LEVEL
Set the transaction isolation level to TRANSACTION_READ_COMMITTED or TRANSACTION_SERIALIZABLE. See the "Field Detail" section in the Java Connection documentation.
Usage: beeline --isolation=TRANSACTION_SERIALIZABLE
--nullemptystring=[true/false]
Use historic behavior of printing null as empty string (true) or use current behavior of printing null as NULL (false). Default is false.
Usage: beeline --nullemptystring=false
Version: 0.13.0 (HIVE-4485)
--incremental=[true/false]
Defaults to true from Hive 2.3 onwards; before that it defaulted to false. When set to false, the entire result set is fetched and buffered before being displayed, yielding optimal display column sizing. When set to true, result rows are displayed immediately as they are fetched, yielding lower latency and memory usage at the price of extra display column padding. Setting --incremental=true is recommended if you encounter an OutOfMemory on the client side (due to the fetched result set size being large).
--incrementalBufferRows=NUMROWS
The number of rows to buffer when printing rows on stdout, defaults to 1000; only applicable if --incremental=true and --outputformat=table
Usage: beeline --incrementalBufferRows=1000
Version: 2.3.0 (HIVE-14170)
--maxHistoryRows=NUMROWS
The maximum number of rows to store in the Beeline history.
Version: 2.3.0 (HIVE-15166)
--delimiter=;
Set the delimiter for queries written in Beeline. Multi-char delimiters are allowed, but quotation marks, slashes, and -- are not allowed. Defaults to ;
Usage: beeline --delimiter=$$
Version: 3.0.0 (HIVE-10865)
--convertBinaryArrayToString=[true/false]
Display binary column data as string or as byte array. 
Usage: beeline --convertBinaryArrayToString=true
Version: 3.0.0 (HIVE-14786)
--help
Display a usage message.
Usage: beeline --help
nohup beeline --silent=true --showHeader=true --outputformat=dsv -f query.hql > /tmp/output.log 2> /tmp/error.log &
"1.0"?>
"text/xsl" href="configuration.xsl"?>
  beeline.hs2.connection.user
  hive
  beeline.hs2.connection.password
  hive


"1.0"?>

"text/xsl" href="configuration.xsl"?>



  beeline.hs2.connection.hosts

  localhost:10000
  beeline.hs2.connection.principal
  hive/dummy-hostname@domain.com


"1.0"?>

"text/xsl" href="configuration.xsl"?>



  beeline.hs2.connection.user

  hive

  beeline.hs2.connection.hiveconf
  hive.cli.print.current.db=true, hive.cli.print.header=true
  beeline.hs2.connection.hivevar
  testVarName1=value1, testVarName2=value2

"1.0"?>
"text/xsl" href="configuration.xsl"?>
  beeline.hs2.jdbc.url.tcpUrl
  jdbc:hive2://localhost:10000/default;user=hive;password=hive
  beeline.hs2.jdbc.url.httpUrl
  jdbc:hive2://localhost:10000/default;user=hive;password=hive;transportMode=http;httpPath=cliservice
  beeline.hs2.jdbc.url.default
  tcpUrl


"1.0"?>
"text/xsl" href="configuration.xsl"?>
  beeline.hs2.connection.user
  hive
  beeline.hs2.connection.password
  hive
Consider the following beeline-site.xml:
"1.0"?>
"text/xsl" href="configuration.xsl"?>
  beeline.hs2.jdbc.url.tcpUrl
  jdbc:hive2://localhost:10000/default
  beeline.hs2.jdbc.url.httpUrl
  jdbc:hive2://localhost:10000/default;transportMode=http;httpPath=cliservice
  beeline.hs2.jdbc.url.default
  tcpUrl


Using JDBC


You can use JDBC to access data stored in a relational database or other tabular format.

  1. Load the HiveServer2 JDBC driver. As of 1.2.0 applications no longer need to explicitly load JDBC drivers using Class.forName().

    For example:
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    
  2. Connect to the database by creating a Connection object with the JDBC driver.

    For example:
    Connection cnct = DriverManager.getConnection("jdbc:hive2://<host>:<port>", "<user>", "<password>");
    
    The default <port> is 10000. In non-secure configurations, specify a <user> for the query to run as. The <password> field value is ignored in non-secure mode.
    Connection cnct = DriverManager.getConnection("jdbc:hive2://<host>:<port>", "<user>", "");
    
    In Kerberos secure mode, the user information is based on the Kerberos credentials.
  3. Submit SQL to the database by creating a Statement object and using its executeQuery() method.

    For example:
    Statement stmt = cnct.createStatement();
    ResultSet rset = stmt.executeQuery("SELECT foo FROM bar");
    
  4. Process the result set, if necessary.

JDBC Client Sample Code

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveJdbcClient {
  private static String driverName = "org.apache.hive.jdbc.HiveDriver";
  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }
    // replace "hive" here with the name of the user the queries should run as
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.execute("drop table if exists " + tableName);
    stmt.execute("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    System.out.println("Running: " + sql);
    ResultSet res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
       // describe table
    sql = "describe " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }
    // load data into table
    // NOTE: filepath has to be local to the hive server
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    System.out.println("Running: " + sql);
    stmt.execute(sql);
    // select * query
    sql = "select * from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }
    // regular hive query
    sql = "select count(1) from " + tableName;
    System.out.println("Running: " + sql);
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}

Running the JDBC Sample Code

# Then on the command-line
$ javac HiveJdbcClient.java
# To run the program using remote hiveserver in non-kerberos mode, we need the following jars in the classpath
# from hive/build/dist/lib
#     hive-jdbc*.jar
#     hive-service*.jar
#     libfb303-0.9.0.jar
#        libthrift-0.9.0.jar
#     log4j-1.2.16.jar
#     slf4j-api-1.6.1.jar
#    slf4j-log4j12-1.6.1.jar
#     commons-logging-1.0.4.jar
#
#
# To run the program using kerberos secure mode, we need the following jars in the classpath
#     hive-exec*.jar
#     commons-configuration-1.6.jar (This is not needed with Hadoop 2.6.x and later).
#  and from hadoop
#     hadoop-core*.jar (use hadoop-common*.jar for Hadoop 2.x)
#
# To run the program in embedded mode, we need the following additional jars in the classpath
# from hive/build/dist/lib
#     hive-exec*.jar
#     hive-metastore*.jar
#     antlr-runtime-3.0.1.jar
#     derby.jar
#     jdo2-api-2.1.jar
#     jpox-core-1.2.2.jar
#     jpox-rdbms-1.2.2.jar
# and from hadoop/build
#     hadoop-core*.jar
# as well as hive/build/dist/conf, any HIVE_AUX_JARS_PATH set, 
# and hadoop jars necessary to run MR jobs (eg lzo codec)
$ java -cp $CLASSPATH HiveJdbcClient
Alternatively, you can run the following bash script, which will seed the data file and build your classpath before invoking the client. The script adds all the additional jars needed for using HiveServer2 in embedded mode as well.
#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive
echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt
HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-core*.jar)
CLASSPATH=.:$HIVE_HOME/conf:$(hadoop classpath)
for i in ${HIVE_HOME}/lib/*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done
java -cp $CLASSPATH HiveJdbcClient

$ a=b
$ hive -e " describe $a "
$ bin/hive --hiveconf a=b -e 'set a; set hiveconf:a; \
create table if not exists b (col int); describe ${hiveconf:a}'

Hive history file=/tmp/edward/hive_job_log_edward_201011240906_1463048967.txt
a=b
hiveconf:a=b
OK
Time taken: 5.913 seconds
OK
col int
Time taken: 0.754 seconds

There are three namespaces for variables: hiveconf, system, and env. A variable set in the hiveconf namespace, for example:

set x=myvalue

can later be referenced as ${hiveconf:x}.

set zzz=5;
--  sets zzz=5
set zzz;
set system:xxx=5;
set system:xxx;
-- sets a system property xxx to 5
set system:yyy=${system:xxx};
set system:yyy;
-- sets yyy with value of xxx
set go=${hiveconf:zzz};
set go;
-- sets go based on the value of zzz
set hive.variable.substitute=false;
set raw=${hiveconf:zzz};
set raw;
-- with substitution disabled, raw is set to the literal string ${hiveconf:zzz}
set hive.variable.substitute=true;
EXPLAIN SELECT * FROM src where key=${hiveconf:zzz};
SELECT * FROM src where key=${hiveconf:zzz};
--use a variable in a query
set a=1;
set b=a;
set c=${hiveconf:${hiveconf:b}};
set c;
--uses nested variables.
set jar=../lib/derby.jar;
add file ${hiveconf:jar};
list file;
delete file ${hiveconf:jar};
list file;
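The env namespace exposes the environment of the shell that started Hive; unlike hiveconf and system variables, env variables are read-only inside a session. A small sketch, assuming HOME is set in the calling shell:

set env:HOME;
-- prints the value of the HOME environment variable
SELECT '${env:HOME}' FROM src LIMIT 1;
-- the value is substituted into the query text before execution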

The HCatalog command line interface (CLI) can be invoked as HIVE_HOME=hive_home hcat_home/bin/hcat where hive_home is the directory where Hive has been installed and hcat_home is the directory where HCatalog has been installed.


HCatalog CLI

The HCatalog CLI supports these command line options:
Option
Usage
Description

(none)
hcat
Prints a usage message.
-D
hcat -Dkey=value ...
Passes the key-value pair to HCatalog as a Java System Property.
-e
hcat -e 'create table mytable(a int);' ...
Tells HCatalog to treat the following string as a DDL command and execute it.
-f
hcat -f myscript.hcatalog ...
Tells HCatalog that myscript.hcatalog is a file containing DDL commands to execute.
-g
hcat -g mygroup ...
Tells HCatalog that the table which needs to be created must have group "mygroup".
-p
hcat -p rwxr-xr-x ...
Tells HCatalog that the table which needs to be created must have permissions "rwxr-xr-x".
Note the following:
  • The -g and -p options are not mandatory.
  • Only one -e or -f option can be provided, not both.
  • The order of options is immaterial; you can specify the options in any order.
If no option is provided, then a usage message is printed:
Usage:  hcat  { -e "<query>" | -f <filepath> }  [-g <group>] [-p <perms>] [-D<name>=<value>]
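For example, creating a table with a group and permissions in a single invocation (a sketch combining the options described above):

hcat -e "create table mytable(a int);" -g mygroup -p rwxr-xr-x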
The HCatalog CLI supports Hive DDL except operations that require running a MapReduce job; the following commands are not supported:
  • ALTER INDEX ... REBUILD
  • CREATE TABLE ... AS SELECT
  • ALTER TABLE ... CONCATENATE
  • ALTER TABLE ARCHIVE/UNARCHIVE PARTITION
  • ANALYZE TABLE ... COMPUTE STATISTICS
  • IMPORT FROM ...
  • EXPORT TABLE



CREATE TABLE kst
  PARTITIONED BY (ds string)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='http://schema_provider/kst.avsc');

CREATE TABLE kst (
    string1 string,
    string2 string,
    int1 int,
    boolean1 boolean,
    long1 bigint,
    float1 float,
    double1 double,
    inner_record1 struct<int_in_inner_record1:int, string_in_inner_record1:string>,
    enum1 string,
    array1 array<string>,
    map1 map<string,string>,
    union1 uniontype<float, boolean, string>,
    fixed1 binary,
    null1 void,
    unionnullint int,
    bytes1 binary)
  PARTITIONED BY (ds string)
  STORED AS AVRO;

CREATE TABLE test_serializer(string1 STRING,
                             int1 INT,
                             tinyint1 TINYINT,
                             smallint1 SMALLINT,
                             bigint1 BIGINT,
                             boolean1 BOOLEAN,
                             float1 FLOAT,
                             double1 DOUBLE,
                             list1 ARRAY<STRING>,
                             map1 MAP<STRING,INT>,
                             struct1 STRUCT<sInt:INT,sBoolean:BOOLEAN,sString:STRING>,
                             union1 UNIONTYPE<FLOAT, BOOLEAN, STRING>,
                             enum1 STRING,
                             nullableint INT,
                             bytes1 BINARY,
                             fixed1 BINARY)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n'
 STORED AS TEXTFILE;


{
  "namespace": "com.linkedin.haivvreo",
  "name": "test_serializer",
  "type": "record",
  "fields": [
    { "name":"string1", "type":"string" },
    { "name":"int1", "type":"int" },
    { "name":"tinyint1", "type":"int" },
    { "name":"smallint1", "type":"int" },
    { "name":"bigint1", "type":"long" },
    { "name":"boolean1", "type":"boolean" },
    { "name":"float1", "type":"float" },
    { "name":"double1", "type":"double" },
    { "name":"list1", "type":{"type":"array", "items":"string"} },
    { "name":"map1", "type":{"type":"map", "values":"int"} },
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [
          { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },
    { "name":"union1", "type":["float", "boolean", "string"] },
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },
    { "name":"nullableint", "type":["int", "null"] },
    { "name":"bytes1", "type":"bytes" },
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }
  ] }


CREATE TABLE as_avro
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');
 
INSERT OVERWRITE TABLE as_avro SELECT * FROM test_serializer;


CREATE TABLE as_avro(string1 STRING,
                     int1 INT,
                     tinyint1 TINYINT,
                     smallint1 SMALLINT,
                     bigint1 BIGINT,
                     boolean1 BOOLEAN,
                     float1 FLOAT,
                     double1 DOUBLE,
                     list1 ARRAY<STRING>,
                     map1 MAP<STRING,INT>,
                     struct1 STRUCT<sInt:INT,sBoolean:BOOLEAN,sString:STRING>,
                     union1 UNIONTYPE<FLOAT, BOOLEAN, STRING>,
                     enum1 STRING,
                     nullableint INT,
                     bytes1 BINARY,
                     fixed1 BINARY)
STORED AS AVRO;
INSERT OVERWRITE TABLE as_avro SELECT * FROM test_serializer;

CREATE TABLE embedded
  COMMENT "just drop the schema right into the HQL"
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.literal'='{
      "namespace": "com.howdy",
      "name": "some_schema",
      "type": "record",
      "fields": [ { "name":"string1","type":"string"}]
    }');

The Avro schema URL can also be supplied at invocation time through a hiveconf variable:
hive --hiveconf schema="${SCHEMA}" -f your_script_file.sql
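Inside your_script_file.sql the variable can then be referenced through variable substitution. A sketch, assuming the hypothetical table name avro_with_variable_schema:

CREATE TABLE avro_with_variable_schema
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='${hiveconf:schema}');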

ORC Files

Using ORC files improves performance when Hive is reading, writing, and processing data. The ORC file format provides:
  • a single file as the output of each task, which reduces the NameNode's load
  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
  • light-weight indexes stored within the file
    • skip row groups that don't pass predicate filtering
    • seek to a given row
  • block-mode compression based on data type
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • concurrent reads of the same file using separate RecordReaders
  • ability to split files without scanning for markers
  • bound the amount of memory needed for reading or writing
  • metadata stored using Protocol Buffers, which allows addition and removal of fields


An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file a postscript holds compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates count, min, max, and sum.

(Diagram: ORC file structure, showing stripes of row data, the file footer, and the postscript.)


  • CREATE TABLE ... STORED AS ORC
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc
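For example, an ORC table with the compression codec chosen through table properties (a minimal sketch; the table name and columns are illustrative):

CREATE TABLE orc_example (id INT, name STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='SNAPPY');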


// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
    [--backup-path <new-path>] <location-of-orc-file-or-directory>
Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).
Specifying --rowindex with a comma separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).
Specifying -t in the command will print the timezone id of the writer.
Specifying -j in the command will print the ORC file metadata in JSON format. To pretty print the JSON metadata, add -p to the command.
Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.
Specifying --skip-dump along with --recover will perform recovery without dumping metadata.
Specifying --backup-path with a new-path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).
<location-of-orc-file> is the URI of the ORC file.
<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
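For example, dumping the metadata plus the row indexes of the first two columns (the warehouse path is illustrative):

hive --orcfiledump --rowindex 1,2 /user/hive/warehouse/orc_example/000000_0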

PARQUET:

CREATE TABLE parquet_test (
id int,
str string,
mp MAP<STRING,STRING>,
lst ARRAY<STRING>,
strct STRUCT<A:STRING,B:STRING>)
PARTITIONED BY (part string)
STORED AS PARQUET;

Compressed Data Storage
CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

CREATE TABLE raw (line STRING)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

CREATE TABLE raw_sequence (line STRING)
   STORED AS SEQUENCEFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)

INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;




core-site.xml

Add the following LZO codec classes to your core-site.xml:
  • com.hadoop.compression.lzo.LzoCodec
  • com.hadoop.compression.lzo.LzopCodec
For example:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Next, set the following properties so that Hive writes LZO-compressed output:

SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
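With the codecs configured, a table over LZO-compressed text can be declared with the LZO input/output formats (a sketch; the table name and location are illustrative — the class names match the LZO values listed in the storage-format table below):

CREATE EXTERNAL TABLE lzo_weblogs (line STRING)
  STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/data/weblogs_lzo';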

Overview


HiveQL DDL statements are documented here, including:

  • CREATE DATABASE/SCHEMA, TABLE, VIEW, FUNCTION, INDEX
  • DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
  • TRUNCATE TABLE
  • ALTER DATABASE/SCHEMA, TABLE, VIEW
  • MSCK REPAIR TABLE (or ALTER TABLE RECOVER PARTITIONS)
  • SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS, PARTITIONS, FUNCTIONS, INDEX[ES], COLUMNS, CREATE TABLE
  • DESCRIBE DATABASE/SCHEMA, table_name, view_name, materialized_view_name

PARTITION statements are usually options of TABLE statements, except for SHOW PARTITIONS.

Keywords, Non-reserved Keywords and Reserved Keywords


All Keywords

Version

Non-reserved Keywords

Reserved Keywords

Hive 1.2.0
ADD, ADMIN, AFTER, ANALYZE, ARCHIVE, ASC, BEFORE, BUCKET, BUCKETS, CASCADE, CHANGE, CLUSTER, CLUSTERED, CLUSTERSTATUS, COLLECTION, COLUMNS, COMMENT, COMPACT, COMPACTIONS, COMPUTE, CONCATENATE, CONTINUE, DATA, DATABASES, DATETIME, DAY, DBPROPERTIES, DEFERRED, DEFINED, DELIMITED, DEPENDENCY, DESC, DIRECTORIES, DIRECTORY, DISABLE, DISTRIBUTE, ELEM_TYPE, ENABLE, ESCAPED, EXCLUSIVE, EXPLAIN, EXPORT, FIELDS, FILE, FILEFORMAT, FIRST, FORMAT, FORMATTED, FUNCTIONS, HOLD_DDLTIME, HOUR, IDXPROPERTIES, IGNORE, INDEX, INDEXES, INPATH, INPUTDRIVER, INPUTFORMAT, ITEMS, JAR, KEYS, KEY_TYPE, LIMIT, LINES, LOAD, LOCATION, LOCK, LOCKS, LOGICAL, LONG, MAPJOIN, MATERIALIZED, METADATA, MINUS, MINUTE, MONTH, MSCK, NOSCAN, NO_DROP, OFFLINE, OPTION, OUTPUTDRIVER, OUTPUTFORMAT, OVERWRITE, OWNER, PARTITIONED, PARTITIONS, PLUS, PRETTY, PRINCIPALS, PROTECTION, PURGE, READ, READONLY, REBUILD, RECORDREADER, RECORDWRITER, REGEXP, RELOAD, RENAME, REPAIR, REPLACE, REPLICATION, RESTRICT, REWRITE, RLIKE, ROLE, ROLES, SCHEMA, SCHEMAS, SECOND, SEMI, SERDE, SERDEPROPERTIES, SERVER, SETS, SHARED, SHOW, SHOW_DATABASE, SKEWED, SORT, SORTED, SSL, STATISTICS, STORED, STREAMTABLE, STRING, STRUCT, TABLES, TBLPROPERTIES, TEMPORARY, TERMINATED, TINYINT, TOUCH, TRANSACTIONS, UNARCHIVE, UNDO, UNIONTYPE, UNLOCK, UNSET, UNSIGNED, URI, USE, UTC, UTCTIMESTAMP, VALUE_TYPE, VIEW, WHILE, YEAR
ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH, BY, CASE, CAST, CHAR, COLUMN, CONF, CREATE, CROSS, CUBE, CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DECIMAL, DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED, EXTERNAL, FALSE, FETCH, FLOAT, FOLLOWING, FOR, FROM, FULL, FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE, NONE, NOT, NULL, OF, ON, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN, PARTITION, PERCENT, PRECEDING, PRESERVE, PROCEDURE, RANGE, READS, REDUCE, REVOKE, RIGHT, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT, TABLE, TABLESAMPLE, THEN, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE, UNBOUNDED, UNION, UNIQUEJOIN, UPDATE, USER, USING, UTC_TMESTAMP, VALUES, VARCHAR, WHEN, WHERE, WINDOW, WITH
Hive 2.0.0
removed: REGEXP, RLIKE
added: AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE
added: COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START
Hive 2.1.0
added: ABORT, KEY, LAST, NORELY, NOVALIDATE, NULLS, RELY, VALIDATE
added: CACHE, CONSTRAINT, FOREIGN, PRIMARY, REFERENCES
Hive 2.2.0
added: DETAIL, DOW, EXPRESSION, OPERATOR, QUARTER, SUMMARY, VECTORIZATION, WEEK, YEARS, MONTHS, WEEKS, DAYS, HOURS, MINUTES, SECONDS
added: DAYOFWEEK, EXTRACT, FLOOR, INTEGER, PRECISION, VIEWS
Hive 3.0.0
added: TIMESTAMPTZ, ZONE
added: TIME, NUMERIC, SYNC
Version information

REGEXP and RLIKE are non-reserved keywords prior to Hive 2.0.0 and reserved keywords starting in Hive 2.0.0 (HIVE-11703).

Create Database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

Drop Database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

Alter Database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)
  
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path; -- (Note: Hive 2.2.1, 2.4.0 and later)

Use Database

USE database_name;
USE DEFAULT;
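Putting the database statements together (a minimal sketch; the database name and properties are illustrative):

CREATE DATABASE IF NOT EXISTS sales_db
  COMMENT 'example database'
  WITH DBPROPERTIES ('creator'='etl');
USE sales_db;
ALTER DATABASE sales_db SET DBPROPERTIES ('creator'='analytics');
USE DEFAULT;
DROP DATABASE IF EXISTS sales_db CASCADE;  -- CASCADE also drops the tables it contains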

Create Table



CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
array_type
  : ARRAY < data_type >
map_type
  : MAP < primitive_type, data_type >
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
column_constraint_specification:
  : [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK  [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
default_value:
  : [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ] 
constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE 
    [, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]

  • TBLPROPERTIES ("comment"="table_comment")
  • TBLPROPERTIES ("hbase.table.name"="table_name") – see HBase Integration.
  • TBLPROPERTIES ("immutable"="true") or ("immutable"="false") in release 0.13.0+ (HIVE-6406) – see Inserting Data into Hive Tables from Queries.
  • TBLPROPERTIES ("orc.compress"="ZLIB") or ("orc.compress"="SNAPPY") or ("orc.compress"="NONE") and other ORC properties – see ORC Files.
  • TBLPROPERTIES ("transactional"="true") or ("transactional"="false") in release 0.14.0+, the default is "false" – see Hive Transactions.
  • TBLPROPERTIES ("NO_AUTO_COMPACTION"="true") or ("NO_AUTO_COMPACTION"="false"), the default is "false" – see Hive Transactions.
  • TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="mapper_memory") – see Hive Transactions.
  • TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.num.threshold"="threshold_num") – see Hive Transactions.
  • TBLPROPERTIES ("compactorthreshold.hive.compactor.delta.pct.threshold"="threshold_pct") – see Hive Transactions.
  • TBLPROPERTIES ("auto.purge"="true") or ("auto.purge"="false") in release 1.2.0+ (HIVE-9118) – see Drop Table, Drop Partitions, Truncate Table, and Insert Overwrite.
  • TBLPROPERTIES ("EXTERNAL"="TRUE") in release 0.6.0+ (HIVE-1329) Change a managed table to an external table and vice versa for "FALSE".

JSON

ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
Stored as plain text file in JSON format.

The JsonSerDe for JSON files is available in Hive 0.12 and later.


In some distributions, a reference to hive-hcatalog-core.jar is required.
ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;

CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

The JsonSerDe was moved to Hive from HCatalog; before that it was in the hive-contrib project. It was added to the Hive distribution by HIVE-4895.
An Amazon SerDe is available at 
s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar for releases prior to 0.12.0.

Starting in Hive 3.0.0, JsonSerDe is added to Hive Serde as "org.apache.hadoop.hive.serde2.JsonSerDe" (HIVE-19211).
CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;
Or STORED AS JSONFILE is supported starting in Hive 4.0.0 (HIVE-19899), so you can create table as follows:
CREATE TABLE my_table(a string, b bigint, ...) STORED AS JSONFILE;
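With either JsonSerDe, each line of the underlying text file must hold one complete JSON object whose keys match the column names; illustrative data for my_table(a string, b bigint):

{"a":"hello","b":1}
{"a":"world","b":2}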
CSV/TSV

ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
Stored as plain text file in CSV / TSV format.
 
The CSVSerde is available in Hive 0.14 and greater.
The following example creates a TSV (Tab-separated) file.

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",
   "quoteChar"     = "'",
   "escapeChar"    = "\\"
)  
STORED AS TEXTFILE;
The default SerDe properties handle comma-separated (CSV) files:
 
DEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER  "
DEFAULT_SEPARATOR        ,
This SerDe works for most CSV data, but does not handle embedded newlines. To use the SerDe, specify the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde.  
Documentation is based on original documentation at https://github.com/ogrodnek/csv-serde.

Limitations
This SerDe treats all columns as type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output shows string column types. The type information is retrieved from the SerDe.

To convert columns to the desired type in a table, you can create a view over the table that does the CAST to the desired type.
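For example, for the my_table definition above (a sketch; the view name and target type are illustrative):

CREATE VIEW my_table_typed AS
  SELECT a, CAST(b AS INT) AS b FROM my_table;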


The CSV SerDe is based on https://github.com/ogrodnek/csv-serde, and was added to the Hive distribution in HIVE-7777.
The CSVSerde has been built and tested against Hive 0.14 and later, and uses Open-CSV 2.3 which is bundled with the Hive distribution.

For general information about SerDes, see 
Hive SerDe in the Developer Guide. Also see SerDe for details about input and output processing.


Storage Format
Description
STORED AS TEXTFILE
Stored as plain text files. TEXTFILE is the default file format, unless the configuration parameter hive.default.fileformat has a different setting.
Use the DELIMITED clause to read delimited files.
Enable escaping for the delimiter characters by using the 'ESCAPED BY' clause (such as ESCAPED BY '\')
Escaping is needed if you want to work with data that can contain these delimiter characters.

A custom NULL format can also be specified using the 'NULL DEFINED AS' clause (default is '\N').
STORED AS SEQUENCEFILE
Stored as compressed Sequence File.
STORED AS ORC
Stored as ORC file format. Supports ACID Transactions & Cost-based Optimizer (CBO). Stores column-level metadata.
STORED AS PARQUET
Stored as Parquet format for the Parquet columnar storage format, in Hive 0.13.0 and later; use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... syntax in Hive 0.10, 0.11, or 0.12.
STORED AS AVRO
Stored as Avro format in Hive 0.14.0 and later (see Avro SerDe).
STORED AS RCFILE
Stored as Record Columnar File format.
STORED AS JSONFILE
Stored as Json file format in Hive 4.0.0 and later.
STORED BY
Stored by a non-native table format. To create or link to a non-native table, for example a table backed by HBase or Druid or Accumulo. See StorageHandlers for more information on this option.
INPUTFORMAT and OUTPUTFORMAT
In the file_format, specify the name of a corresponding InputFormat and OutputFormat class as a string literal.

For example, 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.

For LZO compression, the values to use are
'INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"'

(see LZO Compression).


Partitioned Tables
A partition column becomes a pseudocolumn you can query on; it is not stored with the table data. If the data itself contains a column you also want to partition on, you must give the table column a different name, otherwise Hive reports an error about the column being repeated in the partitioning columns. For example, suppose your data has the columns:
id     int,
date   date,
name   varchar
To partition on date, rename the data column (users should not query on it):
Example:
create table table_name (
  id                int,
  dtDontQuery       string,
  name              string
)
partitioned by (date string)

Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 STORED AS SEQUENCEFILE;
Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '<hdfs_location>';

CTAS has these restrictions:
  • The target table cannot be an external table.
  • The target table cannot be a list bucketing table.
Example:
CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile
   AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

CREATE TABLE empty_key_value_store
LIKE key_value_store [TBLPROPERTIES (property_name=property_value, ...)];

Example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User')
 COMMENT 'This is the page view table'
 PARTITIONED BY(dt STRING, country STRING)
 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY '\001'
   COLLECTION ITEMS TERMINATED BY '\002'
   MAP KEYS TERMINATED BY '\003'
 STORED AS SEQUENCEFILE;

CREATE TABLE list_bucket_single (key STRING, value STRING)
  SKEWED BY (key) ON (1,5,6) [STORED AS DIRECTORIES];
CREATE TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING)
  SKEWED BY (col1, col2) ON (('s1',1), ('s3',3), ('s13',13), ('s78',78)) [STORED AS DIRECTORIES];

Example:
CREATE TEMPORARY TABLE list_bucket_multiple (col1 STRING, col2 int, col3 STRING);
A transactional table supports operations with ACID semantics. See Hive Transactions for more details about transactional tables.
Example:
CREATE TRANSACTIONAL TABLE transactional_table_test(key string, value string) PARTITIONED BY(ds string) STORED AS ORC;
create table pk(id1 integer, id2 integer,
  primary key(id1, id2) disable novalidate);

create table fk(id1 integer, id2 integer,
  constraint c1 foreign key(id1, id2) references pk(id2, id1) disable novalidate);
create table constraints1(id1 integer UNIQUE disable novalidate, id2 integer NOT NULL,
  usr string DEFAULT current_user(), price double CHECK (price > 0 AND price <= 1000));

create table constraints2(id1 integer, id2 integer,
  constraint c1_unique UNIQUE(id1) disable novalidate);

create table constraints3(id1 integer, id2 integer,
  constraint c1_check CHECK(id1 + id2 > 0));
Drop Table
DROP TABLE [IF EXISTS] table_name [PURGE];     -- (Note: PURGE available in Hive 0.14.0 and later)
TRUNCATE TABLE table_name [PARTITION partition_spec];

partition_spec:
  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)
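For example, using the page_view table defined earlier (the partition values are illustrative):

TRUNCATE TABLE page_view PARTITION (dt='2008-08-08', country='US');
DROP TABLE IF EXISTS page_view PURGE;  -- PURGE skips the trash, so the data cannot be restored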




