Monday, September 30, 2019
Sunday, September 29, 2019
HDFS commands and architecture
All hadoop commands are invoked by the bin/hadoop script
hadoop fs –help hadoop fs –help lshadoop fs –usage ls $ hadoop fs –ls / hadoop fs –ls
hadoop fs –ls –R /user/cloudera
hadoop fs –mkdir /user/cloudera/dezyre1
sudo –u hdfs hadoop fs –mkdir /dezyre
hadoop fs –copyFromLocal Sample1.txt /user/cloudera/dezyre1
hadoop fs –put Sample2.txt /user/cloudera/dezyre1
hadoop fs –moveFromLocal Sample3.txt /user/cloudera/dezyre1
$ hadoop fs –du /user/cloudera/dezyre1
hadoop fs –df
hadoop fs –expunge
hadoop fs –cat /user/cloudera/dezyre1/Sample1.txt
hadoop fs –cp /user/cloudera/dezyre/war_and_peace /user/cloudera/dezyre1/
hadoop fs –mv /user/cloudera/dezyre1/Sample1.txt /user/cloudera/dezyre/
hadoop fs –rm -r /user/cloudera/dezyre3
hadoop fs –rm /user/cloudera/dezyre3
hadoop fs -tail /user/cloudera/dezyre/war_and_peace
hadoop fs -tail /user/cloudera/dezyre/war_and_peace
hadoop fs –copyToLocal /user/cloudera/dezyre1/Sample1.txt /home/cloudera/hdfs_bkp/
hadoop fs –get /user/cloudera/dezyre1/Sample2.txt /home/cloudera/hdfs_bkp/
hadoop fs –touchz /user/cloudera/dezyre1/Sample4.txt
hadoop fs –setrep –w 1 /user/cloudera/dezyre1/Sample1.txt
sudo –u hdfs hadoop fs –chgrp –R cloudera /dezyre
sudo –u hdfs hadoop fs –chown –R cloudera /dezyre
hadoop fs –chmod /dezyre
data merge.txt
hadoop fs –help hadoop fs –help lshadoop fs –usage ls $ hadoop fs –ls / hadoop fs –ls
hadoop fs –ls –R /user/cloudera
hadoop fs –mkdir /user/cloudera/dezyre1
sudo –u hdfs hadoop fs –mkdir /dezyre
hadoop fs –copyFromLocal Sample1.txt /user/cloudera/dezyre1
hadoop fs –put Sample2.txt /user/cloudera/dezyre1
hadoop fs –moveFromLocal Sample3.txt /user/cloudera/dezyre1
$ hadoop fs –du /user/cloudera/dezyre1
hadoop fs –df
hadoop fs –expunge
hadoop fs –cat /user/cloudera/dezyre1/Sample1.txt
hadoop fs –cp /user/cloudera/dezyre/war_and_peace /user/cloudera/dezyre1/
hadoop fs –mv /user/cloudera/dezyre1/Sample1.txt /user/cloudera/dezyre/
hadoop fs –rm -r /user/cloudera/dezyre3
hadoop fs –rm /user/cloudera/dezyre3
hadoop fs -tail /user/cloudera/dezyre/war_and_peace
hadoop fs -tail /user/cloudera/dezyre/war_and_peace
hadoop fs –copyToLocal /user/cloudera/dezyre1/Sample1.txt /home/cloudera/hdfs_bkp/
hadoop fs –get /user/cloudera/dezyre1/Sample2.txt /home/cloudera/hdfs_bkp/
hadoop fs –touchz /user/cloudera/dezyre1/Sample4.txt
hadoop fs –setrep –w 1 /user/cloudera/dezyre1/Sample1.txt
sudo –u hdfs hadoop fs –chgrp –R cloudera /dezyre
sudo –u hdfs hadoop fs –chown –R cloudera /dezyre
hadoop fs –chmod /dezyre
data merge.txt
Saturday, September 28, 2019
SQOOP Expert
Sqoop:
Import and export from/to RDBMS and Mainframe.
sqoop-list-databases
sqoop-list-tables
sqoop-eval
sqoop help
$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import mainframe datasets to HDFS
list-databases List available databases on a server
list-tables List available tables in a database
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
sqoop-import sqoop-export
$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect Specify JDBC connect string
--connect-manager Specify connection manager class to use
--driver Manually specify JDBC driver class to use
--hadoop-mapred-home
Override $HADOOP_MAPRED_HOME
--help Print usage instructions
--password-file Set path for file containing authentication password
-P Read password from console
--password Set authentication password
--username Set authentication username
--verbose Print more information while working
--hadoop-home Deprecated. Override $HADOOP_HOME
[...]
Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf specify an application configuration file
-D use value for given property
-fs specify a namenode
-jt specify a job tracker
-files specify comma separated files to be copied to the map reduce cluster
-libjars specify comma separated jar files to include in the classpath.
-archives specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
--options-file
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
$ sqoop --options-file /users/homer/work/import.txt --table TEST
where the options file /users/homer/work/import.txt contains the following:
import
--connect
jdbc:mysql://localhost/db
--username
foo
sqoop import --connect jdbc:mysql://database.example.com/employees
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username venkatesh --password-file ${user.home}/.password
echo -n "secret" > password.file.
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults $ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
The option
--skip-dist-cache
--as-textfile
You can compress your data by using the deflate (gzip) algorithm with the
$ sqoop import --connect --table SomeTable --package-name com.foocorp
$ sqoop import --table SomeTable --jar-file mydatatypes.jar \
--class-name SomeTableType sqoop import -D property.name=property.value …
$ sqoop-import-all-tables (generic-args) (import-args)
$ sqoop import-mainframe --connect z390 \
--username david --password-file ${user.home}/.password
$ sqoop import-mainframe --connnect --dataset foo --warehouse-dir /shared \
… $ sqoop import-mainframe --connnect --dataset foo --target-dir /dest \
sqoop import ... --null-string '\\N' --null-non-string '\\N
$ sqoop import-mainframe --connect --dataset SomePDS --package-name com.foocorp
$ sqoop import-mainframe --connect --dataset SomePDS --package-name com.foocorp
$ sqoop import-mainframe --connect --dataset SomePDS --package-name com.foocorp
$ sqoop import-mainframe --connect z390 --dataset EMPLOYEES \
--hive-import
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
CREATE TABLE foo(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);
0,this is a test,42
1,some more data,100
… Running
sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data --validate $ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc \
--export-dir /results/bar_da ValidationThreshold - Determines if the error margin between the source and target are acceptable: Absolute, Percentage Tolerant, etc. Default
$ sqoop merge (generic-args) (merge-args)
$ sqoop-merge (generic-args) (merge-args)
sqoop merge --new-data newer --onto older --target-dir merged \
--jar-file datatypes.jar --class-name Foo --merge-key id
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
sqoop codegen --connect jdbc:mysql://db.example.com/corp \
--table employees
$ sqoop create-hive-table (generic-args) (create-hive-table-args)
$ sqoop-create-hive-table (generic-args) (create-hive-table-args)
sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
--table employees --hive-table emps
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
sqoop-list-databases
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args) sqoop list-databases --connect jdbc:mysql://database.example.com/
information_schema
employees sqoop-list-tables$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args) $ sqoop list-tables --connect jdbc:mysql://database.example.com/corp
employees
payroll_checks
job_descriptions
office_supplies $ sqoop list-tables --connect jdbc:postgresql://localhost/corp --username name -P -- --schema payrolldept
employees
expenses $ sqoop version
$ sqoop-version sqoop version
Sqoop {revnumber}
git commit id 46b3e06b79a8411320d77c984c3030db47dd1c22
Compiled by aaron@jargon on Mon May 17 13:43:22 PDT 2010
Supported Databases
$ sqoop import --table foo \
--connect jdbc:mysql://db.example.com/someDb?zeroDateTimeBehavior=round Sqoop’s direct mode does not support imports of
$ sqoop import -D oracle.sessionTimeZone=America/Los_Angeles \
--connect jdbc:oracle:thin:@//db.example.com/foo --table bar
Import and export from/to RDBMS and Mainframe.
sqoop-list-databases
sqoop-list-tables
sqoop-eval
sqoop help
$ sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import mainframe datasets to HDFS
list-databases List available databases on a server
list-tables List available tables in a database
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
sqoop-import sqoop-export
$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect
--connect-manager
--driver
--hadoop-mapred-home
--help Print usage instructions
--password-file Set path for file containing authentication password
-P Read password from console
--password
--username
--verbose Print more information while working
--hadoop-home
[...]
Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf
-D
-fs
-jt
-files
-libjars
-archives
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
--options-file
$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
$ sqoop --options-file /users/homer/work/import.txt --table TEST
where the options file /users/homer/work/import.txt contains the following:
import
--connect
jdbc:mysql://localhost/db
--username
foo
Table 1. Common arguments
Argument | Description |
---|---|
--connect
| Specify JDBC connect string |
--connection-manager
| Specify connection manager class to use |
--driver
| Manually specify JDBC driver class to use |
--hadoop-mapred-home | Override $HADOOP_MAPRED_HOME |
--help
| Print usage instructions |
--password-file
| Set path for a file containing the authentication password |
-P
| Read password from console |
--password
| Set authentication password |
--username
| Set authentication username |
--verbose
| Print more information while working |
--connection-param-file
| Optional properties file that provides connection parameters |
--relaxed-isolation
| Set connection transaction isolation to read uncommitted for the mappers. |
sqoop import --connect jdbc:mysql://database.example.com/employees
$ sqoop import --connect jdbc:mysql://database.example.com/employees \
--username venkatesh --password-file ${user.home}/.password
echo -n "secret" > password.file.
Table 2. Validation arguments More Details
Argument | Description |
---|---|
--validate
| Enable validation of data copied, supports single table copy only. |
--validator
| Specify validator class to use. |
--validation-threshold
| Specify validation threshold class to use. |
--validation-failurehandler
| Specify validation failure handler class to use. |
Table 3. Import control arguments:
Argument | Description |
---|---|
--append
| Append data to an existing dataset in HDFS |
--as-avrodatafile
| Imports data to Avro Data Files |
--as-sequencefile
| Imports data to SequenceFiles |
--as-textfile
| Imports data as plain text (default) |
--as-parquetfile
| Imports data to Parquet Files |
--boundary-query
| Boundary query to use for creating splits |
--columns | |
Columns to import from table | |
--delete-target-dir
| Delete the import target directory if it exists |
--direct
| Use direct connector if exists for the database |
--fetch-size
| Number of entries to read from database at once. |
--inline-lob-limit
| Set the maximum size for an inline LOB |
-m,--num-mappers
| Use n map tasks to import in parallel |
-e,--query
| Import the results of statement .
|
--split-by
| Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.
|
--autoreset-to-one-mapper
| Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by |
--table
| Table to read |
--target-dir | HDFS destination dir |
--warehouse-dir | HDFS parent for table destination |
--where
| WHERE clause to use during import |
-z,--compress
| Enable compression |
--compression-codec
| Use Hadoop codec (default gzip) |
--null-string
| The string to be written for a null value for string columns |
--null-non-string
| The string to be written for a null value for non-string columns |
The
$ sqoop import \--null-string
and --null-non-string
arguments are optional.\
If not specified, then the string "null" will be used.--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults $ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults
Note |
---|
If you are issuing the query wrapped with double quotes ("),
you will have to use
\$CONDITIONS instead of just $CONDITIONS
to disallow your shell from treating it as a shell variable.
For example, a double quoted query may look like:
"SELECT * FROM x WHERE a='foo' AND \$CONDITIONS" |
--autoreset-to-one-mapper
is typically used with the import-all-tables
tool to automatically handle tables without a primary key in a schema.--skip-dist-cache
Argument | Description |
---|---|
--map-column-java
| Override mapping from SQL to Java type for configured columns. |
--map-column-hive
| Override mapping from SQL to Hive type for configured columns. |
Sqoop is expecting comma separated list of mapping in form =. For example:
$ sqoop import ... --map-column-java id=String,value=Integer
Table 5. Incremental import arguments:
Argument | Description |
---|---|
--check-column (col)
| Specifies the column to be examined when determining which rows to import. (the column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/ LONGVARCHAR/LONGNVARCHAR) |
--incremental (mode)
| Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified .
|
--last-value (value)
| Specifies the maximum value of the check column from the previous import. |
You can compress your data by using the deflate (gzip) algorithm with the
-z
or --compress
argument, or specify any Hadoop compression codec using the
--compression-codec
argument. This applies to SequenceFile, text,
and Avro files.
Output line formatting arguments:
Argument | Description |
---|---|
--enclosed-by
| Sets a required field enclosing character |
--escaped-by
| Sets the escape character |
--fields-terminated-by
| Sets the field separator character |
--lines-terminated-by
| Sets the end-of-line character |
--mysql-delimiters
| Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
|
--optionally-enclosed-by
| Sets a field enclosing character |
Hive arguments:
Argument | Description |
---|---|
--hive-home | Override $HIVE_HOME
|
--hive-import
| Import tables into Hive (Uses Hive’s default delimiters if none are set.) |
--hive-overwrite
| Overwrite existing data in the Hive table. |
--create-hive-table
| If set, then the job will fail if the target hive |
table exits. By default this property is false. | |
--hive-table
| Sets the table name to use when importing to Hive. |
--hive-drop-import-delims
| Drops \n, \r, and \01 from string fields when importing to Hive. |
--hive-delims-replacement
| Replace \n, \r, and \01 from string fields with user defined string when importing to Hive. |
--hive-partition-key
| Name of a hive field to partition are sharded on |
--hive-partition-value
| String-value that serves as partition key for this imported into hive in this job. |
--map-column-hive
| Override default mapping from SQL type to Hive type for configured columns. |
--hive-import--hive-overwrite
sqoop import ... --null-string '\\N' --null-non-string '\\N'
Argument | Description |
---|---|
--column-family
| Sets the target column family for the import |
--hbase-create-table
| If specified, create missing HBase tables |
--hbase-row-key | |
Specifies which input column to use as the row key | |
In case, if input table contains composite | |
key, then | |
comma-separated list of composite key | |
attributes | |
--hbase-table
| Specifies an HBase table to use as the target instead of HDFS |
--hbase-bulkload
| Enables bulk loading |
Table 10. Accumulo arguments:
Argument | Description |
---|---|
--accumulo-table
| Specifies an Accumulo table to use as the target instead of HDFS |
--accumulo-column-family
| Sets the target column family for the import |
--accumulo-create-table
| If specified, create missing Accumulo tables |
--accumulo-row-key | |
Specifies which input column to use as the row key | |
--accumulo-visibility
| (Optional) Specifies a visibility token to apply to all rows inserted into Accumulo. Default is the empty string. |
--accumulo-batch-size
| (Optional) Sets the size in bytes of Accumulo’s write buffer. Default is 4MB. |
--accumulo-max-latency
| (Optional) Sets the max latency in milliseconds for the Accumulo batch writer. Default is 0. |
--accumulo-zookeepers
| Comma-separated list of Zookeeper servers used by the Accumulo instance |
--accumulo-instance
| Name of the target Accumulo instance |
--accumulo-user
| Name of the Accumulo user to import as |
--accumulo-password
| Password for the Accumulo user |
Argument | Description |
---|---|
--bindir | Output directory for compiled objects |
--class-name
| Sets the generated class name. This overrides --package-name . When combined with --jar-file , sets the input class.
|
--jar-file
| Disable code generation; use specified jar |
--outdir | Output directory for generated code |
--package-name
| Put auto-generated classes in this package |
--map-column-java
| Override default mapping from SQL type to Java type for configured columns. |
--class-name SomeTableType sqoop import -D property.name=property.value …
A basic import of a table named
EMPLOYEES
in the corp
database:$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
A basic import requiring a login:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --username SomeUser -P Enter password: (hidden)
Selecting specific columns from the
EMPLOYEES
table:$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --columns "employee_id,first_name,last_name,job_title"
Controlling the import parallelism (using 8 parallel tasks):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ -m 8
Storing data in SequenceFiles, and setting the generated class name to
com.foocorp.Employee
:$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --class-name com.foocorp.Employee --as-sequencefile
Specifying the delimiters to use in a text-mode import:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --fields-terminated-by '\t' --lines-terminated-by '\n' \ --optionally-enclosed-by '\"'
Importing the data to Hive:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --hive-import
Importing only new employees:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --where "start_date > '2010-01-01'"
Changing the splitting column from the default:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --split-by dept_id
Verifying that an import was successful:
$ hadoop fs -ls EMPLOYEES Found 5 items drwxr-xr-x - someuser somegrp 0 2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs -rw-r--r-- 1 someuser somegrp 2913511 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000 -rw-r--r-- 1 someuser somegrp 1683938 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001 -rw-r--r-- 1 someuser somegrp 7245839 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002 -rw-r--r-- 1 someuser somegrp 7842523 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003 $ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10 0,joe,smith,engineering 1,jane,doe,marketing ...
Performing an incremental import of new data, after having already
imported the first 100,000 rows of a table:
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \ --where "id > 100000" --target-dir /incremental_dataset --append
An import of a table named
EMPLOYEES
in the corp
database that uses
validation to validate the import using the table row count and number of
rows copied into HDFS:
More Details$ sqoop import --connect jdbc:mysql://db.foo.com/corp \ --table EMPLOYEES --validate$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args)
Table 13. Common arguments
Argument | Description |
---|---|
--connect
| Specify JDBC connect string |
--connection-manager
| Specify connection manager class to use |
--driver
| Manually specify JDBC driver class to use |
--hadoop-mapred-home | Override $HADOOP_MAPRED_HOME |
--help
| Print usage instructions |
--password-file
| Set path for a file containing the authentication password |
-P
| Read password from console |
--password
| Set authentication password |
--username
| Set authentication username |
--verbose
| Print more information while working |
--connection-param-file
| Optional properties file that provides connection parameters |
--relaxed-isolation
| Set connection transaction isolation to read uncommitted for the mappers. |
Table 14. Import control arguments:
Argument | Description |
---|---|
--as-avrodatafile
| Imports data to Avro Data Files |
--as-sequencefile
| Imports data to SequenceFiles |
--as-textfile
| Imports data as plain text (default) |
--as-parquetfile
| Imports data to Parquet Files |
--direct
| Use direct import fast path |
--inline-lob-limit
| Set the maximum size for an inline LOB |
-m,--num-mappers
| Use n map tasks to import in parallel |
--warehouse-dir | HDFS parent for table destination |
-z,--compress
| Enable compression |
--compression-codec
| Use Hadoop codec (default gzip) |
--exclude-tables
| Comma separated list of tables to exclude from import process |
--autoreset-to-one-mapper
| Import should use one mapper if a table with no primary key is encountered |
--enclosed-by
| Sets a required field enclosing character |
--escaped-by
| Sets the escape character |
--fields-terminated-by
| Sets the field separator character |
--lines-terminated-by
| Sets the end-of-line character |
--mysql-delimiters
| Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: '
|
--optionally-enclosed-by
| Sets a field enclosing character |
Table 16. Input parsing arguments:
Argument | Description |
---|---|
--input-enclosed-by
| Sets a required field encloser |
--input-escaped-by
| Sets the input escape character |
--input-fields-terminated-by
| Sets the input field separator |
--input-lines-terminated-by
| Sets the input end-of-line character |
--input-optionally-enclosed-by
| Sets a field enclosing character |
Table 17. Hive arguments:
Argument | Description |
---|---|
--hive-home | Override $HIVE_HOME
|
--hive-import
| Import tables into Hive (Uses Hive’s default delimiters if none are set.) |
--hive-overwrite
| Overwrite existing data in the Hive table. |
--create-hive-table
| If set, then the job will fail if the target hive |
table exits. By default this property is false. | |
--hive-table
| Sets the table name to use when importing to Hive. |
--hive-drop-import-delims
| Drops \n, \r, and \01 from string fields when importing to Hive. |
--hive-delims-replacement
| Replace \n, \r, and \01 from string fields with user defined string when importing to Hive. |
--hive-partition-key
| Name of a hive field to partition are sharded on |
--hive-partition-value
| String-value that serves as partition key for this imported into hive in this job. |
--map-column-hive
| Override default mapping from SQL type to Hive type for configured columns. |
Table 18. Code generation arguments:
Argument | Description |
---|---|
--bindir | Output directory for compiled objects |
--jar-file
| Disable code generation; use specified jar |
--outdir | Output directory for generated code |
--package-name
| Put auto-generated classes in this package |
Import all tables from the
corp
database:$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
Verifying that it worked:
$ hadoop fs -ls Found 4 items drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/EMPLOYEES drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/PAYCHECKS drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/DEPARTMENTS drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/OFFICE_SUPPLIES
--username david --password-file ${user.home}/.password
… $ sqoop import-mainframe --connnect
--hive-import
$ sqoop-export (generic-args) (export-args)
Export control arguments:
Argument | Description |
---|---|
--columns | |
Columns to export to table | |
--direct
| Use direct export fast path |
--export-dir | HDFS source path for the export |
-m,--num-mappers
| Use n map tasks to export in parallel |
--table
| Table to populate |
--call
| Stored Procedure to call |
--update-key
| Anchor column to use for updates. Use a comma separated list of columns if there are more than one column. |
--update-mode
| Specify how updates are performed when new rows are found with non-matching keys in database. |
Legal values for mode include updateonly (default) and allowinsert .
| |
--input-null-string
| The string to be interpreted as null for string columns |
--input-null-non-string
| The string to be interpreted as null for non-string columns |
--staging-table
| The table in which data will be staged before being inserted into the destination table. |
--clear-staging-table
| Indicates that any data present in the staging table can be deleted. |
--batch
| Use batch mode for underlying statement execution. |
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);
1,some more data,100
… Running
sqoop-export --table foo --update-key id --export-dir
/path/to/data --connect …
Input parsing arguments:
Argument | Description |
---|---|
--input-enclosed-by
| Sets a required field encloser |
--input-escaped-by
| Sets the input escape character |
--input-fields-terminated-by
| Sets the input field separator |
--input-lines-terminated-by
| Sets the input end-of-line character |
--input-optionally-enclosed-by
| Sets a field enclosing character |
Failed Exports
Exports may fail for a number of reasons:
- Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
-
Attempting to
INSERT
a row which violates a consistency constraint (for example, inserting a duplicate primary key value) - Attempting to parse an incomplete or malformed record from the HDFS source data
- Attempting to parse records using incorrect delimiters
- Capacity issues (such as insufficient RAM or disk space)
--export-dir /results/bar_data --validate $ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc \
--export-dir /results/bar_da ValidationThreshold - Determines if the error margin between the source and target are acceptable: Absolute, Percentage Tolerant, etc. Default
ValidationFailureHandler - Responsible for handling failures: log an
error/warning, abort, etc.
Default implementation is LogOnFailureHandler that logs a warning message to
the configured logger.
Validator - Drives the validation logic by delegating the decision to
ValidationThreshold and delegating failure handling to ValidationFailureHandler.
The default implementation is RowCountValidator which validates the row
counts from source and the target.
Validator.
Property: validator Description: Driver for validation, must implement org.apache.sqoop.validation.Validator Supported values: The value has to be a fully qualified class name. Default value: org.apache.sqoop.validation.RowCountValidator
Validation Threshold.
Property: validation-threshold Description: Drives the decision based on the validation meeting the threshold or not. Must implement org.apache.sqoop.validation.ValidationThreshold Supported values: The value has to be a fully qualified class name. Default value: org.apache.sqoop.validation.AbsoluteValidationThreshold
Validation Failure Handler.
Property: validation-failurehandler Description: Responsible for handling failures, must implement org.apache.sqoop.validation.ValidationFailureHandler Supported values: The value has to be a fully qualified class name. Default value: org.apache.sqoop.validation.AbortOnFailureHandler
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES --validate $ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data --validate sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--validate --validator org.apache.sqoop.validation.RowCountValidator \
--validation-threshold \
org.apache.sqoop.validation.AbsoluteValidationThreshold \
--validation-failurehandler \
org.apache.sqoop.validation.AbortOnFailureHandler $ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
--table mytable sqoop job --list
Available jobs:
myjob sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
… sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
…
$ sqoop metastore (generic-args) (metastore-args)
$ sqoop-metastore (generic-args) (metastore-args)
Clients should connect to the metastore by specifying
--table EMPLOYEES --validate $ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data --validate sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--validate --validator org.apache.sqoop.validation.RowCountValidator \
--validation-threshold \
org.apache.sqoop.validation.AbsoluteValidationThreshold \
--validation-failurehandler \
org.apache.sqoop.validation.AbortOnFailureHandler $ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
Job management options:
Argument | Description |
---|---|
--create
| Define a new saved job with the specified job-id (name). A second Sqoop command-line, separated by a -- should be specified; this defines the saved job.
|
--delete
| Delete a saved job. |
--exec
| Given a job defined with --create , run the saved job.
|
--show
| Show the parameters for a saved job. |
--list
| List all saved jobs |
--table mytable sqoop job --list
Available jobs:
myjob sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
… sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
…
The
exec
action allows you to override arguments of the saved job
by supplying them after a --
. For example, if the database were
changed to require a username, we could specify the username and
password with:$ sqoop job --exec myjob -- --username someuser -P
Enter password:
…
The
exec
action allows you to override arguments of the saved job
by supplying them after a --
. For example, if the database were
changed to require a username, we could specify the username and
password with:$ sqoop job --exec myjob -- --username someuser -P Enter password: …
Common options:
Argument | Description |
---|---|
--help
| Print usage instructions |
--verbose
| Print more information while working |
$ sqoop-metastore (generic-args) (metastore-args)
Argument | Description |
---|---|
--shutdown
| Shuts down a running metastore instance on the same machine. |
sqoop.metastore.client.autoconnect.url
or --meta-connect
with the
value jdbc:hsqldb:hsql://:/sqoop
. For example,
jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop
.$ sqoop-merge (generic-args) (merge-args)
Merge options:
Argument | Description |
---|---|
--class-name
| Specify the name of the record-specific class to use during the merge job. |
--jar-file
| Specify the name of the jar to load the record class from. |
--merge-key | |
Specify the name of a column to use as the merge key. | |
--new-data
| Specify the path of the newer dataset. |
--onto
| Specify the path of the older dataset. |
--target-dir
|
Specify the target path for the output of the merge job.
|
--jar-file datatypes.jar --class-name Foo --merge-key id
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop-codegen (generic-args) (codegen-args)
sqoop codegen --connect jdbc:mysql://db.example.com/corp \
--table employees
$ sqoop create-hive-table (generic-args) (create-hive-table-args)
$ sqoop-create-hive-table (generic-args) (create-hive-table-args)
sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
--table employees --hive-table emps
$ sqoop eval (generic-args) (eval-args)
$ sqoop-eval (generic-args) (eval-args)
SQL evaluation arguments:
Argument | Description |
---|---|
-e,--query
| Execute statement in SQL.
|
sqoop eval --connect jdbc:mysql://db.example.com/corp \ --query "SELECT * FROM employees LIMIT 10"
Insert a row into the
foo
table:$ sqoop eval --connect jdbc:mysql://db.example.com/corp \ -e "INSERT INTO foo VALUES(42, 'bar')"
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop-list-databases (generic-args) (list-databases-args) sqoop list-databases --connect jdbc:mysql://database.example.com/
information_schema
employees sqoop-list-tables$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop-list-tables (generic-args) (list-tables-args) $ sqoop list-tables --connect jdbc:mysql://database.example.com/corp
employees
payroll_checks
job_descriptions
office_supplies $ sqoop list-tables --connect jdbc:postgresql://localhost/corp --username name -P -- --schema payrolldept
employees
expenses $ sqoop version
$ sqoop-version sqoop version
Sqoop {revnumber}
git commit id 46b3e06b79a8411320d77c984c3030db47dd1c22
Compiled by aaron@jargon on Mon May 17 13:43:22 PDT 2010
Database | version | --direct support?
| connect string matches |
---|---|---|---|
HSQLDB | 1.8.0+ | No | jdbc:hsqldb:*//
|
MySQL | 5.0+ | Yes | jdbc:mysql://
|
Oracle | 10.2.0+ | No | jdbc:oracle:*//
|
PostgreSQL | 8.3+ | Yes (import only) | jdbc:postgresql://
|
CUBRID | 9.2+ | NO |
$ sqoop import --table foo \
--connect jdbc:mysql://db.example.com/someDb?zeroDateTimeBehavior=round Sqoop’s direct mode does not support imports of
BLOB
, CLOB
, or
LONGVARBINARY
column$ sqoop import -D oracle.sessionTimeZone=America/Los_Angeles \
--connect jdbc:oracle:thin:@//db.example.com/foo --table bar
Subscribe to:
Comments (Atom)
Python Challenges Program
Challenges program: program 1: #Input :ABAABBCA #Output: A4B3C1 str1="ABAABBCA" str2="" d={} for x in str1: d[x]=d...
-
Conditional Functions Return Type Name(Signature) Description T if(boolean testCondition, T valueTrue, T valueFalseOrN...
-
PYSPARK Regular Expression Operations read data from hdfs data is unstructured text data we have to clean the data(regular expressio...
-
Alter Table Rename Table ALTER TABLE table_name RENAME TO new_table_name; Alter Table Properties ALTER TABLE table_name SET TBLPROPERTI...