Tuesday 5 September 2017

3) Sqoop Tool 1: sqoop-import

  • The import tool imports an individual table from an RDBMS into HDFS. 
  • Each row from the table is represented as a separate record in HDFS. 
  • Records can be stored as text files (i.e., one record per line), or in binary representation as Avro or SequenceFiles.

Tool Syntax:
sqoop import (generic-args) (import tool-args)
or
sqoop-import (generic-args) (import tool-args)
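
Note: the generic (Hadoop) arguments must be passed before any tool-specific arguments. A minimal sketch, where -D property.name=property.value stands for any generic Hadoop property (a placeholder, not taken from the examples below):

hadoop@Mano:~$ sqoop import -D property.name=property.value --connect jdbc:mysql://localhost/sqoop_db --username root -P --table student -m 1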

Arguments are grouped into the following collections, organized by function:

  1. Common arguments
  2. Validation arguments
  3. Import control arguments
  4. Parameters for overriding mapping
  5. Incremental import arguments
  6. Output line formatting arguments
  7. Input parsing arguments
  8. Hive arguments
  9. HBase arguments
  10. Accumulo arguments
  11. Code generation arguments
  12. Additional import configuration properties

We will cover each collection of arguments one by one.

1. Common arguments:

Argument : Description
--connect <jdbc-uri> : Specify the JDBC connect string containing the hostname or IP address (optionally the port) followed by the database name. This argument is mandatory. Example: --connect jdbc:mysql://localhost/sqoop_db
--connection-manager <class-name> : Specify the connection manager class name (optional). Example: --connection-manager org.apache.sqoop.manager.GenericJdbcManager
--driver <class-name> : Manually specify the JDBC driver class to use. Example: com.mysql.jdbc.Driver
--hadoop-mapred-home <dir> : Override $HADOOP_MAPRED_HOME
--help : Print usage instructions
--password-file : Set the path for a file containing the authentication password
-P : Read the password from the console
--password <password> : Set the authentication password
--username <username> : Set the authentication username
--verbose : Print more information while working
--connection-param-file <filename> : Optional properties file that provides connection parameters
--relaxed-isolation : Set the connection transaction isolation to read uncommitted for the mappers

Example 1: using the arguments --connect, --connection-manager and --driver

hadoop@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --connection-manager org.apache.sqoop.manager.GenericJdbcManager --driver com.mysql.jdbc.Driver --username root --password root --table student -m 1;

Example 2: using a file containing the authentication password via the argument --password-file

Steps to create password file:

Step 1: create a password file
    hadoop@Mano:~$ echo -n "root" > .password
Note:
-n - do not output a trailing newline.

Step 2: move the password file to an HDFS location
     hadoop@Mano:~$ hadoop fs -put .password /user/hadoop

Command:
hadoop@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password-file /user/hadoop/.password --table student -m 1;

Example 3: read the password from the console

Command:
hadoop@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root -P --table student -m 1;


Example 4: using the argument --verbose, useful when more debugging output is preferred

Command:
hadoop@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table student -m 1 --verbose;

Example 5: using the argument --connection-param-file, which holds connection-related key-value pairs.

A pair such as sqoop.export.records.per.statement=1 can be saved in a separate .properties file and loaded using this argument.
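
A minimal, untested sketch of how this could look (the file name connection.properties is a placeholder; the property line is the one mentioned above):

hadoop@Mano:~$ echo "sqoop.export.records.per.statement=1" > connection.properties
hadoop@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table student -m 1 --connection-param-file connection.properties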

Status of Common arguments:

Argument : Description : Status
--connect <jdbc-uri> : Specify the JDBC connect string containing the hostname or IP address (optionally the port) followed by the database name; mandatory. : Done
--connection-manager <class-name> : Specify the connection manager class name (optional). : Done
--driver <class-name> : Manually specify the JDBC driver class to use. : Done
--hadoop-mapred-home <dir> : Override $HADOOP_MAPRED_HOME : Need to try
--help : Print usage instructions : Done
--password-file : Set the path for a file containing the authentication password : Done
-P : Read the password from the console : Done
--password <password> : Set the authentication password : Done
--username <username> : Set the authentication username : Done
--verbose : Print more information while working : Done
--connection-param-file <filename> : Optional properties file that provides connection parameters : Need to try
--relaxed-isolation : Set the connection transaction isolation to read uncommitted for the mappers : Need to try
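
For --hadoop-mapred-home and --relaxed-isolation, which are still marked "Need to try", a minimal, untested sketch of how they would be passed (the MapReduce home directory shown is only a placeholder):

hadoop@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root -P --table student -m 1 --relaxed-isolation --hadoop-mapred-home /usr/lib/hadoop-mapreduce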

2. Validation arguments:

Argument : Description
--validate : Enable validation of data copied; supports single-table copy only.
--validator <class-name> : Specify the validator class to use.
--validation-threshold <class-name> : Specify the validation threshold class to use.
--validation-failurehandler <class-name> : Specify the validation failure handler class to use.

Example 6: using argument --validate

Command:
mano@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1 --validate;

At the end of the tool's execution, we can see the status of data validation:

17/08/31 10:45:49 INFO mapreduce.JobBase: Validating the integrity of the import using the following configuration
    Validator : org.apache.sqoop.validation.RowCountValidator
    Threshold Specifier : org.apache.sqoop.validation.AbsoluteValidationThreshold
    Failure Handler : org.apache.sqoop.validation.AbortOnFailureHandler

17/08/31 10:45:49 INFO validation.RowCountValidator: Data successfully validated

Example 7: using argument --validator <class>

Command:
mano@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1 --validator org.apache.sqoop.validation.RowCountValidator;

Example 8: using argument --validation-threshold <class>

Command:
mano@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1 --validation-threshold org.apache.sqoop.validation.AbsoluteValidationThreshold;

Example 9: using argument --validation-failurehandler <class>

Command:

mano@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1 --validation-failurehandler org.apache.sqoop.validation.AbortOnFailureHandler

Validation arguments status table:

Argument : Description : Status
--validate : Enable validation of data copied; supports single-table copy only. : Done
--validator <class-name> : Specify the validator class to use. : Done
--validation-threshold <class-name> : Specify the validation threshold class to use. : Done
--validation-failurehandler <class-name> : Specify the validation failure handler class to use. : Done

3. Import control arguments:

Argument : Description
--append : Append data to an existing dataset in HDFS
--as-avrodatafile : Imports data to Avro data files
--as-sequencefile : Imports data to SequenceFiles
--as-textfile : Imports data as plain text (default)
--as-parquetfile : Imports data to Parquet files
--boundary-query <statement> : Boundary query to use for creating splits
--columns <col,col,col…> : Columns to import from the table
--delete-target-dir : Delete the import target directory if it exists
--direct : Use the direct connector if one exists for the database
--fetch-size <n> : Number of entries to read from the database at once
--inline-lob-limit <n> : Set the maximum size for an inline LOB
-m,--num-mappers <n> : Use n map tasks to import in parallel
-e,--query <statement> : Import the results of the statement
--split-by <column-name> : Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper.
--autoreset-to-one-mapper : Use one mapper if the table has no primary key and no split-by column is provided. Cannot be used with --split-by <col>.
--table <table-name> : Table to read
--target-dir <dir> : HDFS destination directory
--warehouse-dir <dir> : HDFS parent directory for the table destination
--where <where clause> : WHERE clause to use during import
-z,--compress : Enable compression
--compression-codec <c> : Use Hadoop codec (default gzip)
--null-string <null-string> : The string to be written for a null value in string columns
--null-non-string <null-string> : The string to be written for a null value in non-string columns

Example 10: using argument --append

It appends data to an existing dataset in HDFS (e.g., new insertions on top of existing data).

mano@Mano:~$ sqoop import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1 --append

Note: it will create a separate file for the new data when appending to an existing import.

Example 11:
using the file-format arguments --as-textfile (default), --as-avrodatafile, --as-sequencefile and --as-parquetfile

using default --as-textfile
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --append -m 1;

using --as-avrodatafile

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --append -m 1 --as-avrodatafile;

using --as-sequencefile

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --append -m 1 --as-sequencefile;

using --as-parquetfile

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --append -m 1 --as-parquetfile



Example 12: using the arguments --boundary-query and --split-by to control the parallel import based on a split column

The bounding values query, as it appears in the Sqoop log:

BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `employees`

Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --boundary-query 'select min(id),max(id) from employees' --split-by id  --delete-target-dir

Example 13:
Using argument --columns

Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --columns id,name -m 1 --delete-target-dir

Output: Data on HDFS after import from DB
mano@Mano:~$ hadoop fs -cat /user/mano/employees/par*
17/09/02 11:37:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,Mano
2,Prasath
3,Chella
4,Govind

Example 14:
using arguments --autoreset-to-one-mapper, --delete-target-dir

Command:

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --autoreset-to-one-mapper --delete-target-dir

Example 15:
Using the arguments --query and --target-dir (with a WHERE clause inside the query)

Note:
When using the --query argument, the target directory must be specified; otherwise the import fails.

Command:

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --query 'select s.id,e.name from employees e join students s on e.id=s.id where $CONDITIONS' --target-dir '/MANO/Sqoop_import_table' --split-by id --delete-target-dir
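
The --where argument listed above (used together with --table rather than --query) restricts which rows are imported. A minimal, untested sketch, assuming the employees table has an id column and using a placeholder target directory:

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --where 'id > 2' -m 1 --target-dir /MANO/employees_filtered --delete-target-dir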

Example 16:
Using the argument --warehouse-dir, which defines a parent directory for the import (the table directory is created under it) instead of --target-dir

Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees --warehouse-dir '/MANO/Sqoop_import_table' --split-by id --delete-target-dir

Compressing Imported Data:


Before proceeding with compression, make sure the codec is installed on Hadoop.

If not, please refer to this link ==> https://my-learnings-abouthadoop.blogspot.in/2017/09/snappy-installation-on-ubuntu.html

We need to use the --compress option along with --compression-codec and the codec class name.

Below are the class names of the compression codecs:
  1. org.apache.hadoop.io.compress.GzipCodec
  2. org.apache.hadoop.io.compress.DefaultCodec
  3. org.apache.hadoop.io.compress.BZip2Codec
  4. org.apache.hadoop.io.compress.SnappyCodec
i) Gzip Compression:

Classname: org.apache.hadoop.io.compress.GzipCodec

Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1  --compress --compression-codec org.apache.hadoop.io.compress.GzipCodec --delete-target-dir

ii) DefaultCodec:
Classname: org.apache.hadoop.io.compress.DefaultCodec

Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1  --compress --compression-codec org.apache.hadoop.io.compress.DefaultCodec --delete-target-dir

iii) BZip2 Compression:
Classname: org.apache.hadoop.io.compress.BZip2Codec

Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1  --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec --delete-target-dir

iv) Snappy Compression:

Classname: org.apache.hadoop.io.compress.SnappyCodec

Command:
mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1  --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --delete-target-dir


Import control arguments status table:


Argument Description Example
--append Append data to an existing dataset in HDFS Done
--as-avrodatafile Imports data to Avro Data Files Done
--as-sequencefile Imports data to SequenceFiles Done
--as-textfile Imports data as plain text (default) Done
--as-parquetfile Imports data to Parquet Files Done
--boundary-query <statement> Boundary query to use for creating splits Done
--columns <col,col,col…> Columns to import from table Done
--delete-target-dir Delete the import target directory if it exists Done
--direct Use direct connector if exists for the database Need to check
--fetch-size <n> Number of entries to read from database at once. Need to check
--inline-lob-limit <n> Set the maximum size for an inline LOB Need to check
-m,--num-mappers <n> Use n map tasks to import in parallel Done
-e,--query <statement> Import the results of statement. Done
--split-by <column-name> Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option. Done
--autoreset-to-one-mapper Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option. Done
--table <table-name> Table to read Done
--target-dir <dir> HDFS destination dir Done
--warehouse-dir <dir> HDFS parent for table destination Done
--where <where clause> WHERE clause to use during import Done
-z,--compress Enable compression Done
--compression-codec <c> Use Hadoop codec (default gzip) Done
--null-string <null-string> The string to be written for a null value for string columns Need to check
--null-non-string <null-string> The string to be written for a null value for non-string columns Need to check

Note:
The --null-string and --null-non-string arguments are optional. If not specified, then the string "null" will be used.
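
Both arguments are marked "Need to check" in the table above; a minimal, untested sketch of how they would be supplied (the replacement value \\N is only an illustration, often chosen for Hive compatibility):

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table employees -m 1 --null-string '\\N' --null-non-string '\\N' --delete-target-dir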

4. Parameters for overriding mapping:
Sqoop is pre-configured to map most SQL types to appropriate Java or Hive representatives. However, the default mapping might not be suitable in all cases.

It can be overridden with:

--map-column-java : for changing the mapping to Java
--map-column-hive : for changing the Hive mapping


Argument : Description
--map-column-java <mapping> : Override the mapping from SQL to Java type for configured columns.
--map-column-hive <mapping> : Override the mapping from SQL to Hive type for configured columns.

Syntax:
--map-column-java <column name>=<new type>,<column name>=<new type>,...

i) --map-column-java argument

Command:

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students --map-column-java id=Integer,name=String --autoreset-to-one-mapper
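
ii) --map-column-hive argument

This variant is marked "Need to check" in the status table below; a minimal, untested sketch, assuming it is combined with --hive-import and a Hive table named students (both are assumptions, not taken from the examples above):

mano@Mano:~$ sqoop-import --connect jdbc:mysql://localhost/sqoop_test --username root --password root --table students --hive-import --hive-table students --map-column-hive id=INT,name=STRING --autoreset-to-one-mapper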

Parameters for overriding mapping status table:


Argument : Description : Status
--map-column-java <mapping> : Override the mapping from SQL to Java type for configured columns. : Done
--map-column-hive <mapping> : Override the mapping from SQL to Hive type for configured columns. : Need to check
 
Please follow the link for further details ==> Sqoop_Page 4
