Working with Import Process (Intermediate)

Exclusive offer: get 50% off this eBook here
Instant Apache Sqoop [Instant]

Instant Apache Sqoop [Instant] — Save 50%

Transfer data efficiently between RDBMS and the Hadoop ecosystem using the robust Apache Sqoop with this book and ebook

$14.99    $7.50
by Ankit Jain | August 2013 | Open Source

This article created by Ankit Jain the author of Instant Apache Sqoop, will explain some common arguments of import process and how we can use these arguments to import RDBMS table into HDFS.

(For more resources related to this topic, see here.)

Getting ready

The first thing we need to do is to download the latest version of Sqoop from following location http://www.apache.org/dist/sqoop/ and extract it on your machine. Now I am calling the Sqoop installation dir as $SQOOP_HOME.

Given here are the prerequisites for Sqoop import process.

  • Installed and running Relational Database Management System (MySQL).

  • Installed and running Hadoop Cluster.

  • Set $HADOOP_HOME environment variable.

Following are the common arguments of import process.

Parameters

Description

--connect <jdbc-uri>

This command specifies the server or database to connect. It also specifies the port.

Example:

--connect jdbc:mysql://host:port/databaseName

--connection-manager <class-name>

 

Specify connection manager class name.

 

--driver <class-name>

Specify the fully qualified name of JDBC driver class.

--password <password>

Set authentication password required to connect to input source.

--username <username>

Set authentication username.

How to do it

Let’s see how to work with import process

First, we will start with import single RDBMS table into Hadoop.

Query1:

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1
--username root --password password --table tableName
--target-dir /user/abc/tableName

The content of output file in HDFS will look like:

Next, we will put some light on approach of import only selected rows and selected columns of RDBMS table into Hadoop.

Query 2: Import selected columns

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1
--username root --password password --table student --target-
dir /user/abc/student --columns “student_id,address,name”

Query 3: Import selected rows.

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1
--username root --password password --table student --target-
dir /user/abc/student --where ‘student_id<100’

Query 4: Import selected columns of selected rows.

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1
--username root --password password --table student --target-
dir /user/abc/student --columns “student_id,address,name”
-- where ‘student_id<100’

How it works…

Now let’s see how the above steps work:

Import single table:

Apart from the common arguments of import process, as explained previously, this part covers some other arguments which are required to import a table into Hadoop Distributed File System.

Parameters

Description

--table <table-name>

Name of input table to fetch.

--target-dir<dir>

Location of output/target dir in HDFS.

--direct

If user want to use non-JDBC based access mechanism for faster database access

--options-file <file-path>

All the command line options that are common in most of commands can put in options file for convenience. 

The Query1 will run a MapReduce job and import all the rows of given table to HDFS (where, /user/abc/tableName is the location of output files). The records imported in HDFS preserve their original columns order, which means, if input table contains four columns A, B, C and D, then content in HDFS file will look like:

A1, B1, C1, D1 A2, B2, C2, D2

Import selected columns:

By default, the import query will select all columns of input table for import, but we can select the subset of columns by specifying the comma separated list of columns in --columns argument.

The Query2 will only fetch three columns (student_id, address and name) of student table. If import query contains the --columns argument, then the order of column in output files are same as order specified in --columns argument. The output in HDFS will look like:

student_id, address, name 1, Delhi, XYZ 2, Mumbai, PQR ..........

If the input query contains the column in following order -- “address, name, student_id”, then the output in HDFS will look like.

address, name, student_id Delhi, XYZ, 1 Mumbai, PQR, 2 .............

Import selected rows:

By default, all the rows of input table will be imported to HDFS, but we can control which rows need to be import by using a --where argument in the import statement.

The Query3 will import only those rows into HDFS which has value of “student_id” column greater than 100.

The Query4 use both --columns and --where arguments in one statement. For Query4, Sqoop will internally generates the query of the form “select student_id, address, name from student where student_id<100”.

There’s more...

This section covers some more examples of import process.

Import all tables:

So far we have imported a single table into HDFS, this section introduces an import-all-tables tool, by which we can import a set of tables from an RDBMS to HDFS. The import-all-tables tool creates a separate directory in HDFS for each RDBMS table. The following are the mandatory conditions for import-all-tables tool:

All tables must have a single primary key column. User must intend to import all the columns of each table. No --where, --columns and --query arguments are permitted.

Example:

Query 5:

$ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1
--username root --password password

This query will import all tables (tableName and tableName1) of database db1 into HDFS.

Output directories in HDFS look like:

Summary

We learned a lot in this article, about import single RDBMS table into HDFS, import selected columns and selected rows, and import set of RDBMS tables.

Resources for Article :


Further resources on this subject:


Instant Apache Sqoop [Instant] Transfer data efficiently between RDBMS and the Hadoop ecosystem using the robust Apache Sqoop with this book and ebook
Published: August 2013
eBook Price: $14.99
See more
Select your format and quantity:

About the Author :


Ankit Jain

Ankit Jain is a software professional with over two years of experience in implementing, designing, and managing Big Data solutions for industry leaders. His core skills include Hadoop, HBase, Hive, Sqoop, Flume, Elasticsearch, Machine Learning, Kafka, Storm, Java, and J2EE. He is currently employed with Impetus Infotech Pvt Ltd.

He is an active blogger and can be followed at http://ankitasblogger.blogspot.in/.

Books From Packt


Apache Solr 4 Cookbook
Apache Solr 4 Cookbook

 Instant Apache Solr for Indexing Data How-to [Instant]
Instant Apache Solr for Indexing Data How-to [Instant]

 Apache Maven 3 Cookbook
Apache Maven 3 Cookbook

Apache CloudStack Cloud Computing
Apache CloudStack Cloud Computing

Apache Flume: Distributed Log Collection for Hadoop
Apache Flume: Distributed Log Collection for Hadoop

 Apache Solr 3 Enterprise Search Server
Apache Solr 3 Enterprise Search Server

 Instant Apache ServiceMix How-to [Instant]
Instant Apache ServiceMix How-to [Instant]

 Instant Apache Stanbol [Instant]
Instant Apache Stanbol [Instant]


No votes yet

Post new comment

CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
L
3
b
L
F
K
Enter the code without spaces and pay attention to upper/lower case.
Code Download and Errata
Packt Anytime, Anywhere
Register Books
Print Upgrades
eBook Downloads
Video Support
Contact Us
Awards Voting Nominations Previous Winners
Judges Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software
Resources
Open Source CMS Hall Of Fame CMS Most Promising Open Source Project Open Source E-Commerce Applications Open Source JavaScript Library Open Source Graphics Software