Chapter 7. Running Sqoop Jobs

In this chapter, we will see how to run Sqoop jobs from Oozie. Sqoop (SQL to Hadoop) is used to import and export data between different database systems and the Hadoop platform.

In this chapter, we will:

  • Run Sqoop jobs from the command line

  • Create an Oozie Workflow for Sqoop actions

  • Run Sqoop jobs from Coordinators

From the concepts point of view, we will:

  • Understand the concept of HCatalog Datasets

  • Understand HCatalog Coordinator and EL functions

Chapter case study


Let's add a twist to the rainfall use case we solved in the previous chapter. Instead of getting CSV files for the rainfall data, we need to import the rainfall data from a MySQL database and then move on to processing.

As the first step of the analysis, we need to bring the data into Hadoop using Sqoop. To do this, we will run a Sqoop import at the end of each day to get the data onto Hadoop, and then we will run our Pig script to process it and save the results to Hive.

Just like in the previous chapters, we will start with the command-line option to trigger jobs, then learn about the Sqoop action and how to schedule it via a Coordinator. Lastly, we will cover the concept of HCatalog Datasets. Let's get started.

Running Sqoop command line


The syntax for the Oozie Sqoop command-line execution is shown in the following screenshot:

[Screenshot: Sqoop command line]

Let's import all the records from the table to HDFS.

Note

For the sample MySQL database preparation, I have created a script at <BOOK_CODE_HOME>/ch07/sqoop_commandline/loadToMySQL.sh, which you can use to create a database to test the Sqoop import.

The database name is rainfall and the table is rainfall_data. We can import all the records from this table using the Sqoop command-line import option. To create the test Dataset, execute the steps written in loadToMySQL.sh.
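If you are curious about what such a setup script does, the following is a minimal, hypothetical sketch of the kind of commands involved (the column names and the CSV file name are illustrative assumptions; the real steps are in loadToMySQL.sh):

# Hypothetical sketch only; see loadToMySQL.sh for the actual setup
mysql -u root -e "CREATE DATABASE IF NOT EXISTS rainfall;"
mysql -u root rainfall -e "CREATE TABLE IF NOT EXISTS rainfall_data (
  station_id VARCHAR(64), obs_date DATE, rainfall_mm DOUBLE);"
mysql -u root --local-infile=1 rainfall -e "LOAD DATA LOCAL INFILE 'rainfall.csv'
  INTO TABLE rainfall_data FIELDS TERMINATED BY ',';"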

We are ready to run the job. I have saved the following command in the script <BOOK_CODE_HOME>/ch07/sqoop_commandline/import_all_records.sh:

oozie sqoop -oozie http://localhost:11000/oozie -command import \
  --connect jdbc:mysql://localhost:3306/rainfall --username root \
  --password "" --table rainfall_data --target-dir \
  '/user/hue/learn_oozie/ch07/sqoop_commandline/rainfall/output' ...
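When the command is accepted, the Oozie CLI prints the ID of the job it has submitted. You can then follow its progress with the usual Oozie job commands (the job ID below is just a placeholder):

# Check status and logs of the submitted job (replace <job-id> with the ID printed by the CLI)
oozie job -oozie http://localhost:11000/oozie -info <job-id>
oozie job -oozie http://localhost:11000/oozie -log <job-id>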

Sqoop action


The Sqoop action allows us to include Sqoop commands as part of a broader Workflow, which in turn can be part of a data pipeline. All the parameters that Sqoop needs can be configured via XML elements.

Open the Sqoop action SVG diagram at <BOOK_CODE_HOME>/xsd_svg/sqoop-action-0.4 and see the different properties and elements required for the Sqoop action to work.

Check out the following SVG:

[Diagram: Sqoop action SVG]

Most of the elements required for the Sqoop action are similar to the ones we have already seen. The main definition of the Sqoop action can be done with one of two options:

  • command

  • arg

An example of the command option is as follows:

<command>import --connect jdbc:mysql://localhost/database --username sqoop --password sqoop --table tablenameinDB --hive-import --hive-table tablenameinHive</command>

Here's an example of an arg option:

<arg>import</arg>
<arg>--connect</arg>
<arg>jdbc:mysql://localhost</arg>
<arg>--username</arg>
<arg>...</arg>
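To put the action in context, here is a minimal sketch of a Workflow that wraps the import we ran from the command line earlier (the Workflow and node names, the ${jobTracker} and ${nameNode} properties, and the target directory are illustrative assumptions; adjust them for your cluster):

<workflow-app name="ch07_sqoop_import_wf" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://localhost:3306/rainfall --username root --table rainfall_data --target-dir /user/hue/learn_oozie/ch07/sqoop_action/rainfall/output</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop import failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Note that the content of the command element is split on whitespace, so values that contain spaces or need quoting (for example, an empty password) are better passed using the arg form or read from a password file.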

HCatalog


HCatalog provides the table and storage management layer for Hadoop. It brings various tools in the Hadoop ecosystem together. Using the HCatalog interface, different tools such as Hive, Pig, and MapReduce can read and write data on Hadoop. All of them can use the shared schema and datatypes provided by HCatalog. Sharing the read and write mechanism makes it easy to consume the output of one tool in another.

So how does HCatalog fit into the picture of Datasets? So far, we have seen HDFS folder-based Datasets, in which a success flag tells us that the data is available. Using HCatalog-based Datasets, we can trigger Oozie jobs when data in a given Hive partition becomes available for consumption. This takes Oozie to the next level of job dependency, where we can consume data as and when it becomes available in Hive.
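As a rough sketch of what this looks like, an HCatalog-based Dataset in a Coordinator uses an hcat:// URI template that points at a Hive table partition instead of an HDFS folder (the metastore host and port, database, table, and partition key below are illustrative assumptions):

<datasets>
    <dataset name="rainfall_by_day" frequency="${coord:days(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
        <uri-template>hcat://localhost:9083/rainfall/rainfall_data/dt=${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
</datasets>

The Coordinator then waits for that partition to be published in the Hive metastore, and HCatalog-specific EL functions such as ${coord:dataInPartitionFilter('input', 'pig')} can pass the resolved partition filter down to the Pig script.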

To quickly see an example of interoperability, let's see how Pig can use Hive tables and how HCatalog brings all the tools together...

Summary


This completes our chapter. We discussed the new concept of HCatalog and Oozie integration, which was released recently. We also covered the Sqoop action and used the concepts discussed in the previous chapters to build a Coordinator.

In the next chapter, we will see how to run Spark jobs from Oozie.
