Chapter 7. Running Sqoop Jobs

In this chapter, we will see how to run Sqoop jobs from Oozie. Sqoop (SQL to Hadoop) is used to import and export data between different database systems and the Hadoop platform.

In this chapter, we will:

  • Run Sqoop jobs from the command line

  • Create an Oozie Workflow for Sqoop actions

  • Run Sqoop jobs from Coordinators

From the concepts point of view, we will:

  • Understand the concept of HCatalog Datasets

  • Understand HCatalog Coordinator and EL functions

Chapter case study


Let's add a twist to the rainfall use case we solved in the previous chapter. Instead of getting CSV files for the rainfall data, we need to import the rainfall data from a MySQL database and then move on to processing.

As the first step of the analysis, we need to bring the data into Hadoop using Sqoop. To do this, we will run a Sqoop import at the end of each day to get the data onto Hadoop, and then we will run our Pig script to process it and save the results to Hive.

Just like in the previous chapters, we will start with the command-line option to trigger jobs, then learn about the Sqoop action and how to schedule it via a Coordinator. Lastly, we will cover the concept of HCatalog Datasets. Let's get started.

Running Sqoop command line


The syntax for the Oozie Sqoop command-line execution is shown in the following screenshot:

[Screenshot: Sqoop command line]

Let's import all the records from the table to HDFS.

Note

For the sample MySQL database preparation, I have created a script at <BOOK_CODE_HOME>/ch07/sqoop_commandline/loadToMySQL.sh, which you can use to create a database to test the Sqoop import.

The database name is rainfall and the table is rainfall_data. We can import all the records from this table using the Sqoop command-line import option. To create the test Dataset, execute the steps written in loadToMySQL.sh.
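If you are curious about what such a setup script does, the following is a minimal, hypothetical sketch of the kind of commands involved (the column names and the CSV file name are illustrative assumptions; the real steps are in loadToMySQL.sh):

# Hypothetical sketch only; see loadToMySQL.sh for the actual setup
mysql -u root -e "CREATE DATABASE IF NOT EXISTS rainfall;"
mysql -u root rainfall -e "CREATE TABLE IF NOT EXISTS rainfall_data (
  station_id VARCHAR(64), obs_date DATE, rainfall_mm DOUBLE);"
mysql -u root --local-infile=1 rainfall -e "LOAD DATA LOCAL INFILE 'rainfall.csv'
  INTO TABLE rainfall_data FIELDS TERMINATED BY ',';"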

We are ready to run the job. I have saved the following command in the script <BOOK_CODE_HOME>/ch07/sqoop_commandline/import_all_records.sh:

oozie sqoop -oozie http://localhost:11000/oozie -command import \
  --connect jdbc:mysql://localhost:3306/rainfall --username root \
  --password "" --table rainfall_data --target-dir \
  '/user/hue/learn_oozie/ch07/sqoop_commandline/rainfall/output' ...
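When the command is accepted, the Oozie CLI prints the ID of the job it has submitted. You can then follow its progress with the usual Oozie job commands (the job ID below is just a placeholder):

# Check status and logs of the submitted job (replace <job-id> with the ID printed by the CLI)
oozie job -oozie http://localhost:11000/oozie -info <job-id>
oozie job -oozie http://localhost:11000/oozie -log <job-id>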

Sqoop action


The Sqoop action allows us to include Sqoop commands as part of a broader Workflow, which in turn can be part of a data pipeline. All the parameters that Sqoop needs can be configured via XML elements.

Open the Sqoop action SVG diagram at <BOOK_CODE_HOME>/xsd_svg/sqoop-action-0.4 and see the different properties and elements required for the Sqoop action to work.

Check out the following SVG:

[Diagram: Sqoop action SVG]

Most of the elements required for the Sqoop action are similar to the ones we have already seen. The main definition of the Sqoop action can be done with one of two options:

  • command

  • arg

An example of the command option is as follows:

<command>import --connect jdbc:mysql://localhost/database --username sqoop --password sqoop --table tablenameinDB --hive-import --hive-table tablenameinHive</command>

Here's an example of an arg option:

<arg>import</arg>
<arg>--connect</arg>
<arg>jdbc:mysql://localhost</arg>
<arg>--username</arg>
<arg>...</arg>
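To put the action in context, here is a minimal sketch of a Workflow that wraps the import we ran from the command line earlier (the Workflow and node names, the ${jobTracker} and ${nameNode} properties, and the target directory are illustrative assumptions; adjust them for your cluster):

<workflow-app name="ch07_sqoop_import_wf" xmlns="uri:oozie:workflow:0.4">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.4">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://localhost:3306/rainfall --username root --table rainfall_data --target-dir /user/hue/learn_oozie/ch07/sqoop_action/rainfall/output</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop import failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Note that the content of the command element is split on whitespace, so values that contain spaces or need quoting (for example, an empty password) are better passed using the arg form or read from a password file.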

HCatalog


HCatalog provides the table and storage management layer for Hadoop. It brings various tools in the Hadoop ecosystem together. Using the HCatalog interface, different tools such as Hive, Pig, and MapReduce can read and write data on Hadoop. All of them can use the shared schema and datatypes provided by HCatalog. Sharing the read and write mechanism makes it easy to consume the output of one tool in another.

So how does HCatalog fit into the picture of Datasets? So far, we have seen HDFS folder-based Datasets, in which a success flag tells us that the data is available. Using HCatalog-based Datasets, we can trigger Oozie jobs when data in a given Hive partition becomes available for consumption. This takes Oozie to the next level of job dependency, where we can consume data as and when it becomes available in Hive.
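As a rough sketch of what this looks like, an HCatalog-based Dataset in a Coordinator uses an hcat:// URI template that points at a Hive table partition instead of an HDFS folder (the metastore host and port, database, table, and partition key below are illustrative assumptions):

<datasets>
    <dataset name="rainfall_by_day" frequency="${coord:days(1)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
        <uri-template>hcat://localhost:9083/rainfall/rainfall_data/dt=${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
</datasets>

The Coordinator then waits for that partition to be published in the Hive metastore, and HCatalog-specific EL functions such as ${coord:dataInPartitionFilter('input', 'pig')} can pass the resolved partition filter down to the Pig script.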

To quickly see an example of interoperability, let's see how Pig can use Hive tables and how HCatalog brings all the tools together...

Summary


This completes our chapter. We discussed the new concept of HCatalog and Oozie integration, which was released recently. We also covered the Sqoop action and used the concepts discussed in the previous chapters to build a Coordinator.

In the next chapter, we will see how to run Spark jobs from Oozie.
