Chapter 8. Running Spark Jobs

In this chapter, we will see how to run Spark jobs from Oozie. Spark has changed the whole ecosystem of Hadoop and the Big Data world. It can be used as an ETL tool or a machine learning tool, and it can be used where we have traditionally used Pig, Hive, or Sqoop.

In this chapter, we will:

  • Create an Oozie Workflow for Spark actions

From a conceptual point of view, we will:

  • Understand the concept of Bundles

We will start off with a simple Workflow in which we rewrite, in Spark, the same Pig logic for finding the maximum rainfall in a given month, and then schedule it using an Oozie Workflow and Coordinator. The idea is to show the beauty of Spark: how seamlessly it replaces tools such as Pig or Hive, and how it has become the default execution engine of the Big Data platform. If you are a keen follower of Hadoop news, Cloudera recently announced that they are phasing out MapReduce and putting all their eggs in the Spark basket. The vast number...

Spark action


The Spark action was recently added to Oozie; its general XSD is shown in the following figure:

Spark action XSD

The general schema is as follows:

<action name="[NODE-NAME]">
  <spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker/>     <!-- JobTracker/ResourceManager details -->
    <name-node/>       <!-- NameNode details -->
    <prepare/>         <!-- Create or delete directories before the job -->
    <job-xml/>         <!-- Any job XML properties -->
    <configuration/>   <!-- Hadoop job configuration -->
    <master/>          <!-- Spark master details -->
    <mode/>            <!-- Spark driver mode -->
    <name/>            <!-- Spark job name -->
    <class/>           <!-- Spark main class -->
    <jar/>             <!-- Spark application jar -->
    <spark-opts/>      <!-- Spark job options -->
    <arg/>             <!-- Arguments for the job -->
  </spark>
  <ok to="[NEXT-NODE]"/>
  <error to="[ERROR-NODE]"/>
</action>

The <master> element specifies the URL of the Spark master. Spark can run under different cluster managers, namely Spark standalone, Mesos, and YARN. Depending on which cluster manager you are using, the master URL will change...
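To make this concrete, here is a minimal workflow.xml sketch that runs our maximum-rainfall job as a Spark action on YARN. The class name, jar path, and input/output directories are hypothetical placeholders for illustration:

<workflow-app name="spark-rainfall-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="spark-node"/>
  <action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <!-- Clean the output directory so that reruns do not fail -->
        <delete path="${nameNode}${outputDir}"/>
      </prepare>
      <master>yarn</master>
      <mode>cluster</mode>
      <name>MaxRainfall</name>
      <!-- Hypothetical driver class and application jar -->
      <class>com.example.rainfall.MaxRainfall</class>
      <jar>${nameNode}${appDir}/lib/max-rainfall.jar</jar>
      <spark-opts>--executor-memory 1G --num-executors 2</spark-opts>
      <arg>${nameNode}${inputDir}</arg>
      <arg>${nameNode}${outputDir}</arg>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <fail name="fail">
    <message>Spark action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </fail>
  <end name="end"/>
</workflow-app>

The properties ${jobTracker}, ${nameNode}, ${appDir}, ${inputDir}, and ${outputDir} would be supplied through job.properties when submitting the Workflow.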

Bundles


So far, you've learned about Workflows (what to do) and Coordinators (when to do) in Oozie.

Now we will cover Bundles. A Bundle is a group of Coordinators that are managed together as a single unit. This makes it easy to start, stop, and resume a whole set of Coordinator jobs in one operation.

The basic XSD diagram for Bundles is shown here:

Bundles specification

A Bundle needs information about the set of Coordinators it is responsible for, along with a kick-off time. The kick-off time is the time at which the Bundle should start and submit all of its applications to the Oozie server. The Coordinators that form a Bundle may or may not have a relationship between them; they can be part of the same or of different data pipelines. Generally, the best practice is to bundle all tables that come from the same database, or to bundle all Coordinators that are part of the same data pipeline.
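As an illustration, here is a minimal bundle.xml sketch that groups two hypothetical Coordinators (an import and an export) under one kick-off time; the application paths and property names are placeholders:

<bundle-app name="rainfall-bundle" xmlns="uri:oozie:bundle:0.2">
  <controls>
    <!-- The time at which Oozie submits all Coordinators in this Bundle -->
    <kick-off-time>${kickOffTime}</kick-off-time>
  </controls>
  <!-- Coordinator 1: import rainfall data into Hadoop -->
  <coordinator name="import-coord">
    <app-path>${nameNode}${appDir}/import-coordinator</app-path>
  </coordinator>
  <!-- Coordinator 2: send processed results back to the database -->
  <coordinator name="export-coord">
    <app-path>${nameNode}${appDir}/export-coordinator</app-path>
  </coordinator>
</bundle-app>

Starting, suspending, or killing this one Bundle job then acts on both Coordinators at once.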

You might want to check out the pictorial representation of Bundle's job flow on this blog:

http...

Data pipelines


In real Big Data projects, Coordinators are the scheduled tasks that make up a data pipeline. For example, getting data from some system and processing it forms one Coordinator, and another subprocess that sends the processed data to a database forms a second Coordinator. Finally, both of them are abstracted into a Bundle. To think about how to solve your job using Oozie, start by drawing the job's Workflow on a whiteboard or paper, then discuss with your team how to break it into unit abstractions that can run individually and in isolation.

Check out the following example.

The database has a record of daily rainfall in Melbourne. We import that data into Hadoop using a regular Coordinator job (Coordinator 1). Using another scheduled job (Coordinator 2), we send the results back to the database, as shown in the following figure:

Data pipelines
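Here is a minimal sketch of what Coordinator 1 might look like, assuming a daily frequency and a hypothetical application path for the import Workflow:

<coordinator-app name="import-coord" frequency="${coord:days(1)}"
                 start="${start}" end="${end}" timezone="Australia/Melbourne"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- The Workflow that imports the daily rainfall records -->
      <app-path>${nameNode}${appDir}/import-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>

Coordinator 2 would look much the same, pointing at an export Workflow, and the Bundle shown earlier ties the two together.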

Note

Exercise: Take the preceding example and make one Bundle that processes our rainfall data in the first Coordinator (using a Pig script) and sends...

Summary


In this chapter, we saw how to run Apache Spark jobs from Oozie. We then discussed Bundles and, finally, how to think in terms of data pipelines.

In the next chapter, we will talk about various production-related concepts and day-to-day tasks that are helpful while running Oozie.
