Chapter 4. Running MapReduce Jobs

In this chapter, we will learn how to run MapReduce jobs using Oozie. MapReduce jobs are of two types: Java MapReduce jobs and Streaming jobs. Streaming jobs are written in languages other than Java. We will also enter the world of the when part of Workflow execution, using Coordinators to schedule our jobs.

In this chapter, we will do the following:

  • Run Java MapReduce jobs from Oozie

  • Run Streaming jobs from Oozie

  • Run Coordinator jobs

From a conceptual point of view, we will:

  • Understand the concept of Coordinators

  • Understand the concept of cron-based frequency schedules

  • Understand the importance of timezone in Oozie

  • Understand the concept of Datasets

Chapter case study


The customer for whom we work also keeps track of what its competitors are doing. They keep a close eye on all the press releases, job postings, and public interactions of competitors. Information about competitors from various sources is captured in text format and fed to the Hadoop system. Every weekend, analysis is done to find the trending topics and words used by competitors, to guess which areas they are working or investing in.

The preceding paragraph is an example of a first-level text analytics problem in the Big Data space. To solve it, we will run a classic word count using MapReduce, counting how many times each word appears across all of the documents.

Running MapReduce jobs from Oozie


We will see how to write a simple MapReduce job for word count and schedule it via Oozie. Later, we will wrap this in our first Coordinator job. Along this journey, we will learn some concepts and apply them in examples.

I have already saved the word count Java MapReduce code, which we will run over our input data. Let's dive into the code. You can check out the mapreduce folder in Book_Code_Folder/learn_oozie/ch04/.

Note

Check the workflow_0.5.xsd file in the xsd_svg folder and note the inputs needed for the MapReduce action to run.

The Workflow is shown in the following code, and we can see that the arguments are the same as the ones we need in the hadoop jar command for running a MapReduce job. At the start of the job, we delete the output folder, as Hadoop fails the job if the output folder already exists.

The mapper that we need is life.jugnu.learnoozie.ch04.WordCountMapper and the reducer is life.jugnu.learnoozie.ch04.WordCountReducer. Both of them are present...
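The full Workflow from the code folder is not reproduced above, so here is a minimal sketch of what a map-reduce action Workflow along these lines can look like. The Workflow name and the ${input_dir} and ${output_dir} variables are hypothetical placeholders, and the classes are assumed to use the old mapred API (if they use the new API, the mapreduce.* property names and the new-api flags apply instead):

<workflow-app name="WordCount_Workflow" xmlns="uri:oozie:workflow:0.5">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <!-- Delete the output folder first; Hadoop fails if it already exists -->
        <delete path="${nameNode}${output_dir}"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>life.jugnu.learnoozie.ch04.WordCountMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>life.jugnu.learnoozie.ch04.WordCountReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${input_dir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${output_dir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <kill name="Kill">
    <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="End"/>
</workflow-app>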

Running Oozie MapReduce job


Oozie has command-line functionality to submit a job that consists of just a MapReduce action. The command-line options we saw in the previous section can be used whenever we have a Workflow or Coordinator with a complex DAG.

To run an Oozie job that is just a simple MapReduce action, we can use the oozie mapreduce command-line option.

Here's an example:

oozie mapreduce -config job.properties -oozie http://localhost:11000/oozie

Tip

We can also choose to pass on variables such as input and output from the command line.
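For example, if the Workflow refers to hypothetical ${input_dir} and ${output_dir} variables, they might be overridden at submission time with the -D option (a minimal sketch; check that your Oozie CLI version supports -D for the mapreduce command):

oozie mapreduce -oozie http://localhost:11000/oozie -config job.properties \
  -Dinput_dir=/user/learn_oozie/ch04/input \
  -Doutput_dir=/user/learn_oozie/ch04/output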

In this section, we made our Workflow using the MapReduce action and used the command-line Oozie job option with the job.properties file to run the same.

Let's move on to the next topic of Coordinators.

Coordinators


Coordinators allow us to run interdependent Workflows as data pipelines based on some starting criteria. They decide the when part of execution of an Oozie job. Most Oozie jobs are triggered at a given scheduled time interval, or when an input Dataset becomes available. Here are a few important definitions related to Coordinators:

  • Nominal time: This is the scheduled time at which the job should execute. For example, we process press releases every day at 8:00 P.M.

  • Actual time: This is the real time when the job runs. In some cases, if the input data does not arrive, the job might start late. This type of data-dependent job triggering is indicated by the <done-flag> tag (more on this later). The done-flag gives a signal to start the job execution.
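To make the <done-flag> idea concrete, here is a small hypothetical Dataset snippet (Datasets are covered properly later in the chapter); the flag names a file whose presence signals that an instance of the data is ready:

<dataset name="press_releases" frequency="${coord:days(1)}"
         initial-instance="2015-12-01T09:00Z" timezone="Australia/Sydney">
  <uri-template>${nameNode}/data/press_releases/${YEAR}${MONTH}${DAY}</uri-template>
  <!-- The instance is considered available only once this file exists -->
  <done-flag>_SUCCESS</done-flag>
</dataset>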

The general skeleton template of a Coordinator is shown in the following figure, named Coordinator template XML:

Coordinator template XML
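The figure itself is not reproduced here; as a rough sketch, the skeleton under the uri:oozie:coordinator:0.4 schema looks like the following (all child elements except <action> are optional, and the "..." marks are placeholders):

<coordinator-app name="..." frequency="..." start="..." end="..." timezone="..." xmlns="uri:oozie:coordinator:0.4">
  <parameters>...</parameters>
  <controls>...</controls>
  <datasets>...</datasets>
  <input-events>...</input-events>
  <output-events>...</output-events>
  <action>
    <workflow>
      <app-path>...</app-path>
      <configuration>...</configuration>
    </workflow>
  </action>
</coordinator-app>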

The <parameters> tag on line 2 of the preceding template holds any variables defined...

My first Coordinator


In this section, we will write the scheduled job for running our MapReduce Workflow. Let's start with a simple Coordinator declaration. The code for the following example is present in the folder BOOK_CODE_HOME/learn_oozie/ch04/mapreduce_coordinator/v1.

Coordinator v1 definition

The Coordinator definition present in the coordinator.xml is as follows:

<coordinator-app name="My_First_Coordinator" frequency="${frequency}" start="${start_date}" end="${end_date}" timezone="Australia/Sydney" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${wf_application_path}</app-path>
    </workflow>
  </action>
</coordinator-app>

The Coordinator definition is simple. It says, "Run the Workflow wf_application_path with the given arguments start_date, end_date, and frequency."

job.properties v1 definition

Look at the values for the variables declared in the Coordinator definition. We will define them in the job.properties file:

#...
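The full file is not shown above; a hypothetical sketch, with host names, dates, and paths as placeholders to adapt to your cluster, might look like this:

nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
# Frequency in minutes (one day); Oozie 4.1+ also accepts cron-like expressions here
frequency=1440
# Oozie expects UTC timestamps in this format
start_date=2015-12-01T08:00Z
end_date=2015-12-31T08:00Z
wf_application_path=${nameNode}/user/learn_oozie/ch04/mapreduce
oozie.coord.application.path=${nameNode}/user/learn_oozie/ch04/mapreduce_coordinator/v1

With these in place, the Coordinator can be submitted in the usual way:

oozie job -oozie http://localhost:11000/oozie -config job.properties -run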

Running a MapReduce streaming job


In this section, we will learn how to run Hadoop Streaming jobs using Oozie. Hadoop Streaming gives us the ability to write MapReduce code in languages other than Java, such as Python, C++, and Ruby.

Note

Read the Oozie documentation at https://oozie.apache.org/docs/4.2.0/WorkflowFunctionalSpec.html#a3.2.2_Map-Reduce_Action and write a Workflow to run a Streaming job. Schedule the same using Coordinator. You can refer to the sample Python mapper and reducer code available at http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/.

Save the Python code from the preceding web links as mapper.py and reducer.py in the streaming folder.

The <file> tag makes our mapper and reducer files available to Oozie, while the <mapper> and <reducer> tags tell the Streaming job which commands to run.

The Workflow looks like this:

<workflow-app name="Mapreduce_Streaming_example" xmlns="uri:oozie:workflow:0.5">
  <start to="streaming-c097"/>
    <kill name="Kill">
      <message>Action failed, error message...
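The Workflow above is truncated; a sketch of how the streaming action itself might look follows. The action name matches the <start> transition, and the input/output paths are hypothetical placeholders:

<!-- ...continuing inside the <workflow-app> element -->
<action name="streaming-c097">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <!-- Delete the output folder so re-runs do not fail -->
      <delete path="${nameNode}/user/learn_oozie/ch04/streaming/output"/>
    </prepare>
    <streaming>
      <mapper>python mapper.py</mapper>
      <reducer>python reducer.py</reducer>
    </streaming>
    <configuration>
      <property>
        <name>mapred.input.dir</name>
        <value>/user/learn_oozie/ch04/streaming/input</value>
      </property>
      <property>
        <name>mapred.output.dir</name>
        <value>/user/learn_oozie/ch04/streaming/output</value>
      </property>
    </configuration>
    <!-- Ship the Python scripts from the Workflow application directory -->
    <file>mapper.py#mapper.py</file>
    <file>reducer.py#reducer.py</file>
  </map-reduce>
  <ok to="End"/>
  <error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>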

Summary


In this chapter, we saw how to run Java MapReduce jobs as part of the Oozie Workflow. We discussed the concept of Coordinators and scheduled the job using the same. We also covered datasets, frequency specification, and cron-based schedules.

In the next chapter, we will see how to run Hive jobs from Oozie. We will continue to build our Coordinator concepts.
