Chapter 4. Running MapReduce Jobs

In this chapter, we will learn how to run MapReduce jobs using Oozie. MapReduce jobs are of two types: Java MapReduce jobs and Streaming jobs. Streaming jobs are written in languages other than Java. We will also enter the world of the when part of Workflow execution, using Coordinators to schedule our jobs.

In this chapter, we will do the following:

  • Run Java MapReduce jobs from Oozie

  • Run Streaming jobs from Oozie

  • Run Coordinator jobs

From a conceptual point of view, we will:

  • Understand the concept of Coordinators

  • Understand the concept of cron-based frequency schedules

  • Understand the importance of timezone in Oozie

  • Understand the concept of Datasets

Chapter case study


The customer for whom we work also keeps track of what its competitors are doing. They keep a close eye on all the press releases, job postings, and public interactions of competitors. Information about competitors from various sources is captured in text format and fed to the Hadoop system. Every weekend, analysis is done to find the trending topics and words used by competitors, to guess which areas they are working or investing in.

The preceding paragraph is an example of a first-level text analytics problem in the Big Data space. To solve it, we will run a classic word count using MapReduce, counting how many times each word appears across all of the documents.

Running MapReduce jobs from Oozie


We will see how to write a simple MapReduce job for word count and schedule it via Oozie. Later, we will wrap this in our first Coordinator job. Along this journey, we will learn some concepts and apply them in examples.

I have already saved the word count Java MapReduce code, which we will run over our input data. Let's dive into the code. You can check out the mapreduce folder in Book_Code_Folder/learn_oozie/ch04/.

Note

Check the workflow_0.5.xsd file in the xsd_svg folder and note the inputs needed for the MapReduce action to run.

The Workflow is shown in the following code, and we can see that the arguments are the same as the ones we need in the hadoop jar command for running a MapReduce job. At the start of the job, we delete the output folder, as Hadoop fails the job if the output folder already exists.

The mapper that we need is life.jugnu.learnoozie.ch04.WordCountMapper and the reducer is life.jugnu.learnoozie.ch04.WordCountReducer. Both of them are present...
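The full Workflow from the code folder is not reproduced above, so here is a minimal sketch of what a map-reduce action Workflow along these lines can look like. The Workflow name and the ${input_dir} and ${output_dir} variables are hypothetical placeholders, and the classes are assumed to use the old mapred API (if they use the new API, the mapreduce.* property names and the new-api flags apply instead):

<workflow-app name="WordCount_Workflow" xmlns="uri:oozie:workflow:0.5">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <!-- Delete the output folder first; Hadoop fails if it already exists -->
        <delete path="${nameNode}${output_dir}"/>
      </prepare>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>life.jugnu.learnoozie.ch04.WordCountMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>life.jugnu.learnoozie.ch04.WordCountReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${input_dir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${output_dir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="End"/>
    <error to="Kill"/>
  </action>
  <kill name="Kill">
    <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="End"/>
</workflow-app>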

Running Oozie MapReduce job


Oozie has command-line functionality to submit a job that consists of just a MapReduce action. The command-line options we saw in the previous section can be used whenever we have a Workflow or Coordinator with a complex DAG.

To run an Oozie job that is just a simple MapReduce action, we can use the oozie mapreduce command-line option.

Here's an example:

oozie mapreduce -config job.properties -oozie http://localhost:11000/oozie

Tip

We can also choose to pass on variables such as input and output from the command line.
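For example, if the Workflow refers to hypothetical ${input_dir} and ${output_dir} variables, they might be overridden at submission time with the -D option (a minimal sketch; check that your Oozie CLI version supports -D for the mapreduce command):

oozie mapreduce -oozie http://localhost:11000/oozie -config job.properties \
  -Dinput_dir=/user/learn_oozie/ch04/input \
  -Doutput_dir=/user/learn_oozie/ch04/output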

In this section, we made our Workflow using the MapReduce action and used the command-line Oozie job option with the job.properties file to run the same.

Let's move on to the next topic of Coordinators.

Coordinators


Coordinators allow us to run interdependent Workflows as data pipelines based on some starting criteria. They decide the when part of execution of an Oozie job. Most Oozie jobs are triggered at a given scheduled time interval, or when an input Dataset becomes available. Here are a few important definitions related to Coordinators:

  • Nominal time: This is the scheduled time at which the job should execute. For example, we process press releases every day at 8:00 P.M.

  • Actual time: This is the real time when the job runs. In some cases, if the input data does not arrive, the job might start late. This type of data-dependent job triggering is indicated by the <done-flag> tag (more on this later). The done-flag gives a signal to start the job execution.
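To make the <done-flag> idea concrete, here is a small hypothetical Dataset snippet (Datasets are covered properly later in the chapter); the flag names a file whose presence signals that an instance of the data is ready:

<dataset name="press_releases" frequency="${coord:days(1)}"
         initial-instance="2015-12-01T09:00Z" timezone="Australia/Sydney">
  <uri-template>${nameNode}/data/press_releases/${YEAR}${MONTH}${DAY}</uri-template>
  <!-- The instance is considered available only once this file exists -->
  <done-flag>_SUCCESS</done-flag>
</dataset>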

The general skeleton template of a Coordinator is shown in the following figure, named Coordinator template XML:

Coordinator template XML
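The figure itself is not reproduced here; as a rough sketch, the skeleton under the uri:oozie:coordinator:0.4 schema looks like the following (all child elements except <action> are optional, and the "..." marks are placeholders):

<coordinator-app name="..." frequency="..." start="..." end="..." timezone="..." xmlns="uri:oozie:coordinator:0.4">
  <parameters>...</parameters>
  <controls>...</controls>
  <datasets>...</datasets>
  <input-events>...</input-events>
  <output-events>...</output-events>
  <action>
    <workflow>
      <app-path>...</app-path>
      <configuration>...</configuration>
    </workflow>
  </action>
</coordinator-app>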

The <parameters> tag on line 2 of the preceding template holds any variables defined...

My first Coordinator


In this section, we will write the scheduled job for running our MapReduce Workflow. Let's start with a simple Coordinator declaration. The code for the following example is present in the folder BOOK_CODE_HOME/learn_oozie/ch04/mapreduce_coordinator/v1.

Coordinator v1 definition

The Coordinator definition present in the coordinator.xml is as follows:

<coordinator-app name="My_First_Coordinator" frequency="${frequency}" start="${start_date}" end="${end_date}" timezone="Australia/Sydney" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${wf_application_path}</app-path>
    </workflow>
  </action>
</coordinator-app>

The Coordinator definition is simple. It says, "Run the Workflow wf_application_path with the given arguments start_date, end_date, and frequency."

job.properties v1 definition

Look at the values for the variables declared in the Coordinator definition. We will define them in the job.properties file:

#...
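The full file is not shown above; a hypothetical sketch, with host names, dates, and paths as placeholders to adapt to your cluster, might look like this:

nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
# Frequency in minutes (one day); Oozie 4.1+ also accepts cron-like expressions here
frequency=1440
# Oozie expects UTC timestamps in this format
start_date=2015-12-01T08:00Z
end_date=2015-12-31T08:00Z
wf_application_path=${nameNode}/user/learn_oozie/ch04/mapreduce
oozie.coord.application.path=${nameNode}/user/learn_oozie/ch04/mapreduce_coordinator/v1

With these in place, the Coordinator can be submitted in the usual way:

oozie job -oozie http://localhost:11000/oozie -config job.properties -run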

Running a MapReduce streaming job


In this section, we will learn how to run Hadoop Streaming jobs using Oozie. Hadoop Streaming gives us the ability to write MapReduce code in languages other than Java, such as Python, C++, and Ruby.

Note

Read the Oozie documentation at https://oozie.apache.org/docs/4.2.0/WorkflowFunctionalSpec.html#a3.2.2_Map-Reduce_Action and write a Workflow to run a Streaming job. Schedule the same using Coordinator. You can refer to the sample Python mapper and reducer code available at http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/.

Save the Python code from the preceding web links as mapper.py and reducer.py in the streaming folder.

The <file> tag makes our mapper and reducer files available to Oozie, while the <mapper> and <reducer> tags tell the Streaming job which commands to run.

The Workflow looks like this:

<workflow-app name="Mapreduce_Streaming_example" xmlns="uri:oozie:workflow:0.5">
  <start to="streaming-c097"/>
    <kill name="Kill">
      <message>Action failed, error message...
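The Workflow above is truncated; a sketch of how the streaming action itself might look follows. The action name matches the <start> transition, and the input/output paths are hypothetical placeholders:

<!-- ...continuing inside the <workflow-app> element -->
<action name="streaming-c097">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <prepare>
      <!-- Delete the output folder so re-runs do not fail -->
      <delete path="${nameNode}/user/learn_oozie/ch04/streaming/output"/>
    </prepare>
    <streaming>
      <mapper>python mapper.py</mapper>
      <reducer>python reducer.py</reducer>
    </streaming>
    <configuration>
      <property>
        <name>mapred.input.dir</name>
        <value>/user/learn_oozie/ch04/streaming/input</value>
      </property>
      <property>
        <name>mapred.output.dir</name>
        <value>/user/learn_oozie/ch04/streaming/output</value>
      </property>
    </configuration>
    <!-- Ship the Python scripts from the Workflow application directory -->
    <file>mapper.py#mapper.py</file>
    <file>reducer.py#reducer.py</file>
  </map-reduce>
  <ok to="End"/>
  <error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>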

Summary


In this chapter, we saw how to run Java MapReduce jobs as part of the Oozie Workflow. We discussed the concept of Coordinators and scheduled the job using the same. We also covered datasets, frequency specification, and cron-based schedules.

In the next chapter, we will see how to run Hive jobs from Oozie. We will continue to build our Coordinator concepts.
