Apache Oozie Essentials


Product type Book
Published in Dec 2015
ISBN-13 9781785880384
Pages 164 pages
Edition 1st Edition
Author (1):
Jagat Singh

Chapter 2. My First Oozie Job

In this chapter, we will dive into the world of Oozie by running our first Oozie job. We will also set up Hue, which will allow us to edit Oozie Workflows from a graphical user interface. We will use the Hortonworks VirtualBox machine for all the projects in this book.

In this chapter, we will do the following:

  • Install and configure Hue Oozie Workflow editor

  • Run our first Oozie Workflow job

  • Understand the concept of Workflow, Coordinator, and Bundles

  • Understand Oozie Fs actions

  • Use Oozie console to see the job status

  • Use the Oozie command line to get the job status

Installing and configuring Hue


The Hortonworks virtual machine already has a version of Hue running, but it is very old. We will install the latest version of Hue ourselves because it has a better Oozie editor.

Start the virtual machine. Once the machine is up and running, we can log in to it via SSH using the following command:

$ ssh root@127.0.0.1 -p 2222

The default password is hadoop.

Let's download and configure Hue. Here are the steps to do so:

  1. Download the latest release of Hue.

  2. Install the dependencies required to build Hue via yum.

  3. Build the Hue package using the make command.

  4. Before you execute the following commands, check the Hue website (http://gethue.com/category/release/) to find the latest version of Hue. I have used 3.8.1 in this book, but I suggest you download the latest one. The only change needed in the following commands is to replace 3.8.1 with the latest version number:

    $ mkdir -p /opt/learn_oozie/hue
    $ chmod 777 /opt/learn_oozie/hue
    $ chown hue:hue...

Oozie concepts


Before we move further, let's look at a few basic concepts of Oozie. In each chapter, we will take some time to learn some new concepts of Oozie besides looking at working examples.

Workflows

A Workflow tells Oozie what to do. It is the DAG (https://en.wikipedia.org/wiki/Directed_acyclic_graph) representation of actions (tasks), that is, a collection of actions arranged in the required dependency graph. As a part of a Workflow's definition, we write some actions and call them in a certain order.

There are various types of actions for the tasks that we can do as a part of a Workflow, for example, Fs (Hadoop filesystem) action, Pig action, Hive action, MapReduce action, Spark action, and so on. We will discuss the Fs action in this chapter.

Coordinator

A Coordinator tells Oozie when to do a task. In the Oozie world, "when" is decided either by time or by the availability of a given input dataset. We will discuss Coordinators later in this book.
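To make the "when" concrete, here is a minimal sketch of a time-triggered Coordinator definition, written to a file from the shell. The application name daily_cleanup_coord and the properties startTime, endTime, and workflowAppPath are assumptions made for illustration; they do not come from the book and would have to be supplied when the job is submitted.

```shell
# Sketch only: a Coordinator that runs one Workflow once per day.
# The name and the ${...} properties are hypothetical placeholders.
workdir="$(mktemp -d)"

# The quoted 'EOF' stops the shell from expanding the ${...} expressions,
# so they reach Oozie unchanged.
cat > "$workdir/coordinator.xml" <<'EOF'
<coordinator-app name="daily_cleanup_coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS path of the Workflow application to run -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
EOF

echo "Wrote $workdir/coordinator.xml"
```

This sketch covers only the time-based case. A Coordinator that waits for input data would additionally declare datasets and input-events elements, which are left out here.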

Bundles

Bundles tell Oozie what all things to...

Book case study


Throughout this book, we will work through a case study that revolves around various concepts of Oozie.

One of the main use cases of Hadoop is ETL data processing.

Suppose we work for a large consulting company and have won a project to set up a Big Data cluster inside the customer data center. On a high level, the requirements are to set up an environment that will satisfy the following flow:

  1. Get data from various sources into Hadoop (file-based loads and Sqoop-based loads).

  2. Preprocess it with various scripts (Pig, Hive, and MapReduce).

  3. Insert that data into Hive tables for use by analysts and data scientists.

  4. Data scientists then write machine learning models (Spark).

We will use Oozie as our scheduling system to run all the preceding tasks. Since writing actual Hive, Sqoop, MapReduce, Pig, and Spark code is not in the scope of this book, I will not explain the business logic of those scripts, and I have kept them very simple.

In our architecture, we have one landing...

Running our first Oozie job


We will start with a very simple example. In this chapter, our use case is to delete a given folder on HDFS via Oozie. In our case study, we get data daily in one folder in HDFS, but we want to delete the previous day's data and keep just the latest version in our system. Let's solve our business problem:

  1. Log in to Hue and go to Workflows | Editor.

  2. In the top row of the editor, there are various types of actions. Select the Hadoop Fs action.

    Tip

    Take some time to hover your mouse over the icons and read the names of the various types of actions that Oozie can run.

  3. Drag the Hadoop Fs action to the editor as shown in the next screenshot.

  4. Give a meaningful name to this action, for example, my_delete_folder_action.

  5. Give the path of the folder that you want to delete. I have used /user/hue/learn_oozie/my_first_oozie_job. I have also set the name of the Workflow as My First Oozie Job, as shown in the following screenshot:

    Hue Workflow editor

  6. Make these changes and click on Save for the Workflow...
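For reference, a Workflow like the one built in the steps above corresponds roughly to the following workflow.xml. This is a hand-written sketch, not the exact file that Hue generates: the node names kill and end, the schema version, and the ${nameNode} property are assumptions.

```shell
# Sketch only: write a workflow.xml with a single Fs delete action.
workdir="$(mktemp -d)"

# The quoted 'EOF' keeps ${nameNode} and the wf: functions literal,
# so Oozie (not the shell) resolves them at run time.
cat > "$workdir/workflow.xml" <<'EOF'
<workflow-app name="My First Oozie Job" xmlns="uri:oozie:workflow:0.4">
    <start to="my_delete_folder_action"/>
    <action name="my_delete_folder_action">
        <fs>
            <!-- ${nameNode} is an assumed job property, e.g. the HDFS URI -->
            <delete path="${nameNode}/user/hue/learn_oozie/my_first_oozie_job"/>
        </fs>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Fs action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

echo "Wrote $workdir/workflow.xml"
```

The ok and error elements are the two outgoing transitions of the action: on success the Workflow moves to the end node, and on failure it moves to the kill node.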

Types of nodes


A Workflow is composed of nodes; the logical DAG of nodes represents the work that Oozie does. Each node does a specified piece of work and then moves to one node on success or to another node on failure. For example, on success it goes to the OK node and on failure it goes to the Kill node.

Nodes in the Oozie Workflow are of the following types:

  • Control flow nodes

  • Action nodes

Let's discuss them in detail.

Control flow nodes

These nodes are responsible for defining the start, the end, and the control flow of the Workflow. They can be one of the following:

  • Start node

  • End node

  • Kill node

  • Decision node

  • Fork and Join node

You have already seen examples of the Start, End, and Kill nodes. In programming terms, Decision nodes represent switch or if-else conditions, and Fork and Join nodes represent parallel branches of code.

Let's see a sample syntax for Decision and Fork/Join nodes next.

Here's the general syntax for a Decision node:

<workflow-app name...
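As a rough illustration of how these control nodes fit together (a sketch under stated assumptions, not the book's listing), the following writes a small Workflow definition in which a Decision node checks whether an input folder exists and a Fork/Join pair runs two Fs actions in parallel. All node names and the inputDir, scratchDirA, and scratchDirB properties are hypothetical.

```shell
# Sketch only: Decision, Fork, and Join nodes in one Workflow definition.
workdir="$(mktemp -d)"

# The quoted 'EOF' keeps the ${...} expressions literal for Oozie.
cat > "$workdir/control_nodes.xml" <<'EOF'
<workflow-app name="control_nodes_sketch" xmlns="uri:oozie:workflow:0.4">
    <start to="check_input"/>
    <!-- Decision: behaves like a switch; the first true case wins -->
    <decision name="check_input">
        <switch>
            <case to="split_work">${fs:exists(inputDir)}</case>
            <default to="end"/>
        </switch>
    </decision>
    <!-- Fork: starts both branches in parallel -->
    <fork name="split_work">
        <path start="clean_a"/>
        <path start="clean_b"/>
    </fork>
    <action name="clean_a">
        <fs><delete path="${scratchDirA}"/></fs>
        <ok to="merge_work"/>
        <error to="kill"/>
    </action>
    <action name="clean_b">
        <fs><delete path="${scratchDirB}"/></fs>
        <ok to="merge_work"/>
        <error to="kill"/>
    </action>
    <!-- Join: waits for both branches before moving on -->
    <join name="merge_work" to="end"/>
    <kill name="kill">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF

echo "Wrote $workdir/control_nodes.xml"
```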

Oozie web console


The Oozie web console is a web-based tool that gives a read-only view of the jobs.

In your web browser, open the URL http://127.0.0.1:11000/oozie, as shown in the following screenshot:

Oozie web console

At the top of the screen, we have the following tabs:

  • Workflows

  • Coordinators

  • Bundles

  • System Info

  • Instrumentation

  • Settings

Click on our job, My First Oozie Job. You can see that many other jobs have also been run on this machine, so your view will differ from the screenshot. Click on your job and see that Oozie has divided the job into the tasks of the Workflow. Start, the Fs action, and End were the steps of the Workflow, so each of them is represented in the log.

Click on the last tab, Job DAG. This shows the flow of the job. Since our job was simple, the DAG is just a linear flow. In future jobs, we will see more complex DAGs.

The console is most useful when a job fails. Let's see an example of a job that has not completed successfully. We can click on the required action to see the logs and detailed...

The Oozie command line


In the last section of this chapter, we will see how to view the status of our job via the command line. We have already seen one way of checking job status via the Oozie web console.

Start an SSH session to the virtual machine and use the following command:

$ oozie job -info 0000007-150727083427440-oozie-oozi-W --oozie http://sandbox.hortonworks.com:11000/oozie

The general syntax is as follows:

$ oozie job -info <job_id> --oozie <oozie_server_url>

The following screenshot shows the output of the preceding command:

Oozie job info

Note

Explore the output of the oozie help job command. Note the various options and commands that we can execute on a given job.
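As a hedged sketch of a few such commands: the Oozie client reads the server URL from the OOZIE_URL environment variable when the server URL option is omitted, so the URL only needs to be set once per session. The job ID is the one used earlier in this section and will differ on your machine. The guard around the calls is only there so that the snippet does nothing where the oozie client is not installed.

```shell
# Server URL taken from the command shown earlier; adjust for your setup.
export OOZIE_URL="http://sandbox.hortonworks.com:11000/oozie"

# Example job ID from this section; replace it with your own.
job_id="0000007-150727083427440-oozie-oozi-W"

if command -v oozie >/dev/null 2>&1; then
    oozie job -info "$job_id"        # status of the job and of each action
    oozie job -log "$job_id"         # log output of the job
    oozie jobs -jobtype wf -len 5    # list five Workflow jobs
else
    echo "oozie client not found; run this inside the virtual machine"
fi
```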

Summary


In this chapter, we saw how to run a simple Oozie job from the Hue console. We discussed the concept of Workflows in detail and saw how to use the Fs action. We also checked the job logs using the web console and checked the job status using the command line.

In the next chapter, we will see how to submit a job without using Hue. We will discuss how to use the Oozie command-line tool to submit a job and get an idea about the job.properties file. We will also look at Control nodes, Fork, and Join in detail.
