Chapter 2. My First Oozie Job
In this chapter, we will dive into the world of Oozie by running our first Oozie job. We will also set up Hue, which will allow us to edit Oozie Workflows from a graphical user interface. We will use the Hortonworks VirtualBox machine for all the projects in this book.
In this chapter, we will do the following:
Install and configure the Hue Oozie Workflow editor
Run our first Oozie Workflow job
Understand the concepts of Workflows, Coordinators, and Bundles
Understand Oozie Fs actions
Use the Oozie console to see the job status
Use the Oozie command line to get the job status
Installing and configuring Hue
The Hortonworks virtual machine already ships with a version of Hue, but it is very old. We will install the latest version of Hue ourselves, since it has a better Oozie editor.
Start the virtual machine. Once it is up and running, we can log in to it via SSH using the following command:
The default password is hadoop.
Let's download and configure Hue. Here are the steps to do so:
Download the latest release of Hue.
Install the dependencies required to build Hue via yum.
Build the Hue package using the make command.
Before you execute the following commands, check the Hue website (http://gethue.com/category/release/) to find the latest version of Hue. I have used 3.8.1 in this book, but I suggest that you download the latest release. The only change needed in the following commands is to replace 3.8.1 with the latest version number:
Before we move further, let's look at a few basic concepts of Oozie. In each chapter, we will take some time to learn some new concepts of Oozie besides looking at working examples.
A Workflow tells Oozie what to do. It is a DAG (https://en.wikipedia.org/wiki/Directed_acyclic_graph) representation of actions (tasks), that is, a collection of actions arranged in the required dependency graph. As part of a Workflow's definition, we write some actions and call them in a certain order.
Actions come in various types, one for each kind of task that we can run as part of a Workflow, for example, Fs (Hadoop filesystem) action, Pig action, Hive action, MapReduce action, Spark action, and so on. We will discuss the Fs action in this chapter.
A Coordinator tells Oozie when to do a task. In the Oozie world, the "when" is decided either by time or by the availability of a given input data set. We will discuss Coordinators later in this book.
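As a small preview of the time-based case, a Coordinator definition might look like the following sketch. It is only an illustration under stated assumptions: the Coordinator name, the start and end dates, the schema version (uri:oozie:coordinator:0.4), and the workflowAppPath property are all hypothetical, and the data-availability case (which needs dataset definitions) is not shown.

```xml
<!-- Hypothetical Coordinator that runs one Workflow once a day -->
<coordinator-app name="my_daily_coordinator"
                 frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2015-12-31T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- workflowAppPath is an assumed job property: the HDFS
                 folder that contains the Workflow definition to run -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

The frequency attribute answers "when", while the app-path element points at the Workflow that answers "what".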
Bundles tell Oozie what all things to...
Throughout this book, we will work through a case study that revolves around various concepts of Oozie.
One of the main use cases of Hadoop is ETL data processing.
Suppose we work for a large consulting company and have won a project to set up a Big Data cluster inside the customer's data center. At a high level, the requirement is to set up an environment that supports the following flow:
Get data from various sources into Hadoop (file-based loads and Sqoop-based loads).
Preprocess the data with various scripts (Pig, Hive, and MapReduce).
Insert that data into Hive tables for use by analysts and data scientists.
Data scientists then write machine learning models (Spark).
We will use Oozie as our processing and scheduling system for all the preceding tasks. Since writing actual Hive, Sqoop, MapReduce, Pig, and Spark code is outside the scope of this book, I will not explain the business logic of those scripts in depth, and I have kept them very simple.
In our architecture, we have one landing...
Running our first Oozie job
We will start with a very simple example. In this chapter, our use case is to delete a given folder on HDFS via Oozie. In our case study, we receive data daily in one folder in HDFS, and we want to delete the previous day's data so that only the latest version stays in our system. Let's solve our business problem:
Log in to Hue and go to Workflows | Editor.
In the top row of the editor, there are various types of actions. Select the Hadoop Fs action.
Tip
Take some time to hover your mouse over the icons and read the names of the various types of actions that Oozie can run.
Drag the Hadoop Fs action to the editor as shown in the next screenshot.
Give a meaningful name to this action, for example, my_delete_folder_action.
Give the path of the folder that you want to delete. I have used /user/hue/learn_oozie/my_first_oozie_job. I have also set the name of the Workflow as My First Oozie Job, as shown in the following screenshot:
Make these changes and click on Save for the Workflow...
A Workflow is composed of nodes; the logical DAG of nodes represents the work that Oozie does and the order in which it does it. Each node performs a specific piece of work, then moves to one node on success or to a different node on failure. For example, on success it moves to the node named in its OK transition, and on failure it moves to the Kill node.
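To make the success and failure transitions concrete, here is a minimal sketch of what a Workflow definition for our delete-folder job could look like in Oozie's XML format. It is an illustration and not necessarily the exact file that Hue generates: the schema version (uri:oozie:workflow:0.5), the nameNode property, and the node names End and Kill are assumptions, while the action name and the folder path are the ones used earlier in this chapter.

```xml
<workflow-app name="My First Oozie Job" xmlns="uri:oozie:workflow:0.5">
    <!-- Start node: points at the first node to run -->
    <start to="my_delete_folder_action"/>

    <!-- Action node: an Fs action that deletes the folder -->
    <action name="my_delete_folder_action">
        <fs>
            <!-- nameNode is an assumed job property, for example
                 the hdfs:// address of the cluster's NameNode -->
            <delete path="${nameNode}/user/hue/learn_oozie/my_first_oozie_job"/>
        </fs>
        <!-- Transition taken when the action succeeds -->
        <ok to="End"/>
        <!-- Transition taken when the action fails -->
        <error to="Kill"/>
    </action>

    <!-- Kill node: ends the job in a failed state with a message -->
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <!-- End node: ends the job successfully -->
    <end name="End"/>
</workflow-app>
```

Reading the to attributes from top to bottom gives the DAG: Start, then the Fs action, then either End or Kill.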
Nodes in the Oozie Workflow are of the following types:
Control flow nodes
Action nodes
Let's discuss them in detail.
Control flow nodes define the start, the end, and the flow of control inside the Workflow. They can be one of the following:
Start node
End node
Kill node
Decision node
Fork and Join node
You have already seen examples of the Start, End, and Kill nodes. In programming terms, Decision nodes represent switch or if else conditions, and Fork and Join nodes represent parallel branches of code.
Let's see a sample syntax for Decision and Fork/Join nodes next.
Here's the general syntax for a Decision node:
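As a rough sketch based on the Oozie Workflow XML schema, a Decision node and a Fork/Join pair might be written as follows. All node names (check_folder, my_fork, my_join, first_parallel_action, second_parallel_action) and the folderPath property are hypothetical; fs:exists is one of Oozie's built-in EL functions.

```xml
<!-- Decision node: cases are evaluated in order, like a switch statement -->
<decision name="check_folder">
    <switch>
        <!-- folderPath is a hypothetical job property holding an HDFS path -->
        <case to="my_delete_folder_action">${fs:exists(folderPath)}</case>
        <!-- Taken when no case predicate evaluates to true -->
        <default to="End"/>
    </switch>
</decision>

<!-- Fork node: starts every listed path in parallel -->
<fork name="my_fork">
    <path start="first_parallel_action"/>
    <path start="second_parallel_action"/>
</fork>

<!-- Join node: waits for all forked paths to finish, then continues -->
<join name="my_join" to="End"/>
```

In this sketch, the Decision node skips the delete action when the folder does not exist, and each forked action would name my_join in its OK transition so that the branches meet again.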
The Oozie web console is a web-based tool that gives a read-only view of the jobs.
In your web browser, open the URL http://127.0.0.1:11000/oozie, as shown in the following screenshot:
At the top of the screen, we have the following tabs:
Workflows
Coordinators
Bundles
System Info
Instrumentation
Settings
Find our job, My First Oozie Job, in the list; in my console many other jobs have also run, so your view will be different. Click on your job and see that Oozie has divided it according to the tasks in the Workflow. Start, the Fs action, and End were the steps of the Workflow, so each of them is represented in the log.
Click on the last tab, which says Job DAG. This shows the flow of the job. Since our job was simple, the DAG is just a linear flow. In future jobs, we will see more complex DAGs.
The important use of the console is when our job fails. Let's see an example of a job that has not completed successfully. We can click on the required action to see the logs and detailed...
In the last section of this chapter, we will see how to view the status of our job via the command line. We have already seen one way of checking job status via the Oozie web console.
Start an SSH session to the virtual machine and use the following command:
The general syntax is as follows:
The following screenshot shows the output of the preceding command:
Note
Explore the output of the oozie help job command. Note the various options and commands that we can execute on a given job.
In this chapter, we saw how to run a simple Oozie job from the Hue console. We discussed the concept of a Workflow in detail and saw how to use the Fs action. We also checked the job logs using the web console and checked the job status using the command line.
In the next chapter, we will see how to submit a job without using Hue. We will discuss how to use the Oozie command-line tool to submit a job and get an idea about the job.properties file. We will also look at Control nodes, Fork, and Join in detail.