Apache Hadoop 3 Quick Start Guide


Product type: Book
Published: October 2018
Publisher: Packt
ISBN-13: 9781788999830
Pages: 220
Edition: 1st
Author: Hrishikesh Vijay Karambelkar

Table of Contents (10 chapters)

  1. Preface
  2. Hadoop 3.0 - Background and Introduction
  3. Planning and Setting Up Hadoop Clusters
  4. Deep Dive into the Hadoop Distributed File System
  5. Developing MapReduce Applications
  6. Building Rich YARN Applications
  7. Monitoring and Administration of a Hadoop Cluster
  8. Demystifying Hadoop Ecosystem Components
  9. Advanced Topics in Apache Hadoop
  10. Other Books You May Enjoy

Developing MapReduce Applications

"Programs must be written for people to read, and only incidentally for machines to execute."
– Harold Abelson, Structure and Interpretation of Computer Programs, 1984

When Apache Hadoop was designed, it was intended for large-scale processing of data volumes that traditional programming techniques could not handle. At that time, MapReduce was considered an integral part of Apache Hadoop, and in early releases it was the only programming model available. Later releases enhanced it with YARN; the YARN-based MapReduce is also called MRv2, while the older MapReduce is usually referred to as MRv1. In the previous chapter, we saw how HDFS can be configured and used by various applications. In this chapter, we will take a deep dive into MapReduce programming to learn the different facets of how you can effectively...

Technical requirements

You will need the Eclipse development environment and Java 8 installed on a system where you can run and tweak these examples. If you prefer to use Maven, you will need Maven installed to compile the code. To run the examples, you will also need an Apache Hadoop 3.1 setup on a Linux system. Finally, to use the Git repository of this book, you need to install Git.

The code files of this chapter can be found on GitHub:

https://github.com/PacktPublishing/Apache-Hadoop-3-Quick-Start-Guide/tree/master/Chapter4
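If you want to work with the examples locally, the repository can be cloned in the usual way (the Chapter4 directory referenced in the URL above lives inside it):

git clone https://github.com/PacktPublishing/Apache-Hadoop-3-Quick-Start-Guide.git
cd Apache-Hadoop-3-Quick-Start-Guide/Chapter4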

Check out the following video to see the code in action:

http://bit.ly/2znViEb

How MapReduce works

MapReduce is a programming model used for writing applications on Apache Hadoop. It allows programs to run on a large, scalable cluster of servers. MapReduce was inspired by functional programming (https://en.wikipedia.org/wiki/Functional_programming). Functional programming (FP) offers unique features when compared to today's popular programming paradigms, such as object-oriented (Java and JavaScript), declarative (SQL and CSS), or procedural (C, PHP, and Python) programming. You can look at a comparison of multiple programming paradigms here. While we see a lot of interest in functional programming in academia, we rarely see equivalent enthusiasm from the developer community. Many developers and mentors argue that MapReduce is not actually a functional programming paradigm. Higher-order functions in FP are functions that can take a function as...
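To make the functional-programming connection concrete, here is a small, self-contained Java sketch (plain Java 8, not Hadoop code; the class name and word list are illustrative) that uses the higher-order map() and reduce() functions of the streams API, the same shape of computation that Hadoop distributes across a cluster:

import java.util.Arrays;
import java.util.List;

// map() and reduce() are higher-order functions: each takes another
// function as an argument, which is the idea MapReduce borrows from FP.
public class FunctionalSketch {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("hadoop", "mapreduce", "yarn");

        int totalLength = words.stream()
                .map(String::length)      // "map" phase: word -> length
                .reduce(0, Integer::sum); // "reduce" phase: sum the lengths

        System.out.println("Total length: " + totalLength); // prints 19
    }
}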

Configuring a MapReduce environment

When you install Hadoop, the default environment is set up with MapReduce, so you do not need to make any major configuration changes. However, if you wish to run a MapReduce program in an environment that has already been set up, ensure that the following property is set to local or classic in mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>local</value>
</property>

I have elaborated on this property in detail in the next section.

Working with mapred-site.xml

We have seen the core-site.xml and hdfs-site.xml files in previous chapters. To configure MapReduce, Hadoop primarily provides mapred-site.xml. In addition...

Understanding Hadoop APIs and packages

Now let's go through some of the key APIs that you will be using while you program in MapReduce. First, let's understand the important packages that are part of Apache Hadoop MapReduce APIs and their capabilities:

...

org.apache.hadoop.mapred: Primarily provides interfaces for MapReduce, input/output formats, and job-related classes. This is the older API.

org.apache.hadoop.mapred.lib: Contains libraries for Mapper, Reducer, partitioners, and so on. To be avoided; use the newer org.apache.hadoop.mapreduce.lib packages instead.

org.apache.hadoop.mapred.pipes: Job submitter-related classes.

org.apache.hadoop.mapred.tools: Command-line tools associated with MapReduce.

org.apache.hadoop.mapred.uploader: Classes related to the MapReduce framework upload tool.
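Of these, the newer org.apache.hadoop.mapreduce API is the one you will normally program against. As an illustration (the class name and the word-length logic are mine, not part of the Hadoop API), a minimal Mapper written against the newer API looks like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative Mapper using the newer org.apache.hadoop.mapreduce API:
// emits a (word, length-of-word) pair for each word in the input line.
public class WordLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), new IntWritable(word.length()));
            }
        }
    }
}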

Setting up a MapReduce project

In this section, we will learn how to create the environment for writing MapReduce applications. The programming is typically done in Java, and the development of a MapReduce application follows standard Java development practice:

  1. Usually, developers write the programs in a development environment such as Eclipse or NetBeans.
  2. Developers unit test their code, usually with a small subset of data. In case of failure, they can use the IDE debugger to identify faults.
  3. The application is then packaged into a JAR file and tested standalone for functionality.
  4. Developers should ideally write unit test cases to test each piece of functionality (see the sketch after this list).
  5. Once the application is tested in standalone mode, developers should test it in a cluster or pseudo-distributed environment with full datasets. This will expose more problems, which can then be fixed. Here debugging...
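As a sketch of step 4, the following unit test uses Apache MRUnit, a MapReduce test harness that is not part of core Hadoop (you would need its JAR on the test classpath, and the project has since been retired); it exercises the WordLengthMapper sketched earlier:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

// Feeds one input record to the mapper and verifies the expected
// (word, length) output pairs, without needing a running cluster.
public class WordLengthMapperTest {

    @Test
    public void mapsWordsToLengths() throws Exception {
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordLengthMapper());

        driver.withInput(new LongWritable(0), new Text("hadoop yarn"))
              .withOutput(new Text("hadoop"), new IntWritable(6))
              .withOutput(new Text("yarn"), new IntWritable(4))
              .runTest();
    }
}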

Deep diving into MapReduce APIs

Let's start looking at the different types of data structures and classes that you will use while writing MapReduce programs. We will look at the input and output data structures of MapReduce, and at the different classes that you can use for the Mapper, Combiner, Shuffle, and Reducer phases.
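As a first taste, here is a minimal Reducer sketch using the newer API (the class name is illustrative); it sums the values emitted for each key, and because summing is associative, the same class could also be registered as a Combiner:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative Reducer: sums all IntWritable values emitted for a key.
public class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}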

Configuring MapReduce jobs

Usually, when you write MapReduce programs, you start with the configuration APIs. In the programs that we ran in previous chapters, the following code represents the configuration part:
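The snippet below is a representative sketch rather than a verbatim listing; the job name and the driver, mapper, and reducer classes are illustrative placeholders:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word length");  // illustrative job name
job.setJarByClass(WordLengthDriver.class);       // hypothetical driver class
job.setMapperClass(WordLengthMapper.class);      // mapper sketched earlier
job.setCombinerClass(SumReducer.class);          // reducer reused as combiner
job.setReducerClass(SumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);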

The Configuration class (part of the org.apache.hadoop.conf package) provides access to different configuration parameters. The API reads properties from the supplied file. The configuration...
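For example, assuming an illustrative path for mapred-site.xml, you can load and query it as follows (Configuration comes from org.apache.hadoop.conf, Path from org.apache.hadoop.fs):

Configuration conf = new Configuration();
conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));   // path is illustrative
String framework = conf.get("mapreduce.framework.name", "local"); // second argument is the fallback value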

Compiling and running MapReduce jobs

In this section, we will cover compiling and running MapReduce jobs. We have already seen examples of how jobs can be run in standalone, pseudo-distributed, and cluster environments. Remember that you must compile your classes with the same versions of the libraries and Java that you will run in production; otherwise, you may get major.minor version mismatch errors at runtime (read the description here). In almost all cases, the JAR for a program is created and run directly through the following command:

hadoop jar <jarfile> <parameters>
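As a hedged sketch of the full cycle (the file and class names are illustrative), you might compile against the Hadoop classpath, package, and run as follows:

javac -classpath "$(hadoop classpath)" -d classes WordLengthMapper.java SumReducer.java WordLengthDriver.java
jar cf wordlength.jar -C classes .
hadoop jar wordlength.jar WordLengthDriver /user/hadoop/input /user/hadoop/output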

Now let's look at different alternatives available for running the jobs.

Triggering the job remotely

...

Streaming in MapReduce programming

Traditional MapReduce programming requires users to write map and reduce functions that conform to the defined API. However, what if you already have processing logic written, and you want to hand the processing over to your own functions while still using the MapReduce model over the Hadoop Distributed File System? This is possible with the streaming and pipes features of Apache Hadoop.

Hadoop streaming allows users to code their logic in any programming language, such as C, C++, or Python, and provides a hook for the custom logic to integrate with the traditional MapReduce framework with minimal or no Java code. The Hadoop streaming APIs allow users to run any scripts or executables outside of the traditional Java platform. This capability is similar to the Unix pipe function (https://en.wikipedia...
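For example, a streaming job that plugs in hypothetical mapper.py and reducer.py scripts might be submitted as follows (the streaming JAR version and HDFS paths must match your installation; -files ships the scripts to the cluster):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.1.0.jar \
  -files mapper.py,reducer.py \
  -input /user/hadoop/input \
  -output /user/hadoop/output \
  -mapper mapper.py \
  -reducer reducer.py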

Summary

In this chapter, we have gone through various topics pertaining to MapReduce in a deeper walkthrough. We started by understanding the concept of MapReduce and an example of how it works. We then configured the config files for a MapReduce environment, and we also configured the job history server and looked at Hadoop application URLs, ports, and so on. After configuration, we focused on the hands-on work of setting up a MapReduce project and going through the Hadoop packages, and then took a deeper dive into writing MapReduce programs. We also studied the different data formats needed for MapReduce. Later, we looked at job compilation, remote job runs, and utilities such as Tool that make life simpler. We then studied unit testing and failure handling.

Now that you are able to write applications in MapReduce, in the next chapter, we will start looking at building applications...
