Chapter 5. MapReduce Applications with Hadoop and Java

In the previous chapter we set up Hadoop across two Raspberry Pis. In this chapter we will delve into MapReduce, the core paradigm of Hadoop, and run our first MapReduce application.

We will also explore some of the technologies used in setting up a Hadoop cluster, and cover features such as HDFS in more detail.

MapReduce


MapReduce is a programming approach that allows systems to process large datasets in parallel.

The key concept is the use of two functions, Map and Reduce, which are combined to produce a desired result.

MapReduce has its genesis in functional programming, and similar constructs have been available in languages such as LISP for several decades. Google has been a driver in bringing the approach out of the functional programming paradigm and into the OOP (Object-Oriented Programming) world. Its contributions include publishing a seminal paper on the subject in 2004, and being granted a patent on the technology.

So how does MapReduce work? The Map function takes a data set and then operates on the data, returning another data set as an output. This output is then fed to the Reduce function, which subsequently operates on the data set once again and returns a smaller data set as an output.

So let's look at an example of how the Map function operates. The pseudocode function CtoF in the following code takes a list...
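The pseudocode itself is cut off in this preview, but judging by its name, CtoF takes a list of Celsius temperatures and returns the corresponding Fahrenheit values. A minimal Java sketch of such a Map-style function (the class name, method names, and types here are our own assumptions) could look like this:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CtoFExample {
  // Map-style function: applies the same conversion independently to
  // every element of the input list and returns a new list as output.
  static List<Double> ctoF(List<Double> celsius) {
    List<Double> fahrenheit = new ArrayList<Double>();
    for (double c : celsius) {
      fahrenheit.add(c * 9.0 / 5.0 + 32.0);
    }
    return fahrenheit;
  }

  public static void main(String[] args) {
    // 0 C -> 32 F, 100 C -> 212 F
    System.out.println(ctoF(Arrays.asList(0.0, 100.0)));
  }
}

Because each element is converted independently of the others, the work can be split across many nodes; this independence is exactly the property that MapReduce exploits for parallelism.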

MapReduce in Hadoop


In order to use Hadoop to run our MapReduce applications, there are some key terms used in the technology that we should understand.

These are briefly described as follows:

  • NameNode: The NameNode is responsible for keeping the directory tree of all the files stored in the system and tracking where the file data is stored in the cluster.

  • JobTracker: The JobTracker passes out MapReduce tasks to each Raspberry Pi in our cluster.

  • DataNode: The DataNodes in our cluster use HDFS to store replicated data.

  • TaskTracker: A node in our Raspberry Pi cluster that accepts Map and Reduce tasks handed out by the JobTracker.

  • Default configuration: A default configuration file provides the base settings that the site-specific files override or augment. An example is the core-default.xml file.

  • Site-specific configuration: You will be familiar with this from the previous chapter. This configuration contains specifics about our own development environment, such as our Raspberry Pi's IP address; a minimal example is sketched after this list.
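To make the override relationship concrete, the following is a minimal sketch of a site-specific core-site.xml entry that overrides the default file system URI defined in core-default.xml; the host name and port shown are assumptions and should match your own master node:

<configuration>
  <property>
    <!-- Site-specific override of the default file system URI;
         the host name and port here are assumptions -->
    <name>fs.default.name</name>
    <value>hdfs://raspberrypi:54310</value>
  </property>
</configuration>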

A comprehensive guide to Hadoop's terms and...

The WordCount MapReduce program


The following code is a typical example of a MapReduce program for Hadoop. Its purpose is to take a number of input files and return a count of each word located in them.

We will be running this application in order to illustrate how the files we added to HDFS can be processed.
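As a reminder, if your input files are not already on HDFS, they can be copied over with the hadoop command-line tool from the previous chapter; the local and HDFS paths below are hypothetical:

hadoop dfs -copyFromLocal /home/pi/books/license.txt /user/pi/wordcount/input/license.txt
hadoop dfs -ls /user/pi/wordcount/input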

Add the following code into a new file called WordCount.java, created with vim /home/pi/hadoop/apps/WordCount.java:

package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

The preceding code includes the various libraries we will need to use in our application. After this, let's add the WordCount class:

public class WordCount {

  public static class Map extends MapReduceBase implements
  Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private...
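The listing is truncated at this point in the preview. For orientation, in the canonical Hadoop WordCount example (which this code follows, using the old mapred API), the Map class continues along these lines, with a Reduce class to sum the counts; treat this as a sketch rather than the book's exact listing:

    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      // Split each input line into tokens and emit (word, 1) per token
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements
      Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      // Sum the 1s emitted for this word across all map outputs
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }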

Testing our application


Testing our application is a relatively simple task.

Let's start by compiling the code. All of the commands that follow should be run from inside the following directory:

/home/pi/hadoop/apps

Inside this apps folder, create a new directory called wordcount_classes; this is where we will store our compiled class files. You can use the following mkdir command:

mkdir wordcount_classes

The tool we use for compiling Java code is javac. This handy command-line utility takes a number of inputs and compiles our Java code, outputting class files we can package and run.

The javac command takes several flags and parameters. The first flag is -classpath. This flag points to the Hadoop core JAR file, which is required at compile time in order to build our application. Following this we include the -d flag. The -d flag sets the destination directory for our class files.

Finally, we reference the Java file we wish to compile, in this case WordCount.java, as follows:

javac -classpath /home...
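The path is truncated above. Assuming the Hadoop core JAR from the previous chapter lives at /home/pi/hadoop/hadoop-core-1.2.1.jar (the exact file name depends on the version you installed), the complete compile and packaging steps would look something like this:

# JAR path and version are assumptions; adjust to your installation
javac -classpath /home/pi/hadoop/hadoop-core-1.2.1.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .

The second command bundles the compiled classes into wordcount.jar, which Hadoop can then ship to each TaskTracker in the cluster.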

Summary


In this chapter we learned about the MapReduce paradigm. We also explored the Hadoop file system and tried out an application that processed a number of text documents to return a word count.

We shall now move on to writing our own MapReduce application for calculating π using a process known as a Monte Carlo simulation.
