Chapter 5. MapReduce Applications with Hadoop and Java

In the previous chapter we set up Hadoop across two Raspberry Pis. In this chapter we will delve into MapReduce, the core paradigm of Hadoop, and run our first MapReduce application.

We will also explore some of the technologies used in setting up a Hadoop cluster, and cover features such as HDFS in more detail.

MapReduce


MapReduce is a programming approach that allows systems to process large datasets in parallel.

The key concept is the use of two functions, Map and Reduce, which are combined to produce a desired result.

MapReduce has its genesis in functional programming, and similar constructs have been available in languages such as LISP for several decades. Google has been a driver in bringing the approach out of the functional programming paradigm and into the OOP (Object-Oriented Programming) world. Its contributions include publishing a seminal paper on the subject in 2004, and being granted a patent on the technology.

So how does MapReduce work? The Map function takes a data set and then operates on the data, returning another data set as an output. This output is then fed to the Reduce function, which subsequently operates on the data set once again and returns a smaller data set as an output.

So let's look at an example of how the Map function operates. The pseudocode function CtoF in the following code takes a list...
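The pseudocode itself is cut off in this preview, but judging by its name, CtoF takes a list of Celsius temperatures and returns the corresponding Fahrenheit values. A minimal Java sketch of such a Map-style function (the class name, method names, and types here are our own assumptions) could look like this:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CtoFExample {
  // Map-style function: applies the same conversion independently to
  // every element of the input list and returns a new list as output.
  static List<Double> ctoF(List<Double> celsius) {
    List<Double> fahrenheit = new ArrayList<Double>();
    for (double c : celsius) {
      fahrenheit.add(c * 9.0 / 5.0 + 32.0);
    }
    return fahrenheit;
  }

  public static void main(String[] args) {
    // 0 C -> 32 F, 100 C -> 212 F
    System.out.println(ctoF(Arrays.asList(0.0, 100.0)));
  }
}

Because each element is converted independently of the others, the work can be split across many nodes; this independence is exactly the property that MapReduce exploits for parallelism.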

MapReduce in Hadoop


In order to use Hadoop to run our MapReduce applications, there are some key terms used in the technology that we should understand.

These are briefly described as follows:

  • NameNode: The NameNode is responsible for keeping the directory tree of all the files stored in the system and tracking where the file data is stored in the cluster.

  • JobTracker: The JobTracker passes out MapReduce tasks to each Raspberry Pi in our cluster.

  • DataNode: The DataNodes in our cluster use HDFS to store replicated data.

  • TaskTracker: A node in our Raspberry Pi cluster that accepts Map and Reduce tasks handed out by the JobTracker.

  • Default configuration: A default configuration file provides the base settings that the site-specific files override or augment. An example is the core-default.xml file.

  • Site-specific configuration: You will be familiar with this from the previous chapter. This configuration contains specifics about our own development environment, such as our Raspberry Pi's IP address; a minimal example is sketched after this list.
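To make the override relationship concrete, the following is a minimal sketch of a site-specific core-site.xml entry that overrides the default file system URI defined in core-default.xml; the host name and port shown are assumptions and should match your own master node:

<configuration>
  <property>
    <!-- Site-specific override of the default file system URI;
         the host name and port here are assumptions -->
    <name>fs.default.name</name>
    <value>hdfs://raspberrypi:54310</value>
  </property>
</configuration>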

A comprehensive guide to Hadoop's terms and...

The WordCount MapReduce program


The following code is a typical example of a MapReduce program for Hadoop. Its purpose is to take a number of input files and return a count of each word located in them.

We will be running this application in order to illustrate how the files we added to HDFS can be processed.
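As a reminder, if your input files are not already on HDFS, they can be copied over with the hadoop command-line tool from the previous chapter; the local and HDFS paths below are hypothetical:

hadoop dfs -copyFromLocal /home/pi/books/license.txt /user/pi/wordcount/input/license.txt
hadoop dfs -ls /user/pi/wordcount/input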

Add the following code into a new file called WordCount.java, created with vim /home/pi/hadoop/apps/WordCount.java:

package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

The preceding code includes the various libraries we will need to use in our application. After this, let's add the WordCount class:

public class WordCount {

  public static class Map extends MapReduceBase implements
  Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private...
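The listing is truncated at this point in the preview. For orientation, in the canonical Hadoop WordCount example (which this code follows, using the old mapred API), the Map class continues along these lines, with a Reduce class to sum the counts; treat this as a sketch rather than the book's exact listing:

    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      // Split each input line into tokens and emit (word, 1) per token
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements
      Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output,
        Reporter reporter) throws IOException {
      // Sum the 1s emitted for this word across all map outputs
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }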

Testing our application


Testing our application is a relatively simple task.

Let's start by compiling the code. All of the commands that follow should be run from inside the following directory:

/home/pi/hadoop/apps

Inside this apps folder, create a new directory called wordcount_classes; this is where we will store our compiled class files. You can use the following mkdir command:

mkdir wordcount_classes

The tool we use for compiling Java code is javac. This handy command-line utility takes a number of inputs and compiles our Java code, outputting class files we can package and run.

The javac command takes several flags and parameters. The first flag is -classpath. This flag points to the Hadoop core JAR file, which is required at compile time in order to build our application. Following this we include the -d flag. The -d flag sets the destination directory for our class files.

Finally, we reference the Java file we wish to compile, in this case WordCount.java, as follows:

javac -classpath /home...
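The path is truncated above. Assuming the Hadoop core JAR from the previous chapter lives at /home/pi/hadoop/hadoop-core-1.2.1.jar (the exact file name depends on the version you installed), the complete compile and packaging steps would look something like this:

# JAR path and version are assumptions; adjust to your installation
javac -classpath /home/pi/hadoop/hadoop-core-1.2.1.jar -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .

The second command bundles the compiled classes into wordcount.jar, which Hadoop can then ship to each TaskTracker in the cluster.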

Summary


In this chapter we learned about the MapReduce paradigm. We also explored the Hadoop file system and tried out an application that processed a number of text documents to return a word count.

We shall now move on to writing our own MapReduce application for calculating π using a process known as a Monte Carlo simulation.
