Writing a word count application with MapReduce and running it (Simple)


The first recipe explained how to implement the word count application without MapReduce and discussed the limitations of that implementation. This recipe explains how to implement the same word count application with MapReduce and describes how it works.

Getting ready

  1. This recipe assumes that you have access to a computer that has the Java Development Kit (JDK) installed and the JAVA_HOME environment variable configured.

  2. Download a Hadoop 1.1.x distribution from the http://hadoop.apache.org/releases.html page.

  3. Unzip the distribution; we will call this directory HADOOP_HOME. Now you can run Hadoop jobs in local mode.

  4. Download the sample code for the book; we will call that directory SAMPLE_DIR. Also download the data files as described in the first recipe; we will call that directory DATA_DIR.

    Tip

    Downloading the example code

    You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

How to do it...

  1. Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

  2. Run the MapReduce job through the following command from HADOOP_HOME:

    $ bin/hadoop jar hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1
    
  3. You can find the results in the output directory, wordcount-output1.

  4. The results will look like the following:

    B00007ELF7=1
    Vincent[412370]=2
    35681=1
    

How it works...

You can find the source code for the recipe at src/microbook/wordcount/WordCount.java.

The word count job accepts an input directory, a mapper function, and a reducer function as inputs. We use the mapper function to process the data in parallel, and we use the reducer function to collect the results of the mappers and produce the final result. The mapper sends its results to the reducer using a key-value model. Let us walk through a MapReduce execution in detail.

The following diagram depicts the MapReduce job execution, and the following code listing shows the mapper and reducer functions:

When you run the MapReduce job, Hadoop first reads the input files from the input directory line by line. It then invokes the mapper once for each line, passing the line as the argument. Each mapper parses the line it received as input and extracts the words in it. For every word it finds, the mapper emits the word together with the count 1 as a key-value pair, and that is how its results are sent to the reducer.

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the line into whitespace-separated tokens.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Emit (word, 1); 'word' and 'one' are fields of the mapper class.
        word.set(itr.nextToken());
        context.write(word, one);
    }
}
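
The preceding snippet refers to two fields, word and one, which are declared on the mapper class itself. As a point of reference, a minimal, self-contained mapper along these lines might look like the following sketch; the class name WordcountMapper matches the one registered in the driver code later in this recipe, but the imports and generic types shown here are an assumption based on the standard Hadoop 1.x API rather than a listing from the book's source:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every whitespace-separated token in each input line.
public class WordcountMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Reusable Writable objects, so new ones are not allocated for every token.
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}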

Hadoop collects all the key-value pairs emitted by the mapper functions and sorts them by key. Here the key is the word and each value is an occurrence count for that word. Hadoop then invokes the reducer once for each key, passing all the values emitted against that key as arguments. The reducer calculates the sum of those values and emits it against the key; for example, the key Vincent[412370] in the sample output above was emitted twice by the mappers, so the reducer receives the values [1, 1] and writes the count 2. Hadoop collects the results from all reducers and writes them to the output file.

public void reduce(Text key, Iterable<IntWritable> values,
                   Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    // Add up all the counts emitted for this word.
    for (IntWritable val : values) {
        sum += val.get();
    }
    // 'result' is an IntWritable field of the reducer class.
    result.set(sum);
    context.write(key, result);
}
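
Again, the result variable used above is a field of the reducer class. A self-contained sketch of that class, under the same assumptions about imports and generic types as the mapper sketch, could look like this; WordcountReducer is the name used in the driver code below:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives a word and all the 1s emitted for it, and writes (word, total count).
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}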

The following code shows the main method that invokes the job. It configures the mapper, the reducer, and the input and output directories. Here, the key and value types produced by the mapper and the reducer must match the types configured with job.setMapOutputKeyClass(..), job.setMapOutputValueClass(..), job.setOutputKeyClass(..), and job.setOutputValueClass(..):

JobConf conf = new JobConf();
String[] otherArgs =
    new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
    System.err.println("Usage: <in> <out>");
    System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
// Wire up the mapper and reducer classes.
job.setMapperClass(WordcountMapper.class);
job.setReducerClass(WordcountReducer.class);
// Types of the intermediate key-value pairs emitted by the mapper.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Types of the final key-value pairs written by the reducer.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// The first argument is the input path; the second is the output directory.
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
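
The driver does not set input and output formats explicitly, so the job falls back to Hadoop's defaults, which read and write plain text. If you prefer to state them explicitly, a sketch like the following, added before the call to waitForCompletion, configures the same behavior:

// Optional: make the default text formats explicit. TextInputFormat passes each
// line of the input to the mapper, and TextOutputFormat writes one key-value
// pair per line of the output files.
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);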

In this model, the map function processes the data in parallel and distributes the intermediate results to the reducers, and the reduce function collects those results into the final output.

There's more...

Since we ran this program against a local Hadoop installation, it runs entirely on a single machine. The next recipe explains how to run it on a distributed Hadoop cluster.
