Writing a formatter (Intermediate)


By default, when you run a MapReduce job, it will read the input file line by line and feed each line into the map function. In most cases, this works well. However, sometimes a single data record spans multiple lines. For example, as explained in the introduction, our dataset has a record format that spans multiple lines. In such cases, writing a MapReduce job that puts those lines back together and processes them is complicated.

The good news is that Hadoop lets you override the way it reads and writes files, giving you control of that step. We can do that by adding a new formatter. This recipe explains how to write one.

You can find the code for the formatter at src/microbook/ItemSalesDataFormat.java. The recipe reads the records from the dataset using the formatter and counts the words in the book titles.

Getting ready

  1. This recipe assumes that you have installed Hadoop and started it. Refer to the Writing a word count application using Java (Simple) and Installing Hadoop in a distributed setup and running a word count application (Simple) recipes for more information. We will use HADOOP_HOME to refer to the Hadoop installation directory.

  2. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing a word count application with MapReduce and running it (Simple) recipe.

  3. Download the sample code for the chapter and copy the data files as described in the Writing a word count application with MapReduce and running it (Simple) recipe.

How to do it...

  1. If you have not already done so, upload the Amazon dataset to HDFS using the following commands:

    >bin/hadoop dfs -mkdir /data/
    >bin/hadoop dfs -mkdir /data/amazon-dataset
    >bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
    >bin/hadoop dfs -ls /data/amazon-dataset
    
  2. Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

  3. Run the MapReduce job through the following command from HADOOP_HOME:

    >bin/hadoop jar hadoop-microbook.jar microbook.format.TitleWordCount /data/amazon-dataset /data/titlewordcount-output
    
  4. You can find the results in the output directory using the following command:

    >bin/hadoop dfs -cat /data/titlewordcount-output/*
    

    You will see that it has counted the words in the book titles.

How it works...

In this recipe, we ran a MapReduce job that uses a custom formatter to parse the dataset. We enabled the formatter by adding the job.setInputFormatClass(ItemSalesDataFormat.class) line to the main program, as shown in the following listing.

JobConf conf = new JobConf();
String[] otherArgs = 
    new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
}

Job job = new Job(conf, "word count");
job.setJarByClass(TitleWordCount.class);
job.setMapperClass(WordcountMapper.class);
job.setReducerClass(WordcountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(ItemSalesDataFormat.class);

FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

The following code listing shows the formatter:

public class ItemSalesDataFormat 
    extends FileInputFormat<Text, Text> {
    private ItemSalesDataReader saleDataReader = null; 

    public RecordReader<Text, Text> createRecordReader(
        InputSplit inputSplit, TaskAttemptContext attempt) 
        throws IOException, InterruptedException {
        saleDataReader = new ItemSalesDataReader();
        saleDataReader.initialize(inputSplit, attempt);
        return saleDataReader;
    }
}

The formatter creates a record reader, and the record reader does the bulk of the real work. When we run the Hadoop job, it finds the formatter, creates a new record reader for each input split, reads records from the record reader, and passes those records to the map tasks.

The following code listing shows the record reader:

public class ItemSalesDataReader 
    extends RecordReader<Text, Text> {

  public void initialize(InputSplit inputSplit, 
      TaskAttemptContext attempt) {
      //open the file 
  }

  public boolean nextKeyValue() {
      //parse the file until end of first record 
  }

  public Text getCurrentKey() { ... }

  public Text getCurrentValue() { ... }

  public float getProgress() { ... }

  public void close() throws IOException {
      //close the file 
  }
}

Hadoop will invoke the initialize(..) method, passing in the input split, and then call the other methods as long as there are records to be read. The implementation reads the next record when nextKeyValue() is invoked and returns the current key and value when the other methods are called.
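
To make those steps concrete, the following is a minimal sketch of such a record reader, assuming each record ends at a blank line. The actual ItemSalesDataReader in the sample code parses the Amazon dataset's own record boundaries, and this sketch also ignores records that straddle split boundaries, so treat it only as an illustration of the contract.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// A sketch only: records are assumed to be separated by blank lines, and
// records that straddle split boundaries are not handled.
public class BlankLineRecordReader extends RecordReader<Text, Text> {
    private LineReader lineReader;
    private long start, end, pos;
    private final Text key = new Text();
    private final Text value = new Text();

    public void initialize(InputSplit inputSplit, TaskAttemptContext attempt)
            throws IOException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = attempt.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream in = fs.open(path);
        start = split.getStart();
        end = start + split.getLength();
        in.seek(start);
        pos = start;
        lineReader = new LineReader(in, conf);
    }

    public boolean nextKeyValue() throws IOException {
        // Accumulate lines until a blank line (assumed record separator) or end of split.
        StringBuilder record = new StringBuilder();
        Text line = new Text();
        while (pos < end) {
            int bytesRead = lineReader.readLine(line);
            if (bytesRead == 0) {
                break;                                    // end of file
            }
            pos += bytesRead;
            if (line.getLength() == 0) {
                if (record.length() > 0) {
                    break;                                // blank line ends the record
                }
                continue;                                 // skip leading blank lines
            }
            record.append(line.toString()).append('\n');
        }
        if (record.length() == 0) {
            return false;                                 // no more records in this split
        }
        key.set(Long.toString(pos));                      // the record's end offset as a simple key
        value.set(record.toString());
        return true;
    }

    public Text getCurrentKey() { return key; }

    public Text getCurrentValue() { return value; }

    public float getProgress() {
        return end == start ? 1.0f : (pos - start) / (float) (end - start);
    }

    public void close() throws IOException {
        lineReader.close();
    }
}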

The mapper and reducer look similar to the versions used in the second recipe, except that the mapper reads the title from each record it receives and uses only the title when counting words. You can find the code for the mapper and reducer in src/microbook/wordcount/TitleWordCount.java.
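
For illustration, a mapper along those lines might look like the following sketch. The title: field label and the line-splitting logic are assumptions made here, not taken from the book's sample code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A sketch only: assumes each value is one multi-line record and that the book
// title appears on a line starting with "title:".
public class TitleWordCountMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String line : value.toString().split("\n")) {
            line = line.trim();
            if (line.startsWith("title:")) {
                // Emit <word, 1> for every word in the title only.
                for (String token : line.substring("title:".length()).trim().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }
    }
}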

There's more...

Hadoop also supports output formatters, which are enabled in a similar manner; an output formatter returns a RecordWriter instead of a RecordReader. You can find more information at http://www.infoq.com/articles/HadoopOutputFormat or in the freely available article from Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing, at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration.
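
To illustrate the idea, a minimal output formatter might look like the following sketch, which writes each key-value pair as a tab-separated line. The class and file extension used here are made up for the example; they are not part of the book's sample code.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A sketch only: a FileOutputFormat that writes "key<TAB>value" lines.
public class TabSeparatedOutputFormat extends FileOutputFormat<Text, IntWritable> {

    public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext attempt)
            throws IOException {
        // Each task attempt writes to its own file under the job output directory.
        Path file = getDefaultWorkFile(attempt, ".txt");
        FileSystem fs = file.getFileSystem(attempt.getConfiguration());
        final FSDataOutputStream out = fs.create(file, false);

        return new RecordWriter<Text, IntWritable>() {
            public void write(Text key, IntWritable value) throws IOException {
                out.writeBytes(key.toString() + "\t" + value.get() + "\n");
            }

            public void close(TaskAttemptContext context) throws IOException {
                out.close();
            }
        };
    }
}

Such a formatter would then be enabled with job.setOutputFormatClass(..) in the driver, just like the input formatter.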

Hadoop has several other input and output formats, such as ComposableInputFormat, CompositeInputFormat, DBInputFormat, DBOutputFormat, IndexUpdateOutputFormat, MapFileOutputFormat, MultipleOutputFormat, MultipleSequenceFileOutputFormat, MultipleTextOutputFormat, NullOutputFormat, SequenceFileAsBinaryOutputFormat, SequenceFileOutputFormat, TeraOutputFormat, and TextOutputFormat. In most cases, you might be able to use one of these instead of writing a new one.
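
For example, to make the job in this recipe store its results as a binary SequenceFile rather than plain text, a single extra line in the driver shown earlier would be enough; no custom formatter is required.

// Uses the built-in SequenceFileOutputFormat from
// org.apache.hadoop.mapreduce.lib.output; no custom code is needed.
job.setOutputFormatClass(SequenceFileOutputFormat.class);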
