Writing a formatter (Intermediate)


By default, when you run a MapReduce job, it will read the input file line by line and feed each line into the map function. In most cases, this works well. However, sometimes a single data record spans multiple lines. For example, as explained in the introduction, our dataset has a record format that spans multiple lines. In such cases, writing a MapReduce job that puts those lines back together and processes them is complicated.

The good news is that Hadoop lets you override the way it reads and writes files, giving you control of that step. We can do that by adding a new formatter. This recipe explains how to write one.

You can find the code for the formatter at src/microbook/ItemSalesDataFormat.java. The recipe reads the records from the dataset using the formatter and counts the words in the book titles.

Getting ready

  1. This recipe assumes that you have installed Hadoop and started it. Refer to the Writing a word count application using Java (Simple) and Installing Hadoop in a distributed setup and running a word count application (Simple) recipes for more information. We will use HADOOP_HOME to refer to the Hadoop installation directory.

  2. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing a word count application with MapReduce and running it (Simple) recipe.

  3. Download the sample code for the chapter and copy the data files as described in the Writing a word count application with MapReduce and running it (Simple) recipe.

How to do it...

  1. If you have not already done so, upload the Amazon dataset to HDFS using the following commands:

    >bin/hadoop dfs -mkdir /data/
    >bin/hadoop dfs -mkdir /data/amazon-dataset
    >bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
    >bin/hadoop dfs -ls /data/amazon-dataset
    
  2. Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

  3. Run the MapReduce job through the following command from HADOOP_HOME:

    >bin/hadoop jar hadoop-microbook.jar microbook.format.TitleWordCount /data/amazon-dataset /data/titlewordcount-output
    
  4. You can find the results in the output directory using the following command:

    >bin/hadoop dfs -cat /data/titlewordcount-output/*
    

    You will see that it has counted the words in the book titles.

How it works...

In this recipe, we ran a MapReduce job that uses a custom formatter to parse the dataset. We enabled the formatter by adding the job.setInputFormatClass(ItemSalesDataFormat.class) line to the main program, as shown in the following listing.

JobConf conf = new JobConf();
String[] otherArgs = 
    new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
}

Job job = new Job(conf, "word count");
job.setJarByClass(TitleWordCount.class);
job.setMapperClass(WordcountMapper.class);
job.setReducerClass(WordcountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(ItemSalesDataFormat.class);

FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

The following code listing shows the formatter:

public class ItemSalesDataFormat 
    extends FileInputFormat<Text, Text> {
    private ItemSalesDataReader saleDataReader = null; 

    public RecordReader<Text, Text> createRecordReader(
        InputSplit inputSplit, TaskAttemptContext attempt) 
        throws IOException, InterruptedException {
        saleDataReader = new ItemSalesDataReader();
        saleDataReader.initialize(inputSplit, attempt);
        return saleDataReader;
    }
}

The formatter creates a record reader, and the record reader does the bulk of the real work. When we run the Hadoop job, it finds the formatter, creates a new record reader for each input split, reads records from the record reader, and passes those records to the map tasks.

The following code listing shows the record reader:

public class ItemSalesDataReader 
    extends RecordReader<Text, Text> {

  public void initialize(InputSplit inputSplit, 
      TaskAttemptContext attempt) {
      //open the file 
  }

  public boolean nextKeyValue() {
      //parse the file until end of first record 
  }

  public Text getCurrentKey() { ... }

  public Text getCurrentValue() { ... }

  public float getProgress() { ... }

  public void close() throws IOException {
      //close the file 
  }
}

Hadoop will invoke the initialize(..) method, passing in the input split, and then call the other methods as long as there are records to be read. The implementation reads the next record when nextKeyValue() is invoked and returns the current key and value when the other methods are called.
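
To make those steps concrete, the following is a minimal sketch of such a record reader, assuming each record ends at a blank line. The actual ItemSalesDataReader in the sample code parses the Amazon dataset's own record boundaries, and this sketch also ignores records that straddle split boundaries, so treat it only as an illustration of the contract.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// A sketch only: records are assumed to be separated by blank lines, and
// records that straddle split boundaries are not handled.
public class BlankLineRecordReader extends RecordReader<Text, Text> {
    private LineReader lineReader;
    private long start, end, pos;
    private final Text key = new Text();
    private final Text value = new Text();

    public void initialize(InputSplit inputSplit, TaskAttemptContext attempt)
            throws IOException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = attempt.getConfiguration();
        Path path = split.getPath();
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream in = fs.open(path);
        start = split.getStart();
        end = start + split.getLength();
        in.seek(start);
        pos = start;
        lineReader = new LineReader(in, conf);
    }

    public boolean nextKeyValue() throws IOException {
        // Accumulate lines until a blank line (assumed record separator) or end of split.
        StringBuilder record = new StringBuilder();
        Text line = new Text();
        while (pos < end) {
            int bytesRead = lineReader.readLine(line);
            if (bytesRead == 0) {
                break;                                    // end of file
            }
            pos += bytesRead;
            if (line.getLength() == 0) {
                if (record.length() > 0) {
                    break;                                // blank line ends the record
                }
                continue;                                 // skip leading blank lines
            }
            record.append(line.toString()).append('\n');
        }
        if (record.length() == 0) {
            return false;                                 // no more records in this split
        }
        key.set(Long.toString(pos));                      // the record's end offset as a simple key
        value.set(record.toString());
        return true;
    }

    public Text getCurrentKey() { return key; }

    public Text getCurrentValue() { return value; }

    public float getProgress() {
        return end == start ? 1.0f : (pos - start) / (float) (end - start);
    }

    public void close() throws IOException {
        lineReader.close();
    }
}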

The mapper and reducer look similar to the versions used in the second recipe, except that the mapper reads the title from each record it receives and uses only the title when counting words. You can find the code for the mapper and reducer in src/microbook/wordcount/TitleWordCount.java.
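
For illustration, a mapper along those lines might look like the following sketch. The title: field label and the line-splitting logic are assumptions made here, not taken from the book's sample code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A sketch only: assumes each value is one multi-line record and that the book
// title appears on a line starting with "title:".
public class TitleWordCountMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String line : value.toString().split("\n")) {
            line = line.trim();
            if (line.startsWith("title:")) {
                // Emit <word, 1> for every word in the title only.
                for (String token : line.substring("title:".length()).trim().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }
    }
}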

There's more...

Hadoop also supports output formatters, which are enabled in a similar manner; an output formatter returns a RecordWriter instead of a RecordReader. You can find more information at http://www.infoq.com/articles/HadoopOutputFormat or in the freely available article from Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing, at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration.
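
To illustrate the idea, a minimal output formatter might look like the following sketch, which writes each key-value pair as a tab-separated line. The class and file extension used here are made up for the example; they are not part of the book's sample code.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A sketch only: a FileOutputFormat that writes "key<TAB>value" lines.
public class TabSeparatedOutputFormat extends FileOutputFormat<Text, IntWritable> {

    public RecordWriter<Text, IntWritable> getRecordWriter(TaskAttemptContext attempt)
            throws IOException {
        // Each task attempt writes to its own file under the job output directory.
        Path file = getDefaultWorkFile(attempt, ".txt");
        FileSystem fs = file.getFileSystem(attempt.getConfiguration());
        final FSDataOutputStream out = fs.create(file, false);

        return new RecordWriter<Text, IntWritable>() {
            public void write(Text key, IntWritable value) throws IOException {
                out.writeBytes(key.toString() + "\t" + value.get() + "\n");
            }

            public void close(TaskAttemptContext context) throws IOException {
                out.close();
            }
        };
    }
}

Such a formatter would then be enabled with job.setOutputFormatClass(..) in the driver, just like the input formatter.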

Hadoop has several other input and output formats, such as ComposableInputFormat, CompositeInputFormat, DBInputFormat, DBOutputFormat, IndexUpdateOutputFormat, MapFileOutputFormat, MultipleOutputFormat, MultipleSequenceFileOutputFormat, MultipleTextOutputFormat, NullOutputFormat, SequenceFileAsBinaryOutputFormat, SequenceFileOutputFormat, TeraOutputFormat, and TextOutputFormat. In most cases, you might be able to use one of these instead of writing a new one.
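
For example, to make the job in this recipe store its results as a binary SequenceFile rather than plain text, a single extra line in the driver shown earlier would be enough; no custom formatter is required.

// Uses the built-in SequenceFileOutputFormat from
// org.apache.hadoop.mapreduce.lib.output; no custom code is needed.
job.setOutputFormatClass(SequenceFileOutputFormat.class);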
