Analytics – Drawing a Frequency Distribution with MapReduce (Intermediate)

by Srinath Perera | August 2013 | Open Source

This article by Srinath Perera, the author of Instant MapReduce Patterns – Hadoop Essentials How-to, explains how to use MapReduce to calculate the frequency distribution of the number of items bought by each customer. Then we will use gnuplot, a free and powerful plotting program, to plot the results of the Hadoop job.


Often, we use Hadoop to calculate analytics, which are basic statistics about data. In such cases, we walk through the data using Hadoop and calculate interesting statistics about it. Some common analytics are shown as follows:

  • Calculating statistical properties such as the minimum, maximum, mean, median, standard deviation, and so on of a dataset. A dataset generally has multiple dimensions (for example, when processing HTTP access logs, the name of the web page, the size of the web page, and the access time are a few of the dimensions). We can measure the previously mentioned properties using one or more dimensions. For example, we can group the data into multiple groups and calculate the mean value for each group.
  • Frequency distribution (histogram): Counts the number of occurrences of each item in the dataset, sorts these frequencies, and plots the items on the X axis and their frequencies on the Y axis.
  • Finding a correlation between two dimensions (for example, correlation between access count and the file size of web accesses).
  • Hypothesis testing: To verify or disprove a hypothesis using a given dataset.

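To make the frequency distribution analytic concrete, the following standalone Java sketch counts item occurrences and sorts the counts. This is not the book's Hadoop code; the class name and sample data are made up for illustration, and it shows on a single machine what the two MapReduce jobs in this article do at scale:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LocalFrequency {
    // Count occurrences of each item, then sort the counts in
    // descending order - a single-machine version of the
    // frequency distribution analytic described above.
    public static List<Map.Entry<String, Integer>> frequency(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // "book" (3 occurrences) is printed first
        for (Map.Entry<String, Integer> e :
                frequency(List.of("book", "cd", "book", "dvd", "book"))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In Hadoop, the counting step becomes the first job's reducer, and the sorting step becomes a second job, because a single job only sorts by its map output key.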
However, Hadoop will only generate numbers. Although the numbers contain all the information, we humans are very bad at figuring out overall trends just by looking at them. On the other hand, the human eye is remarkably good at detecting patterns, and plotting the data often yields a deeper understanding of it. Therefore, we often plot the results of Hadoop jobs using a plotting program.

Getting ready

  1. This article assumes that you have access to a computer that has Java installed and the JAVA_HOME variable configured.
  2. Download a Hadoop 1.1.x distribution from the http://hadoop.apache.org/releases.html page.
  3. Unzip the distribution; we will refer to this directory as HADOOP_HOME.
  4. Download the sample code for the article and copy the data files.

How to do it...

  1. If you have not already done so, upload the Amazon dataset to the HDFS filesystem using the following commands:

    >bin/hadoop dfs -mkdir /data/
    >bin/hadoop dfs -mkdir /data/amazon-dataset
    >bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset/
    >bin/hadoop dfs -ls /data/amazon-dataset

  2. Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.
  3. Run the first MapReduce job to calculate the buying frequency. To do that, run the following command from HADOOP_HOME:

    $ bin/hadoop jar hadoop-microbook.jar \
      microbook.frequency.BuyingFrequencyAnalyzer \
      /data/amazon-dataset /data/frequency-output1

  4. Use the following command to run the second MapReduce job to sort the results of the first MapReduce job:

    $ bin/hadoop jar hadoop-microbook.jar \
      microbook.frequency.SimpleResultSorter \
      /data/frequency-output1 /data/frequency-output2

  5. You can find the results in the output directory. Copy them to HADOOP_HOME using the following command:

    $ bin/hadoop dfs -get /data/frequency-output2/part-r-00000 1.data

  6. Copy all the *.plot files from SAMPLE_DIR to HADOOP_HOME.
  7. Generate the plot by running the following command from HADOOP_HOME:

    $ gnuplot buyfreq.plot

  8. It will generate a file called buyfreq.png, which will look like the following figure:

As the figure depicts, a few buyers have bought a very large number of items. The distribution is much steeper than a normal distribution, and often follows what we call a power-law distribution. This is an example of how analytics and plotting the results give us insight into underlying patterns in the dataset.

How it works...

You can find the mapper and reducer code at src/microbook/frequency/BuyingFrequencyAnalyzer.java.

The figure shows the execution of the two MapReduce jobs, and the following code listing shows the map function and the reduce function of the first job:

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  List<BuyerRecord> records =
      BuyerRecord.parseAItemLine(value.toString());
  for (BuyerRecord record : records) {
    context.write(new Text(record.customerID),
        new IntWritable(record.itemsBrought.size()));
  }
}

public void reduce(Text key, Iterable<IntWritable> values,
    Context context) throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  // result is an IntWritable field of the reducer class
  result.set(sum);
  context.write(key, result);
}

As shown by the figure, Hadoop reads the input file from the input folder and parses records using the custom formatter we introduced in the Writing a formatter (Intermediate) article. It invokes the mapper once per record, passing the record as input.

The mapper extracts the customer ID and the number of items the customer has bought, and emits the customer ID as the key and the number of items as the value.

Then, Hadoop sorts the key-value pairs by key and invokes the reducer once for each key, passing all values for that key as input. The reducer sums up the item counts for each customer ID and emits the customer ID as the key and the total count as the value.

Then the second job sorts the results. It reads the output of the first job and passes each line to the map function. The map function extracts the customer ID and the number of items from the line, and emits the number of items as the key and the customer ID as the value. Hadoop sorts the key-value pairs by key, thus sorting them by the number of items, and invokes the reducer once per key in that order. The reducer therefore writes the records out in that order, essentially sorting the dataset.
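The key-value swap that drives this sort can be sketched as plain Java. This is not the actual SimpleResultSorter source; it assumes, for illustration, that each line of the first job's output has the form customerID<tab>count:

```java
import java.util.AbstractMap;
import java.util.Map;

public class SwapSketch {
    // Mimics the sorter job's map function: parse one line of the
    // first job's output and swap key and value, so that Hadoop's
    // shuffle phase sorts the records by item count.
    public static Map.Entry<Integer, String> swap(String line) {
        String[] parts = line.split("\t");
        String customerId = parts[0];
        int itemCount = Integer.parseInt(parts[1]);
        // Emit the count as the key and the customer ID as the value
        return new AbstractMap.SimpleEntry<>(itemCount, customerId);
    }

    public static void main(String[] args) {
        // Hypothetical customer ID, for illustration only
        Map.Entry<Integer, String> e = swap("AB12345\t42");
        System.out.println(e.getKey() + " -> " + e.getValue()); // 42 -> AB12345
    }
}
```

Because Hadoop only sorts map output by key, putting the count in the key position is what makes the framework do the sorting for us.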

Since we have generated the results, let us look at the plotting. You can find the source for the gnuplot file from buyfreq.plot. The source for the plot will look like the following:

set terminal png
set output "buyfreq.png"
set title "Frequency Distribution of Items brought by Buyer";
set ylabel "Number of Items Brought";
set xlabel "Buyers Sorted by Items count";
set key left top
set log y
set log x
plot "1.data" using 2 title "Frequency" with linespoints

Here, the first two lines define the output format. This example uses png, but gnuplot supports many other terminals, such as pdf and eps. The next four lines define the title, the axis labels, and the key position, and the two set log lines give each axis a log scale; this plot uses a log scale on both axes.

The last line defines the plot itself. It asks gnuplot to read the data from the 1.data file, to take the values from the second column of the file (via using 2), and to draw them with linespoints. Columns must be separated by whitespace.

If you want to plot one column against another, for example data from column 1 against column 2, you should write using 1:2 instead of using 2.

There's more...

We can use a similar method to calculate most types of analytics and plot the results. Refer to the freely available article from Hadoop MapReduce Cookbook, Srinath Perera and Thilina Gunarathne, Packt Publishing, at http://www.packtpub.com/article/advanced-hadoop-mapreduce-administration for more information.

Summary

In this article, we have learned how to process Amazon data with MapReduce, generate data for a histogram, and plot it using gnuplot.

About the Author :


Srinath Perera

Srinath Perera is a senior software architect at WSO2 Inc., where he oversees the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at the Lanka Software Foundation and teaches as visiting faculty at the Department of Computer Science and Engineering, University of Moratuwa. He is a co-founder of the Apache Axis2 open source project, has been involved with the Apache Web Services project since 2002, and is a member of the Apache Software Foundation and the Apache Web Services project PMC. He is also a committer on the Apache open source projects Axis, Axis2, and Geronimo.

He received his Ph.D. and M.Sc. in Computer Science from Indiana University, Bloomington, USA, and his Bachelor of Science in Computer Science and Engineering from the University of Moratuwa, Sri Lanka.

He has authored many technical and peer-reviewed research articles; more details can be found on his website. He is also a frequent speaker at technical venues.

He has worked with large-scale distributed systems for a long time, and works closely with Big Data technologies such as Hadoop and Cassandra daily. He also teaches a graduate class on parallel programming at the University of Moratuwa, which is primarily based on Hadoop.
