Set operations with MapReduce (Intermediate)


Set operations are a useful tool for understanding a dataset. This recipe explains how to use MapReduce to perform a set operation on a large dataset. The following MapReduce job calculates the set difference between the customers who have bought items with an Amazon sales rank of less than 100 and the most frequent customers, which we calculated in the earlier recipe.

Getting ready

  1. This assumes that you have installed Hadoop and started it. Refer to the Writing a word count application using Java (Simple) and Installing Hadoop in a distributed setup and running a word count application (Simple) recipes for more information. We will use HADOOP_HOME to refer to the Hadoop installation directory.

  2. This recipe assumes you are aware of how Hadoop processing works. If you have not already done so, you should follow the Writing a word count application with MapReduce and running it (Simple) recipe.

  3. Download the sample code for the chapter and download the data files as described in the Writing a word count application with MapReduce and running it (Simple) recipe. Select a subset of the Amazon dataset if you are running this with only a few computers. You can find the smaller dataset in the sample directory.

  4. This sample uses the data created by the earlier recipes. If you have not already run them, please do so first. You can verify that their output exists as shown below.
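
You can check that the earlier recipe's output is in HDFS with the following command; this assumes it was written to /data/frequency-output1, the path used in the steps below:

    > bin/hadoop dfs -ls /data/frequency-output1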

How to do it...

  1. If you have not already done so, upload the Amazon dataset to the HDFS filesystem using the following commands:

    > bin/hadoop dfs -mkdir /data/
    > bin/hadoop dfs -mkdir /data/amazon-dataset
    > bin/hadoop dfs -put <SAMPLE_DIR>/amazon-meta.txt /data/amazon-dataset
    > bin/hadoop dfs -mkdir /data/set-input
    
  2. Copy the output of the earlier recipe to the input directory:

    > bin/hadoop dfs -cp /data/frequency-output1/part-r-00000 /data/set-input/mostFrequentBuyers.data
    
  3. Copy the hadoop-microbook.jar file from SAMPLE_DIR to HADOOP_HOME.

  4. Run the first MapReduce job. To do that, run the following command from HADOOP_HOME:

    $ bin/hadoop jar hadoop-microbook.jar microbook.set.FindCustomersBroughtFirst100Items /data/amazon-dataset /data/set-output1
    
  5. Copy the output of the first MapReduce job to the input directory:

    > bin/hadoop dfs -cp /data/set-output1/part-r-00000 /data/set-input/first100ItemBuyers.data
    
  6. Run the second MapReduce job. To do that, run the following command from HADOOP_HOME:

    $ bin/hadoop jar hadoop-microbook.jar microbook.set.BuyersSetDifference /data/set-input /data/set-output2
    
  7. You can find the results in the output directory, /data/set-output2, as shown below.
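
You can inspect the final results with the HDFS cat command; based on the reducer's output, each line should contain a customer ID followed by a tab and the value 1:

    > bin/hadoop dfs -cat /data/set-output2/part-r-00000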

How it works...

You can find the mapper and reducer code at src/microbook/BuyersSetDifference.java.

We define the set difference between two sets S1 and S2, written as S1-S2, as the items that are in S1 but not in S2. For example, if S1 = {a, b, c} and S2 = {b, c}, then S1-S2 = {a}.

To perform the set difference, we label each element at the mapper with the set it came from. We then send each labeled element to a reducer, which emits an item only if it is in the first set but not in the second. The following code listing shows the map function and the reduce function.

Let us look at the execution in detail.

Here we put the files for both sets into the same input directory. Hadoop reads the input files from that directory and reads the records from each file. It invokes the mapper once for each record, passing the record as input.

When the mapper receives an input, it finds out which set the line belongs to by getting the filename via the InputSplit available through the Hadoop context. It then emits the element of the set as the key and the set label (1 or 2) as the value.

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // The input directory holds files from both sets; the filename
    // tells us which set the current record belongs to.
    String currentFile = ((FileSplit) context.getInputSplit()).getPath().getName();

    Matcher matcher = parsingPattern.matcher(value.toString());
    if (matcher.find()) {
        String propName = matcher.group(1);
        String propValue = matcher.group(2);
        if (currentFile.equals("first100ItemBuyers.data")) {
            // Element of the first set: label it with 1
            context.write(new Text(propName), new IntWritable(1));
        } else if (currentFile.equals("mostFrequentBuyers.data")) {
            // Element of the second set: include only buyers with a
            // count greater than 100, labeled with 2
            int count = Integer.parseInt(propValue);
            if (count > 100) {
                context.write(new Text(propName), new IntWritable(2));
            }
        } else {
            throw new IOException("Unexpected file " + currentFile);
        }
    } else {
        System.out.println(currentFile + ": Unprocessed line " + value);
    }
}

Hadoop sorts the key-value pairs by key and invokes the reducer once for each unique key, passing the list of values as the second argument. The reducer inspects the list of values, which contains the labels of the sets each value came from, and emits the key only if it appears in the first set but not in the second.

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Track which sets this key was seen in (1 = first set, 2 = second set)
    boolean has1 = false;
    boolean has2 = false;
    System.out.print(key + "=");   // debug output
    for (IntWritable val : values) {
        switch (val.get()) {
            case 1:
                has1 = true;
                break;
            case 2:
                has2 = true;
                break;
        }
        System.out.println(val);   // debug output
    }
    // Emit the key only if it is in the first set but not in the second
    if (has1 && !has2) {
        context.write(key, new IntWritable(1));
    }
}
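
Although the recipe runs the job through the prebuilt hadoop-microbook.jar, it may help to see how such a job is wired together. The following is a minimal driver sketch only; the recipe does not show the actual driver, so the job name and the mapper and reducer class names (AMapper, AReducer) are illustrative assumptions.

// Required imports (Hadoop 1.x-era API, matching the book's vintage):
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Job,
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // AMapper and AReducer are hypothetical names standing in for the
    // mapper and reducer classes inside BuyersSetDifference
    Job job = new Job(conf, "buyers-set-difference");
    job.setJarByClass(BuyersSetDifference.class);
    job.setMapperClass(AMapper.class);
    job.setReducerClass(AReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] is the input directory (for example, /data/set-input) and
    // args[1] is the output directory (for example, /data/set-output2)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}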

There's more...

We can use MapReduce to implement most set operations by labeling the elements with the sets they came from, using a similar method, and changing the reducer logic to emit only the relevant elements. For example, we can implement set intersection by changing the reducer to emit only the elements whose value list contains the labels of both sets.
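
For instance, a set-intersection reducer using the same labels (1 for the first set, 2 for the second) could look like the following sketch; this variant is not part of the sample code:

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    boolean has1 = false;
    boolean has2 = false;
    for (IntWritable val : values) {
        if (val.get() == 1) {
            has1 = true;
        } else if (val.get() == 2) {
            has2 = true;
        }
    }
    // Emit the key only if it appears in both sets
    if (has1 && has2) {
        context.write(key, new IntWritable(1));
    }
}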
