You're reading from Optimizing Hadoop for MapReduce

Product type: Book

Published in: Feb 2014

Publisher

ISBN-13: 9781783285655

Edition: 1st Edition

Tools

Hadoop

Concepts

Data Processing

Author (1)

Khaled Tannir

The MapReduce model

MapReduce is a programming model for processing and generating large datasets, typically unstructured data, on large clusters of commodity hardware. It can process many terabytes of data across thousands of computing nodes in a cluster while handling failures, duplicating tasks, and aggregating results.

The MapReduce model is simple to understand. It was designed in the early 2000s by engineers at Google Research (http://research.google.com/archive/mapreduce.html). It consists of two functions, a map function and a reduce function, both of which can be executed in parallel on multiple machines.

To use MapReduce, the programmer writes a user-defined map function and a user-defined reduce function that together express the desired computation. The map function reads a key/value pair, applies the user-specific code, and produces intermediate results. These intermediate results are then aggregated by the user-specific reduce code, which outputs the final results.

The input to a MapReduce application is organized into records according to the input specification. Each record yields a key/value pair, written as <k1, v1>.

Therefore, the MapReduce process consists of two main phases:

map(): The user-defined map function is applied to the input records one by one, and for each record it outputs a list of zero or more intermediate key/value pairs, that is, <k2, v2> records. All <k2, v2> records are then collected and reorganized so that records with the same key (k2) are put together into a single <k2, list(v2)> record.
reduce(): The user-defined reduce function is called once for each distinct key in the map output, that is, once per <k2, list(v2)> record, and for each such record it outputs zero or more <k2, v3> pairs. All <k2, v3> pairs together form the final result.
Tip
The signatures of the map and reduce functions are as follows:
- map(<k1, v1>) -> list(<k2, v2>)
- reduce(<k2, list(v2)>) -> list(<k2, v3>)
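
To make these signatures concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the driver `run_mapreduce` and the example functions `max_map` and `max_reduce` are hypothetical names chosen for illustration, and the grouping step is done in memory rather than across a cluster.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, List, Tuple

Pair = Tuple[Any, Any]  # a generic <key, value> pair (illustrative alias)


def run_mapreduce(
    records: Iterable[Pair],
    map_fn: Callable[[Any, Any], List[Pair]],
    reduce_fn: Callable[[Any, List[Any]], List[Pair]],
) -> List[Pair]:
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # map(<k1, v1>) -> list(<k2, v2>)
    intermediate: List[Pair] = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Collect records with the same k2 into <k2, list(v2)>
    grouped = defaultdict(list)
    for k2, v2 in intermediate:
        grouped[k2].append(v2)

    # reduce(<k2, list(v2)>) -> list(<k2, v3>), called once per distinct key
    output: List[Pair] = []
    for k2 in sorted(grouped):
        output.extend(reduce_fn(k2, grouped[k2]))
    return output


def max_map(line_no: int, line: str) -> List[Pair]:
    # <k1, v1> = <line number, "city,temperature">; emits <city, temperature>
    city, temperature = line.split(",")
    return [(city, int(temperature))]


def max_reduce(city: str, temperatures: List[int]) -> List[Pair]:
    # Emits a single <city, maximum temperature> pair for each key
    return [(city, max(temperatures))]


records = [(0, "paris,21"), (1, "lyon,25"), (2, "paris,24")]
result = run_mapreduce(records, max_map, max_reduce)
print(result)  # [('lyon', 25), ('paris', 24)]
```

Both user functions return lists, which mirrors the "zero or more pairs" wording used for each phase.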

The MapReduce programming model is designed to be independent of storage systems. MapReduce reads key/value pairs from the underlying storage system through a reader. The reader retrieves each record from the storage system and wraps the record into a key/value pair for further processing. Users can add support for a new storage system by implementing a corresponding reader. This storage-independent design is considered to be beneficial for heterogeneous systems since it enables MapReduce to analyze data stored in different storage systems.
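
The reader idea can be sketched as follows. This is an illustration under stated assumptions, not an actual Hadoop interface: `text_reader` and `table_reader` are hypothetical readers for two different storage formats, and `apply_map` is a hypothetical helper that only ever sees key/value pairs.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[Any, Any]


def text_reader(lines: Iterable[str]) -> Iterator[Pair]:
    """Hypothetical reader for line-oriented text: yields <byte offset, line>."""
    offset = 0
    for line in lines:
        yield offset, line
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline separator


def table_reader(rows: Iterable[Dict[str, Any]]) -> Iterator[Pair]:
    """Hypothetical reader for row-oriented storage: yields <row id, row>."""
    for row in rows:
        yield row["id"], row


def apply_map(reader: Iterable[Pair], map_fn: Callable[[Any, Any], List[Pair]]) -> List[Pair]:
    """The map side is unaware of the storage system behind the reader."""
    out: List[Pair] = []
    for key, value in reader:
        out.extend(map_fn(key, value))
    return out


text_pairs = list(text_reader(["hello world", "foo"]))
table_pairs = list(table_reader([{"id": 7, "name": "a"}]))
print(text_pairs)   # [(0, 'hello world'), (12, 'foo')]
print(table_pairs)  # [(7, {'id': 7, 'name': 'a'})]

# The same map function can consume either source unchanged.
key_only = lambda key, value: [(key, 1)]
print(apply_map(text_reader(["hello world", "foo"]), key_only))  # [(0, 1), (12, 1)]
```

Supporting a new storage system in this sketch means writing one more generator that yields pairs; nothing on the map side changes.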

To understand the MapReduce programming model, let's assume you want to count the number of occurrences of each word in a given input file. Translated into a MapReduce job, the word-count job is defined by the following steps:

1. The input data is split into records.
2. Map functions process these records and produce a key/value pair for each word.
3. All key/value pairs output by the map functions are merged together, grouped by key, and sorted.
4. The intermediate results are transmitted to the reduce function, which produces the final output.
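
The four steps above can be sketched in plain Python. This is a single-process illustration, not a Hadoop job; the function names (`split_into_records`, `map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and splitting words on whitespace is a simplifying assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_into_records(text: str) -> List[Tuple[int, str]]:
    # Step 1: split the input into records, here <line number, line>
    return list(enumerate(text.splitlines()))


def map_words(line_no: int, line: str) -> List[Tuple[str, int]]:
    # Step 2: emit a <word, 1> pair for each word in the record
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: List[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    # Step 3: merge all pairs, group them by key, and sort by key
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return sorted(grouped.items())


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    # Step 4: sum the values collected for one word
    return word, sum(counts)


def word_count(text: str) -> List[Tuple[str, int]]:
    pairs: List[Tuple[str, int]] = []
    for line_no, line in split_into_records(text):
        pairs.extend(map_words(line_no, line))
    return [reduce_counts(word, counts) for word, counts in shuffle(pairs)]


text = "the quick fox\nthe lazy dog\nthe end"
result = word_count(text)
print(result)
# [('dog', 1), ('end', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 3)]
```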

The overall steps of this MapReduce application are represented in the following diagram:

While aggregating key/value pairs, a massive amount of I/O and network traffic can be observed. To reduce the network traffic required between the map and reduce steps, the programmer can optionally perform a map-side pre-aggregation by supplying a Combiner function. Combiner functions are similar to the reduce function, except that they are not passed all the values for a given key; instead, a Combiner function emits an output value that summarizes only the input values it was passed.
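
The effect of a Combiner can be sketched with the word-count example. The code below is a single-process simulation, not Hadoop code: `map_words`, `combine`, `reduce_counts`, and `run_job` are hypothetical names, and each inner list in `splits` stands in for the input of one map task. The number of pairs handed to the shuffle is used as a rough proxy for network traffic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def map_words(line: str) -> List[Pair]:
    return [(word, 1) for word in line.split()]


def combine(pairs: List[Pair]) -> List[Pair]:
    # Map-side pre-aggregation: sees only the pairs produced by one map task
    partial: Dict[str, int] = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def reduce_counts(word: str, counts: List[int]) -> Pair:
    return word, sum(counts)


def run_job(splits: List[List[str]], use_combiner: bool) -> Tuple[Dict[str, int], int]:
    shuffled: List[Pair] = []
    for split in splits:  # one iteration simulates one map task
        pairs = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)

    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in shuffled:
        grouped[word].append(count)
    totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
    return totals, len(shuffled)


splits = [["a b a", "a c"], ["b b a"]]
plain_totals, plain_pairs = run_job(splits, use_combiner=False)
combined_totals, combined_pairs = run_job(splits, use_combiner=True)
print(plain_totals == combined_totals)  # True
print(plain_pairs, combined_pairs)      # 8 5
```

Summation works as a Combiner here because it is associative and commutative, so partial sums can be summed again by the reduce function without changing the result. An operation such as an average would need a different intermediate representation, for example a running sum and count.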


Author (1)

Khaled Tannir

Khaled Tannir has been working with computers since 1980. He began programming with the legendary Sinclair ZX81 and later with Commodore home computer products (VIC-20, Commodore 64, Commodore 128D, and Amiga 500). He has a Bachelor's degree in Electronics, a Master's degree in System Information Architectures, in which he graduated with a professional thesis, and completed his education with a Master of Research degree. He is a Microsoft Certified Solution Developer (MCSD) and has more than 20 years of technical experience leading the development and implementation of software solutions and giving technical presentations. He now works as an independent IT consultant and has worked as an infrastructure engineer, senior developer, and enterprise/solution architect for many companies in France and Canada. With significant experience in Microsoft .NET, Microsoft Server Systems, and Oracle Java technologies, he has extensive skills in online/offline applications design, system conversions, and multilingual applications in both domains: Internet and desktop. He is always researching new technologies, learning about them, and looking for new adventures in France, North America, and the Middle East. He owns an IT and electronics laboratory with many servers, monitors, open electronic boards such as Arduino, Netduino, Raspberry Pi, and .NET Gadgeteer, and some smartphone devices based on Windows Phone, Android, and iOS operating systems. In 2012, he contributed to the EGC 2012 (International Complex Data Mining forum at Bordeaux University, France) and presented, in a workshop session, his work on "how to optimize data distribution in a cloud computing environment". This work aims to define an approach to optimize the use of data mining algorithms such as k-means and Apriori in a cloud computing environment. He is the author of RavenDB 2.x Beginner's Guide, Packt Publishing.
He aims to get a PhD in Cloud Computing and Big Data and wants to learn more and more about these technologies. He enjoys taking landscape and nighttime photos, travelling, playing video games, creating funny electronic gadgets with Arduino/.NET Gadgeteer, and of course, spending time with his wife and family. You can reach him at contact@khaledtannir.net.