Packt+ | Advance your knowledge in tech

You're reading from Optimizing Hadoop for MapReduce

Product typeBook

Published inFeb 2014

Publisher

ISBN-139781783285655

Edition1st Edition

Tools

Hadoop

Concepts

Data Processing

Author (1)

Khaled Tannir

Hadoop MapReduce internals

The MapReduce programing model can be used to process many large-scale data problems using one or more steps. Also, it can be efficiently implemented to support problems that deal with large amount of data using a large number of machines. In a Big Data context, the size of data processed may be so large that the data cannot be stored on a single machine.

In a typical Hadoop MapReduce framework, data is divided into blocks and distributed across many nodes in a cluster and the MapReduce framework takes advantage of data locality by shipping computation to data rather than moving data to where it is processed. Most input data blocks for MapReduce applications are located on the local node, so they can be loaded very fast, and reading multiple blocks can be done on multiple nodes in parallel. Therefore, MapReduce can achieve very high aggregate I/O bandwidth and data processing rate.

To launch a MapReduce job, Hadoop creates an instance of the MapReduce application and submits the job to the JobTracker. Then, the job is divided into map tasks (also called mappers) and reduce tasks (also called reducers).

When Hadoop launches a MapReduce job, it splits the input dataset into even-sized data blocks and uses a heartbeat protocol to assign a task. Each data block is then scheduled to one TaskTracker node and is processed by a map task.

Each task is executed on an available slot in a worker node, which is configured with a fixed number of map slots, and another fixed number of reduce slots. If all available slots are occupied, pending tasks must wait until some slots are freed up.

The TaskTracker node periodically sends its state to the JobTracker. When the TaskTracker node is idle, the JobTracker node assigns new tasks to it. The JobTracker node takes data locality into account when it disseminates data blocks. It always tries to assign a local data block to a TaskTracker node. If the attempt fails, the JobTracker node will assign a rack-local or random data block to the TaskTracker node instead.

When all map functions complete execution, the runtime system groups all intermediate pairs and launches a set of reduce tasks to produce the final results. It moves execution from the shuffle phase into the reduce phase. In this final reduce phase, the reduce function is called to process the intermediate data and write the final output.

Users often use terms with different granularities to specify Hadoop map and reduce tasks, subtasks, phases, and subphases. While the map task consists of two subtasks: map and merge, the reduce task consists of just one task. However, the shuffle and sort happen first and are done by the system. Each subtask in turn gets divided into many subphases such as read-map, spill, merge, copy-map, and reduce-write.

You have been reading a chapter from

Optimizing Hadoop for MapReduce

Published in: Feb 2014Publisher: ISBN-13: 9781783285655

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Khaled Tannir

Khaled Tannir has been working with computers since 1980. He began programming with the legendary Sinclair Zx81 and later with Commodore home computer products (Vic 20, Commodore 64, Commodore 128D, and Amiga 500). He has a Bachelor's degree in Electronics, a Master's degree in System Information Architectures, in which he graduated with a professional thesis, and completed his education with a Master of Research degree. He is a Microsoft Certified Solution Developer (MCSD) and has more than 20 years of technical experience leading the development and implementation of software solutions and giving technical presentations. He now works as an independent IT consultant and has worked as an infrastructure engineer, senior developer, and enterprise/solution architect for many companies in France and Canada. With significant experience in Microsoft .Net, Microsoft Server Systems, and Oracle Java technologies, he has extensive skills in online/offline applications design, system conversions, and multilingual applications in both domains: Internet and Desktops. He is always researching new technologies, learning about them, and looking for new adventures in France, North America, and the Middle-east. He owns an IT and electronics laboratory with many servers, monitors, open electronic boards such as Arduino, Netduino, RaspBerry Pi, and .Net Gadgeteer, and some smartphone devices based on Windows Phone, Android, and iOS operating systems. In 2012, he contributed to the EGC 2012 (International Complex Data Mining forum at Bordeaux University, France) and presented, in a workshop session, his work on "how to optimize data distribution in a cloud computing environment". This work aims to define an approach to optimize the use of data mining algorithms such as k-means and Apriori in a cloud computing environment. He is the author of RavenDB 2.x Beginner's Guide, Packt Publishing. He aims to get a PhD in Cloud Computing and Big Data and wants to learn more and more about these technologies. He enjoys taking landscape and night time photos, travelling, playing video games, creating funny electronic gadgets with Arduino/.Net Gadgeteer, and of course, spending time with his wife and family. You can reach him at contact@khaledtannir.net.
Read more about Khaled Tannir

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages