Mastering Hadoop 3

Product type: Book
Published: February 2019
Publisher: Packt
ISBN-13: 9781788620444
Pages: 544
Edition: 1st
Authors: Chanchal Singh, Manish Kumar

Table of Contents (23 chapters)

Title Page
Dedication
About Packt
Foreword
Contributors
Preface
1. Journey to Hadoop 3
2. Deep Dive into the Hadoop Distributed File System
3. YARN Resource Management in Hadoop
4. Internals of MapReduce
5. SQL on Hadoop
6. Real-Time Processing Engines
7. Widely Used Hadoop Ecosystem Components
8. Designing Applications in Hadoop
9. Real-Time Stream Processing in Hadoop
10. Machine Learning in Hadoop
11. Hadoop in the Cloud
12. Hadoop Cluster Profiling
13. Who Can Do What in Hadoop
14. Network and Data Security
15. Monitoring Hadoop
Other Books You May Enjoy
Index

Chapter 4. Internals of MapReduce

The previous chapter was all about managing resources on a Hadoop cluster: we went through the details of the YARN architecture and its execution model, along with a few examples. In this chapter, we will look more closely at the MapReduce processing framework and how it has evolved over time. We will simplify how the overall MapReduce processing works and identify the major steps involved in the process. The topics covered in this chapter are as follows:

  • Deep dive into the Hadoop MapReduce framework
  • YARN and MapReduce
  • MapReduce workflow in a Hadoop framework
  • Important MapReduce parameters
  • Common MapReduce patterns
  • MapReduce examples in our use case
  • Optimizing MapReduce
  • MapReduce command reference

Technical requirements


You will need a working Hadoop 3.0 installation.

The code files for this chapter can be found on GitHub at https://github.com/PacktPublishing/Mastering-Hadoop-3/tree/master/Chapter04.

Check out the following video to see the code in action: http://bit.ly/2PY4oP

Deep dive into the Hadoop MapReduce framework


The story of Hadoop started with HDFS and MapReduce. Hadoop version 1 provided the basic features for storing and processing data on a distributed platform, and it has evolved a lot since then. Hadoop version 2 introduced major changes, such as NameNode high availability and a new resource management framework called YARN. However, the high-level flow of MapReduce processing did not change, despite various changes to its API.

MapReduce consists of two major steps, map and reduce, plus multiple minor steps that form the flow from map tasks to reduce tasks. Mappers are responsible for the map tasks, while reducers are responsible for the reduce tasks. The job of a mapper is to process the blocks stored on HDFS, the distributed storage system. Let's look at the following MapReduce flow diagram:

The processing flow is as follows:

  • InputFileFormat: The MapReduce process starts with reading the file stored on HDFS...
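
To keep the two major steps concrete while reading through this flow, here is a minimal word-count-style mapper and reducer pair. This is our own illustrative sketch, not code from the book's use case; the class names are placeholders:

// WordCountMapper.java: the map step tokenizes each line of its input
// split and emits (word, 1) pairs.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // handed to the shuffle/sort phase
            }
        }
    }
}

// WordCountReducer.java: the reduce step receives all counts for one word
// after the shuffle and sums them.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // one total per word
    }
}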

YARN and MapReduce


We have covered enough about YARN in the previous chapters. In this section, we will discuss the execution of MapReduce over YARN. The JobTracker in Hadoop version 1 was a bottleneck due to its scalability limit of around 4,000 nodes. Yahoo realized that its requirements called for scaling to as many as 20,000 nodes, which was simply not possible with the legacy JobTracker architecture. Yahoo then introduced YARN, which split the JobTracker's responsibilities into separate components for more efficient management. We covered the detailed architecture in Chapter 3, YARN Resource Management in Hadoop.

A node manager in YARN typically has enough memory to launch multiple containers. The application master can request any number of containers from the resource manager, which keeps track of the available resources in the YARN cluster. Job types are not limited to MapReduce; YARN can launch any type of application. Let's take a look at the life cycle of a MapReduce application on YARN...
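
As a minimal, hedged sketch of how a driver routes a job through YARN (the class name and placeholder setup are our own illustration; on a real cluster, mapreduce.framework.name is usually already set in mapred-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmissionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route the job through YARN's resource manager instead of the
        // local job runner.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "example-on-yarn");
        job.setJarByClass(YarnSubmissionExample.class);
        // ... mapper, reducer, and input/output paths as usual ...

        // Submits the application; YARN starts an application master,
        // which then requests containers for the map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}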

MapReduce workflow in the Hadoop framework


MapReduce execution goes through various steps, and each step offers scope for a little optimization. In the previous sections, we covered the components of the MapReduce framework; now we will briefly walk through the MapReduce execution flow, which will help us understand how the components interact with one another. The following diagram gives a brief overview of the MapReduce execution flow. We have divided the diagram into smaller parts so that each step is easier to understand. The step numbers appear on the arrow connectors, and the last arrow in the diagram connects to the next diagram in the section:

The steps of the MapReduce internal flow are explained as follows:

  1. The InputFormat is the starting point of any MapReduce application. It is defined in the job configuration in the application's Driver class, for example, job.setInputFormatClass(TextInputFormat.class); a fuller driver sketch follows this list. The InputFormat helps in understanding...
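
For reference, a driver configured this way might look as follows. The class and path names here are illustrative placeholders, not the book's code; only the job.setInputFormatClass(TextInputFormat.class) call is taken from the step above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job");
        job.setJarByClass(ExampleDriver.class);

        // Step 1: the InputFormat decides how input files are divided into
        // InputSplits and parsed into (key, value) records for the mappers.
        job.setInputFormatClass(TextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // ... mapper, reducer, and output key/value classes are set here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}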

Common MapReduce patterns


Design patterns are solution templates for solving specific problems. Developers can reuse a template for similar problems across domains, saving the time it takes to solve each problem from scratch. If you are a programmer, you have probably used the abstract factory, builder, or observer patterns before. These patterns were discovered by people who had been solving similar problems for many years. The MapReduce framework has now existed for almost a decade. Let's look into a few of the most commonly used MapReduce design patterns across industries.

Summarization patterns

Summarization patterns are widely used across domains. They are all about grouping similar data together and then performing an operation such as calculating a minimum, maximum, count, average, median, or standard deviation, building an index, or simply counting based on a key. For example, we might want to calculate the total amount of money our website has made by country. As another example...
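
As a hedged sketch of that first example (the tab-separated country/amount record layout and class names are our assumptions for illustration), the summarization pattern maps every record to a (country, amount) pair and then sums per key in the reducer:

// RevenueByCountryMapper.java: the group-by step keys every sale by country.
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RevenueByCountryMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed layout: country<TAB>saleAmount; a production job would
        // also guard against malformed numbers.
        String[] fields = line.toString().split("\t");
        if (fields.length == 2) {
            context.write(new Text(fields[0]),
                    new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }
}

// RevenueByCountryReducer.java: the aggregation step sums all amounts
// that arrive for one country after the shuffle.
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RevenueByCountryReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text country, Iterable<DoubleWritable> amounts,
            Context context) throws IOException, InterruptedException {
        double total = 0;
        for (DoubleWritable amount : amounts) {
            total += amount.get();
        }
        context.write(country, new DoubleWritable(total));
    }
}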

MapReduce use case


We will cover a use case to find the top 20 highest-rated movies, with the condition that a movie must have been rated by more than 100 people. The filtering pattern we discussed earlier is a good fit for this use case. The format of the data is as follows:

title        averageRating    numVotes
tt0000001    5.8              1374

The title code refers to a specific movie. Ratings are on a 10-point scale. Let's look into the mapper, reducer, and driver code. The template can also be reused for similar use cases.

MovieRatingMapper

The job of the mapper is to process the records and emit the top 20 records it has seen for its input split. We also filter out movies that have not been rated by more than 100 people. The code is as follows:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class MovieRatingMapper extends
Mapper...
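
The listing above is cut off. A completed sketch of the idea could look like the following; the tab-separated field layout, the NullWritable output key, and the tie-breaking behavior are our assumptions rather than the book's exact code:

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MovieRatingMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Records kept sorted by rating; note that two records with exactly
    // the same rating overwrite one another, which is fine for a sketch.
    private final TreeMap<Double, Text> topTwenty = new TreeMap<>();

    @Override
    protected void map(LongWritable offset, Text record, Context context) {
        // Assumed layout: title<TAB>averageRating<TAB>numVotes
        String[] fields = record.toString().split("\t");
        if (fields.length != 3) {
            return;
        }
        try {
            double rating = Double.parseDouble(fields[1]);
            int votes = Integer.parseInt(fields[2]);
            if (votes <= 100) {
                return; // filter: keep movies rated by more than 100 people
            }
            topTwenty.put(rating, new Text(record)); // copy; Hadoop reuses objects
            if (topTwenty.size() > 20) {
                topTwenty.remove(topTwenty.firstKey()); // evict the lowest rating
            }
        } catch (NumberFormatException e) {
            // skip the header row or malformed records
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit this split's local top 20; a single reducer merges the
        // per-split lists into the global top 20.
        for (Map.Entry<Double, Text> entry :
                topTwenty.descendingMap().entrySet()) {
            context.write(NullWritable.get(), entry.getValue());
        }
    }
}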

Optimizing MapReduce


The MapReduce framework provides a massive advantage for large datasets, because we can add more nodes to get more performance. However, resources such as nodes, memory, and disks require significant investment, so adding nodes should not be the only lever for performance optimization. Sometimes adding more nodes does not improve performance at all, because the bottleneck lies elsewhere, for example in unoptimized code or unnecessary data transfer. In this section, we will discuss some of the best practices for optimizing a MapReduce application.
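
Many of these optimizations ultimately surface as job configuration settings. As a hedged illustration of the kinds of knobs involved (the values below are arbitrary examples, not recommendations), a few commonly tuned properties can be set programmatically or in mapred-site.xml:

import org.apache.hadoop.conf.Configuration;

public class TuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Container memory for map and reduce tasks, in MB.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        // In-memory buffer for sorting map output before spilling, in MB.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Compress intermediate map output to reduce shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Number of reduce tasks for the job.
        conf.setInt("mapreduce.job.reduces", 10);
        // A Job built from this Configuration would pick these values up.
    }
}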

The performance of an application is measured by its overall processing time. MapReduce processes data in parallel, which already gives your application a performance advantage, but careful tuning can improve it further. The following factors play important roles in optimizing MapReduce performance.

Hardware configuration

Hardware setup is the first step in the Hadoop installation...

Summary


In this chapter, we learned about MapReduce processing and how the overall processing works internally. We also looked at how MapReduce jobs are submitted to YARN, and how YARN ensures that your MapReduce job runs efficiently and reports its status to you on successful completion. In the latter half of the chapter, we covered different design patterns that are commonly used in the industry, along with basic templates for using these patterns. In the next chapter, we will look at various Hadoop components that were added for efficient processing over HDFS. We will also look at some of the SQL engines used for data analytics and report generation.
