Mastering Hadoop 3

Product type: Book
Published: February 2019
Publisher: Packt
ISBN-13: 9781788620444
Pages: 544
Edition: 1st
Authors: Chanchal Singh, Manish Kumar

Table of Contents (23 chapters)

Title Page
Dedication
About Packt
Foreword
Contributors
Preface
1. Journey to Hadoop 3
2. Deep Dive into the Hadoop Distributed File System
3. YARN Resource Management in Hadoop
4. Internals of MapReduce
5. SQL on Hadoop
6. Real-Time Processing Engines
7. Widely Used Hadoop Ecosystem Components
8. Designing Applications in Hadoop
9. Real-Time Stream Processing in Hadoop
10. Machine Learning in Hadoop
11. Hadoop in the Cloud
12. Hadoop Cluster Profiling
13. Who Can Do What in Hadoop
14. Network and Data Security
15. Monitoring Hadoop
Other Books You May Enjoy
Index

Chapter 4. Internals of MapReduce

The previous chapter was all about managing resources on a Hadoop cluster: we went through the details of the YARN architecture and its execution model, along with a few examples. In this chapter, we will look more closely at the MapReduce processing framework and how it has evolved over time. We will simplify how the overall MapReduce processing works and identify the major steps involved in the process. The topics covered in this chapter are as follows:

  • Deep dive into the Hadoop MapReduce framework
  • YARN and MapReduce
  • MapReduce workflow in a Hadoop framework
  • Important MapReduce parameters
  • Common MapReduce patterns
  • MapReduce examples in our use case
  • Optimizing MapReduce
  • MapReduce command reference

Technical requirements


You will need a working Hadoop 3.0 installation.

The code files for this chapter can be found on GitHub at https://github.com/PacktPublishing/Mastering-Hadoop-3/tree/master/Chapter04.

Check out the following video to see the code in action: http://bit.ly/2PY4oP

Deep dive into the Hadoop MapReduce framework


The story of Hadoop started with HDFS and MapReduce. Hadoop version 1 provided the basic features for storing and processing data on a distributed platform, and it has evolved a lot since then. Hadoop version 2 introduced major changes, such as NameNode high availability and a new resource management framework called YARN. However, the high-level flow of MapReduce processing did not change, despite various changes to its API.

MapReduce consists of two major steps, map and reduce, plus multiple minor steps that form the flow from map tasks to reduce tasks. Mappers are responsible for the map tasks, while reducers are responsible for the reduce tasks. The job of a mapper is to process the blocks stored on HDFS, the distributed storage system. Let's look at the following MapReduce flow diagram:

The processing flow is as follows:

  • InputFileFormat: The MapReduce process starts with reading the file stored on HDFS...
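
To keep the two major steps concrete while reading through this flow, here is a minimal word-count-style mapper and reducer pair. This is our own illustrative sketch, not code from the book's use case; the class names are placeholders:

// WordCountMapper.java: the map step tokenizes each line of its input
// split and emits (word, 1) pairs.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // handed to the shuffle/sort phase
            }
        }
    }
}

// WordCountReducer.java: the reduce step receives all counts for one word
// after the shuffle and sums them.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum)); // one total per word
    }
}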

YARN and MapReduce


We have covered enough about YARN in the previous chapters. In this section, we will discuss the execution of MapReduce over YARN. The JobTracker in Hadoop version 1 was a bottleneck due to its scalability limit of around 4,000 nodes. Yahoo realized that its requirements called for scaling to as many as 20,000 nodes, which was simply not possible with the legacy JobTracker architecture. Yahoo then introduced YARN, which split the JobTracker's responsibilities into separate components for more efficient management. We covered the detailed architecture in Chapter 3, YARN Resource Management in Hadoop.

A node manager in YARN typically has enough memory to launch multiple containers. The application master can request any number of containers from the resource manager, which keeps track of the available resources in the YARN cluster. Job types are not limited to MapReduce; YARN can launch any type of application. Let's take a look at the life cycle of a MapReduce application on YARN...
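
As a minimal, hedged sketch of how a driver routes a job through YARN (the class name and placeholder setup are our own illustration; on a real cluster, mapreduce.framework.name is usually already set in mapred-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmissionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route the job through YARN's resource manager instead of the
        // local job runner.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "example-on-yarn");
        job.setJarByClass(YarnSubmissionExample.class);
        // ... mapper, reducer, and input/output paths as usual ...

        // Submits the application; YARN starts an application master,
        // which then requests containers for the map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}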

MapReduce workflow in the Hadoop framework


MapReduce execution goes through various steps, and each step offers scope for a little optimization. In the previous sections, we covered the components of the MapReduce framework; now we will briefly walk through the MapReduce execution flow, which will help us understand how the components interact with one another. The following diagram gives a brief overview of the MapReduce execution flow. We have divided the diagram into smaller parts so that each step is easier to understand. The step numbers appear on the arrow connectors, and the last arrow in the diagram connects to the next diagram in the section:

The steps of the MapReduce internal flow are explained as follows:

  1. The InputFormat is the starting point of any MapReduce application. It is defined in the job configuration in the application's Driver class, for example, job.setInputFormatClass(TextInputFormat.class); a fuller driver sketch follows this list. The InputFormat helps in understanding...
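
For reference, a driver configured this way might look as follows. The class and path names here are illustrative placeholders, not the book's code; only the job.setInputFormatClass(TextInputFormat.class) call is taken from the step above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job");
        job.setJarByClass(ExampleDriver.class);

        // Step 1: the InputFormat decides how input files are divided into
        // InputSplits and parsed into (key, value) records for the mappers.
        job.setInputFormatClass(TextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // ... mapper, reducer, and output key/value classes are set here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}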

Common MapReduce patterns


Design patterns are solution templates for solving specific problems. Developers can reuse a template for similar problems across domains, saving the time it takes to solve each problem from scratch. If you are a programmer, you have probably used the abstract factory, builder, or observer patterns before. These patterns were discovered by people who had been solving similar problems for many years. The MapReduce framework has now existed for almost a decade. Let's look into a few of the most commonly used MapReduce design patterns across industries.

Summarization patterns

Summarization patterns are widely used across domains. They are all about grouping similar data together and then performing an operation such as calculating a minimum, maximum, count, average, median, or standard deviation, building an index, or simply counting based on a key. For example, we might want to calculate the total amount of money our website has made by country. As another example...
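
As a hedged sketch of that first example (the tab-separated country/amount record layout and class names are our assumptions for illustration), the summarization pattern maps every record to a (country, amount) pair and then sums per key in the reducer:

// RevenueByCountryMapper.java: the group-by step keys every sale by country.
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RevenueByCountryMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed layout: country<TAB>saleAmount; a production job would
        // also guard against malformed numbers.
        String[] fields = line.toString().split("\t");
        if (fields.length == 2) {
            context.write(new Text(fields[0]),
                    new DoubleWritable(Double.parseDouble(fields[1])));
        }
    }
}

// RevenueByCountryReducer.java: the aggregation step sums all amounts
// that arrive for one country after the shuffle.
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RevenueByCountryReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text country, Iterable<DoubleWritable> amounts,
            Context context) throws IOException, InterruptedException {
        double total = 0;
        for (DoubleWritable amount : amounts) {
            total += amount.get();
        }
        context.write(country, new DoubleWritable(total));
    }
}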

MapReduce use case


We will cover a use case to find the top 20 highest-rated movies, with the condition that a movie must have been rated by more than 100 people. The filtering pattern we discussed earlier is a good fit for this use case. The format of the data is as follows:

title        averageRating    numVotes
tt0000001    5.8              1374

The title code refers to a specific movie. Ratings are on a 10-point scale. Let's look into the mapper, reducer, and driver code. The template can also be reused for similar use cases.

MovieRatingMapper

The job of the mapper is to process the records and emit the top 20 records it has seen for its input split. We also filter out movies that have not been rated by more than 100 people. The code is as follows:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class MovieRatingMapper extends
Mapper...
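
The listing above is cut off. A completed sketch of the idea could look like the following; the tab-separated field layout, the NullWritable output key, and the tie-breaking behavior are our assumptions rather than the book's exact code:

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MovieRatingMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Records kept sorted by rating; note that two records with exactly
    // the same rating overwrite one another, which is fine for a sketch.
    private final TreeMap<Double, Text> topTwenty = new TreeMap<>();

    @Override
    protected void map(LongWritable offset, Text record, Context context) {
        // Assumed layout: title<TAB>averageRating<TAB>numVotes
        String[] fields = record.toString().split("\t");
        if (fields.length != 3) {
            return;
        }
        try {
            double rating = Double.parseDouble(fields[1]);
            int votes = Integer.parseInt(fields[2]);
            if (votes <= 100) {
                return; // filter: keep movies rated by more than 100 people
            }
            topTwenty.put(rating, new Text(record)); // copy; Hadoop reuses objects
            if (topTwenty.size() > 20) {
                topTwenty.remove(topTwenty.firstKey()); // evict the lowest rating
            }
        } catch (NumberFormatException e) {
            // skip the header row or malformed records
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit this split's local top 20; a single reducer merges the
        // per-split lists into the global top 20.
        for (Map.Entry<Double, Text> entry :
                topTwenty.descendingMap().entrySet()) {
            context.write(NullWritable.get(), entry.getValue());
        }
    }
}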

Optimizing MapReduce


The MapReduce framework provides a massive advantage for large datasets, because we can add more nodes to get more performance. However, resources such as nodes, memory, and disks require significant investment, so adding nodes should not be the only lever for performance optimization. Sometimes adding more nodes does not improve performance at all, because the bottleneck lies elsewhere, for example in unoptimized code or unnecessary data transfer. In this section, we will discuss some of the best practices for optimizing a MapReduce application.
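
Many of these optimizations ultimately surface as job configuration settings. As a hedged illustration of the kinds of knobs involved (the values below are arbitrary examples, not recommendations), a few commonly tuned properties can be set programmatically or in mapred-site.xml:

import org.apache.hadoop.conf.Configuration;

public class TuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Container memory for map and reduce tasks, in MB.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        // In-memory buffer for sorting map output before spilling, in MB.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Compress intermediate map output to reduce shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Number of reduce tasks for the job.
        conf.setInt("mapreduce.job.reduces", 10);
        // A Job built from this Configuration would pick these values up.
    }
}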

The performance of an application is measured by its overall processing time. MapReduce processes data in parallel, which already gives your application a performance advantage, but careful tuning can improve it further. The following factors play important roles in optimizing MapReduce performance.

Hardware configuration

Hardware setup is the first step in the Hadoop installation...

Summary


In this chapter, we learned about MapReduce processing and how the overall processing works internally. We also looked at how MapReduce jobs are submitted to YARN, and how YARN ensures that your MapReduce job runs efficiently and reports its status to you on successful completion. In the latter half of the chapter, we covered different design patterns that are commonly used in the industry, along with basic templates for using these patterns. In the next chapter, we will look at various Hadoop components that were added for efficient processing over HDFS. We will also look at some of the SQL engines used for data analytics and report generation.
