Optimizing Hadoop for MapReduce
Author: Khaled Tannir
Published: February 2014 | ISBN-13: 9781783285655 | 120 pages | 1st Edition

Chapter 5. Enhancing Map and Reduce Tasks

The Hadoop framework already includes several built-in counters, such as the number of bytes read and written. These counters are very helpful for learning about the framework's activities and the resources it uses, and they are sent periodically by the worker nodes to the master node.
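As a minimal sketch (assuming the Hadoop 2.x org.apache.hadoop.mapreduce Java API; the class and method names of the helper itself are illustrative), these built-in counters can also be read programmatically once a job has finished:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class CounterReport {
        // Prints a few of the framework's built-in counters after the job completes.
        // The Job instance is assumed to be fully configured elsewhere.
        public static void report(Job job) throws Exception {
            job.waitForCompletion(true);
            Counters counters = job.getCounters();

            long mapInputRecords =
                    counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
            long mapOutputBytes =
                    counters.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
            long spilledRecords =
                    counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
            long shuffleBytes =
                    counters.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();

            System.out.println("Map input records    : " + mapInputRecords);
            System.out.println("Map output bytes     : " + mapOutputBytes);
            System.out.println("Spilled records      : " + spilledRecords);
            System.out.println("Reduce shuffle bytes : " + shuffleBytes);
        }
    }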

In this chapter, we will learn how to enhance both the map and reduce phases, which counters to look at, and which techniques to apply in order to analyze a performance issue. You will then learn how to tune the relevant configuration parameters with appropriate values.

In this chapter, we will cover the following topics:

  • The impact of the block size and input data

  • How to deal with small and unsplittable files

  • Reducing map-side spilling records

  • Improving the Reduce phase

  • Calculating Map and Reduce tasks' throughput

  • Tuning map and reduce parameters

Enhancing map tasks


When you run a MapReduce job, the Hadoop framework executes it in a well-defined sequence of processing phases. Except for the user-defined functions (map, reduce, and combiner), the execution of the other MapReduce phases is generic across different MapReduce jobs. The processing time mainly depends on the amount of data flowing through each phase and on the performance of the underlying Hadoop cluster.

In order to enhance MapReduce performance, you first need to benchmark these different phases by running a set of jobs with different amounts of data per map/reduce task. Running these jobs lets you collect measurements, such as the duration and the amount of data for each phase, and then analyze these measurements (for each of the phases) to derive the platform's scaling functions.
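A minimal benchmarking sketch along these lines, assuming the same Java API (the JobFactory helper is hypothetical and stands in for however you build a configured Job), records the wall-clock duration and a data-volume counter for each run; finer per-phase timings would come from the job history rather than from this loop:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PhaseBenchmark {
        // Hypothetical factory so that each run starts from a fresh, configured Job.
        public interface JobFactory {
            Job create() throws Exception;
        }

        // Runs the same job against several input paths of increasing size and
        // records the elapsed time and the map output volume for each run.
        public static void benchmark(String[] inputPaths, JobFactory factory) throws Exception {
            for (String input : inputPaths) {
                Job job = factory.create();
                FileInputFormat.addInputPath(job, new Path(input));
                FileOutputFormat.setOutputPath(job, new Path(input + "_bench_out"));

                long start = System.currentTimeMillis();
                job.waitForCompletion(false);
                long elapsedMs = System.currentTimeMillis() - start;

                long mapOutputBytes = job.getCounters()
                        .findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();

                System.out.printf("input=%s elapsed=%d ms mapOutputBytes=%d%n",
                        input, elapsedMs, mapOutputBytes);
            }
        }
    }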

To identify map-side bottlenecks, you should outline the five phases of the map task execution flow. The following figure represents the map tasks' execution sequence:

Let us see what each...

Enhancing reduce tasks


Reduce task processing consists of a sequence of three phases. Only the execution of the user-defined reduce function is custom; the duration of each phase depends on the amount of data flowing through it and on the performance of the underlying Hadoop cluster. Profiling each of these phases will help you identify potential bottlenecks and slow data processing. The following figure shows the three major phases of Reduce tasks:

Let's see each phase in some detail:

  • Profiling the Shuffle phase means measuring the time taken to transfer the intermediate data from the map tasks to the reduce tasks and then to merge and sort it. In the shuffle phase, the intermediate data generated by the map phase is fetched. The processing time of this phase depends significantly on the Hadoop configuration parameters and on the amount of intermediate data destined for the reduce task (an illustrative configuration sketch follows this list).

  • In the Reduce phase, each reduce task is assigned a partition of the map output...
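The following sketch illustrates the kind of configuration parameters involved in the shuffle and merge steps, assuming the Hadoop 2.x parameter names; the values shown are placeholders to experiment with, not recommendations, and should be sized from your own measurements:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ShuffleTuning {
        // Returns a Job whose shuffle/merge-related settings have been adjusted.
        public static Job createTunedJob(Configuration conf) throws Exception {
            // Number of parallel threads a reducer uses to fetch map output.
            conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
            // How many spill segments are merged at once during sorting.
            conf.setInt("mapreduce.task.io.sort.factor", 25);
            // Fraction of the reducer heap used to buffer fetched map output.
            conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
            // Start reducers only after this fraction of map tasks has completed.
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);

            return Job.getInstance(conf, "shuffle-tuned-job");
        }
    }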

Tuning map and reduce parameters


Picking the right number of tasks for a job can have a huge impact on Hadoop's performance. In Chapter 4, Identifying Resource Weaknesses, you learned how to configure the number of mappers and reducers correctly. But sizing the number of mappers and reducers correctly is not enough to get the maximum performance out of a MapReduce job. The optimum occurs when every machine in the cluster has something to do at any given time while a job is executing. Remember that the Hadoop framework has more than 180 parameters, and most of them should not be left at their default settings.
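As one hedged illustration (a heuristic from the Hadoop documentation, not this book's own method), the number of reduce tasks is often sized at roughly 0.95 or 1.75 times the number of worker nodes multiplied by the number of reduce slots (or containers) available per node:

    import org.apache.hadoop.mapreduce.Job;

    public class ReducerSizing {
        // 0.95 lets all reducers launch immediately once the maps finish;
        // 1.75 launches a second wave of reducers, which improves load balancing.
        // workerNodes and reduceSlotsPerNode are assumed to be known from your cluster.
        public static void setReducerCount(Job job, int workerNodes, int reduceSlotsPerNode) {
            int reduces = (int) Math.floor(0.95 * workerNodes * reduceSlotsPerNode);
            job.setNumReduceTasks(reduces);
        }
    }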

In this section, we will present other techniques to calculate the number of mappers and reducers. It may be productive to try more than one optimization method, because the aim is to find, for a given job, a configuration that uses all the available resources on your cluster. The outcome of this change is to enable the user to run as many mappers and reducers in parallel as possible to fully...

Summary


In this chapter, we learned how to enhance map-side and reduce-side tasks and introduced some techniques that may help you improve the performance of your MapReduce jobs. We learned how significant the impact of the block size is and how to identify slow map-side performance caused by small and unsplittable files. We also learned about spilled records and how to eliminate spilling by allocating an adequately sized memory buffer.

Then, we moved on and learned how to identify a low-performing job within the Shuffle and Merge steps of the Reduce phase. In the last section, we covered different techniques for calculating the number of mappers and reducers to tune your MapReduce configuration and enhance its performance.

In the next chapter, we will learn more about optimizing MapReduce tasks and take a look at how combiners and intermediate data compression can enhance MapReduce job performance. Keep reading!
