Optimizing Hadoop for MapReduce

  • Optimize your MapReduce job performance
  • Identify your Hadoop cluster’s weaknesses
  • Tune your MapReduce configuration

Book Details

Language : English
Paperback : 120 pages [ 235mm x 191mm ]
Release Date : February 2014
ISBN : 1783285656
ISBN 13 : 9781783285655
Author(s) : Khaled Tannir
Topics and Technologies : All Books, Big Data and Business Intelligence, Web Development, Open Source

Table of Contents

Chapter 1: Understanding Hadoop MapReduce
Chapter 2: An Overview of the Hadoop Parameters
Chapter 3: Detecting System Bottlenecks
Chapter 4: Identifying Resource Weaknesses
Chapter 5: Enhancing Map and Reduce Tasks
Chapter 6: Optimizing MapReduce Tasks
Chapter 7: Best Practices and Recommendations
  • Chapter 2: An Overview of the Hadoop Parameters
    • Investigating the Hadoop parameters
      • The mapred-site.xml configuration file
        • The CPU-related parameters
        • The disk I/O related parameters
        • The memory-related parameters
        • The network-related parameters
      • The hdfs-site.xml configuration file
      • The core-site.xml configuration file
    • Hadoop MapReduce metrics
    • Performance monitoring tools
      • Using Chukwa to monitor Hadoop
      • Using Ganglia to monitor Hadoop
      • Using Nagios to monitor Hadoop
      • Using Apache Ambari to monitor Hadoop
    • Summary
  • Chapter 3: Detecting System Bottlenecks
    • Performance tuning
    • Creating a performance baseline
    • Identifying resource bottlenecks
      • Identifying RAM bottlenecks
      • Identifying CPU bottlenecks
      • Identifying storage bottlenecks
      • Identifying network bandwidth bottlenecks
    • Summary
  • Chapter 4: Identifying Resource Weaknesses
    • Identifying cluster weakness
      • Checking the Hadoop cluster node's health
      • Checking the input data size
      • Checking massive I/O and network traffic
      • Checking for insufficient concurrent tasks
      • Checking for CPU contention
    • Sizing your Hadoop cluster
    • Configuring your cluster correctly
    • Summary
  • Chapter 5: Enhancing Map and Reduce Tasks
    • Enhancing map tasks
      • Input data and block size impact
      • Dealing with small and unsplittable files
      • Reducing spilled records during the Map phase
      • Calculating map tasks' throughput
    • Enhancing reduce tasks
      • Calculating reduce tasks' throughput
      • Improving Reduce execution phase
    • Tuning map and reduce parameters
    • Summary
  • Chapter 7: Best Practices and Recommendations
    • Hardware tuning and OS recommendations
      • The Hadoop cluster checklist
      • The BIOS tuning checklist
      • OS configuration recommendations
    • Hadoop best practices and recommendations
      • Deploying Hadoop
      • Hadoop tuning recommendations
      • Using a MapReduce template class code
    • Summary

Khaled Tannir

Khaled Tannir has been working with computers since 1980. He began programming with the legendary Sinclair ZX81 and later with Commodore home computer products (VIC-20, Commodore 64, Commodore 128D, and Amiga 500). He has a Bachelor's degree in Electronics, a Master's degree in System Information Architectures, in which he graduated with a professional thesis, and completed his education with a Master of Research degree.

He is a Microsoft Certified Solution Developer (MCSD) and has more than 20 years of technical experience leading the development and implementation of software solutions and giving technical presentations. He now works as an independent IT consultant and has worked as an infrastructure engineer, senior developer, and enterprise/solution architect for many companies in France and Canada. With significant experience in Microsoft .NET, Microsoft Server Systems, and Oracle Java technologies, he has extensive skills in online/offline application design, system conversions, and multilingual applications for both the Internet and the desktop.

He is always researching new technologies, learning about them, and looking for new adventures in France, North America, and the Middle East. He owns an IT and electronics laboratory with many servers, monitors, open electronic boards such as Arduino, Netduino, Raspberry Pi, and .NET Gadgeteer, and some smartphone devices based on the Windows Phone, Android, and iOS operating systems.

In 2012, he contributed to EGC 2012 (International Complex Data Mining forum at Bordeaux University, France) and presented, in a workshop session, his work on "how to optimize data distribution in a cloud computing environment". This work aims to define an approach to optimize the use of data mining algorithms such as k-means and Apriori in a cloud computing environment. He is the author of RavenDB 2.x Beginner's Guide, Packt Publishing.

He aims to get a PhD in Cloud Computing and Big Data and wants to learn more and more about these technologies. He enjoys taking landscape and nighttime photos, travelling, playing video games, creating funny electronic gadgets with Arduino/.NET Gadgeteer, and of course, spending time with his wife and family. You can reach him at contact@khaledtannir.net.

What you will learn from this book

  • Learn about the factors that affect MapReduce performance
  • Utilize the Hadoop MapReduce performance counters to identify resource bottlenecks
  • Size your Hadoop cluster’s nodes
  • Set the number of mappers and reducers correctly
  • Optimize mapper and reducer task throughput and code size using compression and Combiners
  • Understand the various tuning properties and best practices to optimize clusters

In Detail

MapReduce is the programming model that the Hadoop MapReduce engine uses to distribute work around a cluster by processing smaller data sets in parallel. It is useful in a wide range of applications, including distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation.
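To make the model concrete, here is a minimal sketch of one of the applications listed above, inverted index construction, written as plain Python rather than against the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `inverted_index`) and the sample documents are illustrative assumptions, and everything runs in a single process, so it shows only the data flow, not the distribution across a cluster.

```python
from collections import defaultdict

# Hypothetical in-process sketch of the map -> shuffle -> reduce data flow.
# This is NOT the Hadoop API; all names here are illustrative.

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per word of an input record."""
    for word in text.lower().split():
        yield word, doc_id

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Collapse the grouped values into a sorted list of distinct documents."""
    return word, sorted(set(doc_ids))

def inverted_index(documents):
    """Run the three phases over a dict of {doc_id: text}."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())

docs = {"d1": "hadoop runs mapreduce", "d2": "mapreduce sorts data"}
index = inverted_index(docs)
```

In a real cluster each call to `map_phase` could run on a different node against a different input split, which is where the parallelism comes from.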

This book introduces you to advanced MapReduce concepts and teaches you everything from identifying the factors that affect MapReduce job performance to tuning the MapReduce configuration. Based on real-world experience, this book will help you to fully utilize your cluster’s node resources to run MapReduce jobs optimally.

This book details the Hadoop MapReduce job performance optimization process. Through a number of clear and practical steps, it will help you to fully utilize your cluster’s node resources.

Starting with how MapReduce works and the factors that affect MapReduce performance, you will be given an overview of Hadoop metrics and several performance monitoring tools. Further on, you will explore performance counters that help you identify resource bottlenecks, check cluster health, and size your Hadoop cluster. You will also learn about optimizing map and reduce tasks by using Combiners and compression.
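As a rough illustration of why a Combiner can help, the sketch below counts words over two input splits and compares how many records would cross the shuffle with and without local aggregation. It is a single-process simulation under assumed names (`map_words`, `combine`, `run_job`) and made-up input, not Hadoop code and not the book's own example.

```python
from collections import Counter

# Hypothetical simulation of a word-count job; not the Hadoop API.

def map_words(split):
    """Emit a (word, 1) pair for every word in every line of a split."""
    return [(word, 1) for line in split for word in line.split()]

def combine(pairs):
    """Pre-aggregate counts on the map side, one record per distinct word."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def run_job(splits, use_combiner):
    """Return the final totals and the number of records sent to the shuffle."""
    shuffled = []
    for split in splits:
        pairs = map_words(split)
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    totals = Counter()
    for word, count in shuffled:
        totals[word] += count
    return dict(totals), len(shuffled)

splits = [["a b a", "a c"], ["b b", "a b"]]
plain_totals, plain_records = run_job(splits, use_combiner=False)
combined_totals, combined_records = run_job(splits, use_combiner=True)
```

Both runs produce the same totals, but the combined run shuffles fewer records. This works here because addition is associative and commutative; a reduce function without those properties cannot be reused as a combiner this way.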

The book ends with best practices and recommendations on how to use your Hadoop cluster optimally.


This book is an example-based tutorial that deals with optimizing Hadoop for MapReduce job performance.

Who this book is for

If you are a Hadoop administrator, developer, MapReduce user, or beginner, this book is the best choice available if you wish to optimize your clusters and applications. Having prior knowledge of creating MapReduce applications is not necessary, but will help you better understand the concepts and snippets of MapReduce class template code.
