You're reading from Optimizing Hadoop for MapReduce

Product type: Book

Published in: Feb 2014

ISBN-13: 9781783285655

Edition: 1st Edition

Tools: Hadoop

Concepts: Data Processing

Author: Khaled Tannir

Chapter 2. An Overview of the Hadoop Parameters

Once you have your Hadoop job running, it is important to know whether your cluster resources are being fully utilized. Fortunately, the Hadoop framework provides several parameters that enable you to tune your job and specify how it will run on the cluster.

Performance tuning involves four main components: CPU utilization, memory occupation, disk I/O, and network traffic. This chapter describes the parameters most relevant to these components, introduces techniques to optimize Hadoop execution, and defines some configuration parameters.

An efficient monitoring tool is essential: it should deliver alerts when a problem is developing or occurs, and provide a visual indication of how the Hadoop cluster is performing and has performed. This chapter introduces Hadoop performance tuning through configuration parameters, as well as several tools for monitoring Hadoop services.

In this chapter, we will cover...

Investigating the Hadoop parameters

As discussed in Chapter 1, Understanding MapReduce, many factors may affect Hadoop MapReduce performance. In general, workload-dependent Hadoop performance optimization efforts have to focus on three major categories: the system hardware, the system software, and the configuration and tuning/optimization of the Hadoop infrastructure components.

It is worth pointing out that Hadoop is classified as a highly scalable solution, but not necessarily as a high-performance cluster solution. Administrators can configure and tune a Hadoop cluster with various configuration options. Performance configuration parameters focus mainly on CPU utilization, memory occupation, disk I/O, and network traffic. Besides the main performance parameters of Hadoop, other system parameters such as inter-rack bandwidth may affect the overall performance of the cluster.

Hadoop can be configured and customized according to the user's needs; the configuration files...
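
As a minimal sketch of how such configuration files are laid out: Hadoop site files are XML documents that hold name/value property pairs. The snippet below parses an illustrative fragment with Python's standard library. The two property names are Hadoop 1.x-era examples, the values are placeholders rather than recommendations, and the helper `load_properties` is a hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the style of mapred-site.xml.
# The values are placeholders chosen for this sketch, not tuning advice.
SITE_XML = """
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return a dict mapping each property name to its value (as strings)."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


properties = load_properties(SITE_XML)
for name, value in sorted(properties.items()):
    print(f"{name} = {value}")
```

Reading the file into a plain dictionary makes it easy to compare the settings of several nodes or to check a value before a job is submitted.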

Hadoop MapReduce metrics

Due to its scale and distributed nature, diagnosing the performance problems of Hadoop programs and monitoring a Hadoop system are inherently difficult. Although the Hadoop system exports many textual metrics and logs, this information may be difficult to interpret and is not fully understood by many application programmers.

Currently, Hadoop reports coarse-grained metrics about the performance of the whole system through logs and the metrics API. Unfortunately, it lacks important per-job and per-task metrics such as disk and network I/O utilization. When multiple jobs run in a Hadoop system, it also lacks metrics that reflect the cluster resource utilization of each task. This makes it difficult for cluster administrators to measure their cluster utilization and set up the correct configuration of Hadoop systems.

Furthermore, logs generated by Hadoop can get excessively large, which makes it extremely difficult to handle them manually and can hardly answer...

Performance monitoring tools

Monitoring basic system resources on Hadoop cluster nodes such as CPU utilization and average disk data transfer rates helps to understand the overall utilization of these hardware resources and identify any bottlenecks while diagnosing performance issues. Monitoring a Hadoop cluster includes monitoring the usage of system resources on cluster nodes along with monitoring the key service metrics. The most commonly monitored resources are I/O bandwidth, number of disk I/O operations per second, average data transfer rate, network latency, and average memory and swap space utilization.
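
To make two of these quantities concrete, the sketch below derives disk I/O operations per second and the average data transfer rate from two cumulative counter snapshots. It is an illustration only: the `DiskSnapshot` structure, the `disk_rates` helper, and the sample numbers are assumptions made for this example and do not come from Hadoop or from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class DiskSnapshot:
    """Cumulative disk counters at one point in time (hypothetical structure)."""
    timestamp_s: float  # seconds since an arbitrary start
    io_ops: int         # total completed I/O operations so far
    bytes_moved: int    # total bytes read and written so far


def disk_rates(before, after):
    """Return (IOPS, bytes per second) for the interval between two snapshots."""
    elapsed = after.timestamp_s - before.timestamp_s
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    iops = (after.io_ops - before.io_ops) / elapsed
    throughput = (after.bytes_moved - before.bytes_moved) / elapsed
    return iops, throughput


# Made-up sample values, ten seconds apart.
first = DiskSnapshot(timestamp_s=0.0, io_ops=1_000, bytes_moved=50_000_000)
second = DiskSnapshot(timestamp_s=10.0, io_ops=3_500, bytes_moved=150_000_000)
iops, throughput = disk_rates(first, second)
print(f"{iops:.0f} IOPS, {throughput / 1e6:.1f} MB/s")
```

Working from differences between cumulative counters, rather than from instantaneous readings, gives an average over the sampling interval, which is usually what a utilization graph displays.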

Hadoop performance monitoring involves collecting performance counter data in order to determine whether the response times of various tasks lie within an acceptable execution time range. The average percentage utilization for MapReduce tasks and HDFS storage capacity over time indicates whether your cluster's resources are used optimally or are underused.
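
The average percentage utilization mentioned above can be sketched as a simple calculation over periodic samples. The function names, the sample values, and the 30% and 90% thresholds below are hypothetical choices made for this example; suitable thresholds depend on the cluster and its workload.

```python
def average_utilization(used_samples, capacity):
    """Average percentage of capacity in use across all samples."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if not used_samples:
        raise ValueError("at least one sample is required")
    return 100.0 * sum(used_samples) / (len(used_samples) * capacity)


def classify(percent, low=30.0, high=90.0):
    """Label a utilization percentage using illustrative thresholds."""
    if percent < low:
        return "underused"
    if percent > high:
        return "saturated"
    return "within range"


# Made-up samples: occupied map slots out of 10 available, taken at intervals.
occupied_slots = [4, 6, 8, 2]
percent = average_utilization(occupied_slots, capacity=10)
label = classify(percent)
print(f"average utilization: {percent:.1f}% ({label})")
```

The same calculation applies to HDFS storage if the samples are used bytes and the capacity is the configured storage size.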

Hadoop offers a substantial...

Summary

In this chapter, we discussed Hadoop MapReduce performance tuning and learned how application developers and cluster administrators can tune Hadoop in order to enhance MapReduce job performance.

We learned about most of the configuration variables related to CPU, disk I/O, memory, and network utilization, and discussed how these variables may affect MapReduce job performance.

Then, we learned about Hadoop metrics and suggested some open source monitoring tools that improve the Hadoop monitoring experience and are useful to Hadoop cluster administrators and application developers.

In the next chapter, we will learn how to identify resource bottlenecks based on performance indicators and also learn about common performance tuning methods.

Author (1)

Khaled Tannir

Khaled Tannir has been working with computers since 1980. He began programming with the legendary Sinclair ZX81 and later with Commodore home computer products (Vic 20, Commodore 64, Commodore 128D, and Amiga 500). He has a Bachelor's degree in Electronics, a Master's degree in System Information Architectures, in which he graduated with a professional thesis, and completed his education with a Master of Research degree. He is a Microsoft Certified Solution Developer (MCSD) and has more than 20 years of technical experience leading the development and implementation of software solutions and giving technical presentations. He now works as an independent IT consultant and has worked as an infrastructure engineer, senior developer, and enterprise/solution architect for many companies in France and Canada. With significant experience in Microsoft .NET, Microsoft Server Systems, and Oracle Java technologies, he has extensive skills in online/offline applications design, system conversions, and multilingual applications in both domains: Internet and Desktops. He is always researching new technologies, learning about them, and looking for new adventures in France, North America, and the Middle East. He owns an IT and electronics laboratory with many servers, monitors, open electronic boards such as Arduino, Netduino, Raspberry Pi, and .NET Gadgeteer, and some smartphone devices based on Windows Phone, Android, and iOS operating systems. In 2012, he contributed to the EGC 2012 (International Complex Data Mining forum at Bordeaux University, France) and presented, in a workshop session, his work on "how to optimize data distribution in a cloud computing environment". This work aims to define an approach to optimize the use of data mining algorithms such as k-means and Apriori in a cloud computing environment. He is the author of RavenDB 2.x Beginner's Guide, Packt Publishing.
He aims to get a PhD in Cloud Computing and Big Data and wants to learn more and more about these technologies. He enjoys taking landscape and nighttime photos, travelling, playing video games, creating funny electronic gadgets with Arduino/.NET Gadgeteer, and of course, spending time with his wife and family. You can reach him at contact@khaledtannir.net.