You're reading from Hadoop 2.x Administration Cookbook

Product type: Book
Published in: May 2017
Publisher: Packt
ISBN-13: 9781787126732
Edition: 1st
Author: Aman Singh

Gurmukh Singh is a seasoned technology professional with over 14 years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in the big data domain for the last 5 years and provides consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo, and has authored Monitoring Hadoop, published by Packt Publishing.

Chapter 8. Performance Tuning

In this chapter, we will cover the following recipes:

  • Tuning the operating system

  • Tuning the disk

  • Tuning the network

  • Tuning HDFS

  • Tuning Namenode

  • Tuning Datanode

  • Configuring YARN for performance

  • Configuring MapReduce for performance

  • Hive performance tuning

  • Benchmarking Hadoop cluster

In this chapter, we will configure a Hadoop cluster with different parameters and see their effect on performance. There is no single way of doing things: a setting that works on one cluster will not necessarily work on another cluster with different hardware or a different workload.

Note

This being a recipe book, we will not cover a lot of theory, but it is recommended that you build a background on the things we are going to do in this chapter, rather than simply changing values.

As stated initially, performance may vary from one system to another, and in many cases it is simply a matter of context. When someone says that the system is slow, what does that mean? Slower than what...

Tuning the operating system


In Hadoop, we mostly use Linux-based operating systems, so the settings we discuss are restricted to Linux-based systems.

The first thing to consider is making sure that the hardware is optimal: the latest drivers for motherboard components, and the right kind of memory modules with matching bus speed. The BIOS settings should be tuned to be optimal: power saving mode disabled, the VT flag enabled, a 64-bit architecture, and the right cabling for disk enclosures (Just a bunch of disks (JBOD)). Use multiple CPUs with at least a quad core per CPU socket, high-bandwidth bonded interface cards, and racks with support for 1U or 2U servers, with top-of-rack switches that can handle the network traffic of a large Hadoop cluster.

The hardware configuration will vary according to the Hadoop component, such as whether it is a Namenode, Datanode, HBase master, or region server, and whether the workload is I/O intensive or CPU intensive. There will always be a race between right...
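Along the same lines, a few operating system settings are routinely applied on Hadoop nodes. The fragment below is an illustrative sketch, not a mandated configuration; validate the paths and values for your distribution:

```
# /etc/sysctl.conf (illustrative addition): keep swapping to a minimum,
# as Hadoop daemons degrade badly once swapped out
vm.swappiness = 1

# Disable transparent huge pages, a known source of high system CPU on
# Hadoop nodes (the sysfs path varies by distribution)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```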

Tuning the disk


In this recipe, we will tune the disk drives to give optimal performance. For I/O-bound workloads such as sorting, indexing, and data movement, disks and the network play an important role and need to be addressed in the right manner.

The workload conditions on a Datanode will be different from that of a Namenode or that of a database running a MySQL metastore. The changes mentioned in the following recipe are valid for all nodes, unless explicitly mentioned otherwise.

Getting ready

To step through the recipe in this section, we need at least one node on which to test and make the configuration changes first; the same changes can then be applied to other nodes in the same category (master nodes or Datanodes). It is recommended to read Chapter 10, Cluster Planning, to get an idea about the cluster layout.

How to do it...

  1. Connect to a node which at a later stage will be used to install Hadoop. We are using the node master1.cyrus.com.

  2. Switch to root user or have sudo privileges.

  3. Make sure that you have different...
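For the data disks themselves, two settings commonly checked are the mount options and the I/O scheduler. The lines below are an illustrative sketch; device and mount-point names are examples only:

```
# /etc/fstab (illustrative): mount data disks with noatime so reads do
# not trigger an extra metadata write
/dev/sdb1  /data1  ext4  defaults,noatime  0 0

# Inspect and set the I/O scheduler for a data disk (device name is an example)
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler
```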

Tuning the network


In this recipe, we will look at tuning the network for better performance. This recipe is very much limited to the operating system parameters and not the optimization of routers or switches.

Getting ready

To step through the recipe in this section, we need at least one node to test and to make the configuration changes, and the same can be applied to all the nodes in the cluster.

How to do it...

  1. Connect to a node which at a later stage will be used to install Hadoop. We are using the node master1.cyrus.com.

  2. Switch to the root user or have sudo privileges.

  3. Edit the /etc/sysctl.conf file to tune parameters which affect the network performance. The parameters shown in the next steps need to be changed in this file.

  4. Change the port range by adding the following line:

    net.ipv4.ip_local_port_range = 1024 65535
  5. Enable TCP socket reuse and recycle by using the following line:

    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_tw_reuse = 1
  6. Tune the SYN backlog queue by adjusting the following values....
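The values below sketch a typical SYN backlog adjustment; they are illustrative starting points, not universal recommendations. Note also that net.ipv4.tcp_tw_recycle is known to break clients behind NAT and was removed entirely in Linux kernel 4.12, so apply it only after checking your kernel and network topology:

```
# /etc/sysctl.conf (illustrative values)
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 4000
```

Run sysctl -p afterwards so the changes take effect without a reboot.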

Tuning HDFS


In the previous few recipes, we tuned the operating system, disks, and network setting for the installation of Hadoop.

In this recipe, we will tune HDFS for the best performance. As stated initially, HDFS read/write performance on a node with slow disks and resource constraints will be lower than on a node with faster disks, CPU, and RAM. Tuning is a layered approach: each layer is tuned in conjunction with the others to arrive at the final result.

Getting ready

To complete the recipe, the user must have a running cluster with HDFS and YARN setup. Users can refer to Chapter 1, Hadoop Architecture and Deployment, for installation details.

The assumption here is that the user is well familiar with HDFS concepts and knows its layout. Please read the Tuning the disk recipe, as HDFS is layered on top of a native EXT4 or XFS filesystem.

How to do it...

  1. Connect to the Namenode master1.cyrus.com and switch to user hadoop.

  2. Edit the file hdfs-site.xml and change the HDFS block size to be...
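In Hadoop 2.x the default block size is 128 MB; larger blocks reduce the number of blocks the Namenode must track for big files, at the cost of coarser parallelism. An illustrative hdfs-site.xml entry for a 256 MB block size looks like this (the value is in bytes and is an example, not a recommendation):

```
<property>
  <name>dfs.blocksize</name>
  <!-- 268435456 bytes = 256 MB -->
  <value>268435456</value>
</property>
```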

Tuning Namenode


In this recipe, we will look at tuning the Namenode by making some important configuration changes. The Namenode is more CPU and memory bound and must run on hardware with multi-core CPUs and enough memory to accommodate the entire namespace.

We will look at parameters only for the Namenode, which in production will come into effect in conjunction with HDFS and Datanode parameters, discussed in this chapter.

Getting ready

To complete the recipe, the user must have a running cluster with HDFS and YARN setup. Users can refer to Chapter 1, Hadoop Architecture and Deployment, for installation details.

The assumption here is that the user is well familiar with Namenode functionality and can edit and restart services for changes to be effective.

Note

It is recommended that users explore the load characteristics of Namenode and understand its memory usage, thread count, and GC cycle.

How to do it...

  1. Connect to the master node master1.cyrus.com and switch to the hadoop user.

    The first thing to make...
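A widely used heuristic for sizing dfs.namenode.handler.count is roughly 20 times the natural logarithm of the number of Datanodes. The sketch below applies that heuristic; the node count is an example:

```shell
# Heuristic: dfs.namenode.handler.count ~ 20 * ln(number of Datanodes),
# with a floor of 10 for small clusters
nodes=100
handlers=$(awk -v n="$nodes" 'BEGIN { h = int(20 * log(n)); if (h < 10) h = 10; print h }')
echo "dfs.namenode.handler.count = $handlers"
```

For a 100-node cluster this suggests a handler count of around 92; treat the result as a starting point to refine against observed RPC queue times.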

Tuning Datanode


In this recipe, we will look at tuning the Datanode by making some important configuration changes. Datanodes are mostly I/O bound, but can have a varied workload when hosting HBase region servers. Both the network and disk throughput must be tuned for optimal performance.

We will look at parameters only for the Datanode, which in production will come into effect in conjunction with HDFS and Namenode parameters, discussed earlier in this chapter.

Getting ready

For this recipe, you will again need a running cluster and have at least the HDFS daemons running in the cluster.

How to do it...

  1. Connect to the master node master1.cyrus.com and switch to user hadoop.

  2. The hdfs-site.xml file will remain the same across the cluster. The Namenode and Datanode daemons each read their respective parameters, ignoring the others.

  3. Tune the Datanode handler count by using the following configuration in the hdfs-site.xml file:

    <property>
    <name>dfs.datanode.handler.count</name>
    <value>40...
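A related parameter often raised together with the handler count is the number of threads a Datanode may use to move block data in and out, which is commonly too low by default for HBase or other I/O-heavy workloads. The value below is illustrative:

```
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
```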

Configuring YARN for performance


Another important component to tune is the YARN framework. Until now, we have concentrated on the HDFS/storage layer, but we need to tune the scheduler and compute layer as well.

In this recipe, we will see which important properties to take care of and how they can be optimized. To get a picture of the YARN layout and to correlate things better, please refer to the following diagram:

Getting ready

Make sure that the user has a running cluster with HDFS and YARN configured. The user must be able to execute HDFS and YARN commands. Please refer to Chapter 1, Hadoop Architecture and Deployment, for Hadoop installation and configuration.

How to do it...

  1. Connect to the Namenode master1.cyrus.com and switch to the hadoop user.

  2. The important file for this recipe is yarn-site.xml and all the parameters in the following steps will be part of it.

  3. The memory on the system after accounting for the operating system, any daemons like Namenode or Datanodes, and HBase regions...
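The memory budgeting described above can be sketched as simple arithmetic: start from total RAM and subtract reservations for the operating system and any co-located daemons. All figures below are examples, not recommendations:

```shell
# Illustrative budget for yarn.nodemanager.resource.memory-mb (GB figures)
total_gb=64        # physical RAM on the node
os_gb=4            # reserved for the operating system
datanode_gb=1      # Datanode daemon heap
nodemanager_gb=1   # NodeManager daemon heap
hbase_gb=0         # HBase region server, if co-located
container_mb=$(( (total_gb - os_gb - datanode_gb - nodemanager_gb - hbase_gb) * 1024 ))
echo "yarn.nodemanager.resource.memory-mb = $container_mb"
```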

Configuring MapReduce for performance


In this recipe, we will touch upon MapReduce parameters and see how we can optimize them.

Getting ready

For this recipe, you will again need a running cluster with HDFS and YARN. Users must have completed the Configuring YARN for performance recipe.

How to do it...

  1. Connect to the master node master1.cyrus.com and switch to the hadoop user.

  2. The file where these changes will be made is mapred-site.xml.

  3. The first thing to adjust is the sort buffer, sized according to the HDFS block size. It must always be greater than the value of dfs.blocksize. This can be configured as follows:

    <property>
      <name>mapreduce.task.io.sort.mb</name>
      <value>200</value>
    </property>
  4. The next value to tune is the number of streams to merge while sorting. This many file handles will be open per mapper:

    <property>
      <name>mapreduce.task.io.sort.factor</name>
      <value>24</value>
    </property>
  5. Another important thing to take...
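The interaction between the two parameters above can be sketched numerically: a mapper spills to disk each time the sort buffer fills to its spill threshold, and mapreduce.task.io.sort.factor bounds how many spill files are merged per pass. All figures below are examples:

```shell
# Rough spill estimate per mapper (illustrative figures)
map_output_mb=1024   # map output produced by one task
sort_mb=200          # mapreduce.task.io.sort.mb
spill_pct=0.8        # mapreduce.map.sort.spill.percent
spills=$(awk -v o="$map_output_mb" -v s="$sort_mb" -v p="$spill_pct" \
  'BEGIN { n = int(o / (s * p)); if (o > n * s * p) n++; if (n < 1) n = 1; print n }')
echo "estimated spills per mapper = $spills"
```

With seven spills and a sort factor of 24, all spill files merge in a single pass; keeping the spill count at or below the sort factor avoids extra merge rounds.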

Hive performance tuning


In this recipe, we will cover Hive tuning by touching upon some important parameters. Hive is a data warehousing solution which runs on top of Hadoop, as discussed in Chapter 7, Data Ingestion and Workflow. Please refer to it for installation and configuration of Hive.

Getting ready

Make sure that the user has a running cluster with Hive installed and configured to run with the ZooKeeper ensemble. Users can refer to Chapter 7, Data Ingestion and Workflow, for configuring Hive.

How to do it...

  1. Connect to the Edge node client1.cyrus.com and switch to the hadoop user.

  2. If you have followed the previous recipes, Hive is installed at /opt/cluster/hive on the Edge node.

  3. The first thing is to tune the JVM heap used when Hive is started by the shell, as shown in the following screenshot; this is set in the hive-env.sh file:

  4. Configure the local Hive scratch space on a separate disk by using the following configuration:

    <property>
    <name>hive.exec.local.scratchdir</name...
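The heap mentioned in step 3 is typically exported in hive-env.sh; the fragment below is illustrative, and the heap size is an example to be adjusted to your queries and available memory:

```
# hive-env.sh (illustrative): heap, in MB, for the Hive client/shell
export HADOOP_HEAPSIZE=2048
```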

Benchmarking Hadoop cluster


It is important to benchmark so as to have a baseline for comparisons after making changes. In this recipe, we will look at some of the benchmarks which can help to profile the changes committed.

Before running any tests for the changed parameters, make sure to enable verbose logging and also enable GC logs for all the components by using -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:${LOG_DIR}/gc-{component}.log-$(date +'%Y%m%d%H%M').

Getting ready

Make sure that the user has a running cluster with HDFS and YARN fully functional in a multi-node cluster.

All these tests must first be run without making any changes to the cluster; then, after optimizing the parameters discussed in the preceding recipes, run the benchmarks again.

How to do it...

Connect to the Edge node client1.cyrus.com or master node and change to the Hadoop user.

All test output will be written to the location /benchmarks on HDFS, under respective test...
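Typical benchmarks shipped with Hadoop include TestDFSIO for raw HDFS throughput and the TeraGen/TeraSort/TeraValidate suite for end-to-end MapReduce performance. The commands below are a sketch; jar file names, versions, and row counts vary by installation:

```
# HDFS write and read throughput (jar version varies)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -write -nrFiles 10 -size 1GB
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -read -nrFiles 10 -size 1GB

# TeraSort suite: generate, sort, validate (row count is an example)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 100000000 /benchmarks/teragen
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort /benchmarks/teragen /benchmarks/terasort
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teravalidate /benchmarks/terasort /benchmarks/teravalidate
```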
