You're reading from Hadoop 2.x Administration Cookbook

Product type: Book
Published in: May 2017
Publisher: Packt
ISBN-13: 9781787126732
Edition: 1st
Author: Aman Singh

Gurmukh Singh is a seasoned technology professional with over 14 years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in the big data domain for the last 5 years and provides consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo, and has authored Monitoring Hadoop, published by Packt Publishing.

Chapter 8. Performance Tuning

In this chapter, we will cover the following recipes:

  • Tuning the operating system

  • Tuning the disk

  • Tuning the network

  • Tuning HDFS

  • Tuning Namenode

  • Tuning Datanode

  • Configuring YARN for performance

  • Configuring MapReduce for performance

  • Hive performance tuning

  • Benchmarking Hadoop cluster

In this chapter, we will configure a Hadoop cluster with different parameters and see their effect on performance. There is no single way of doing things: a setting that works on one cluster will not necessarily work on another cluster with different hardware or a different workload.

Note

This being a recipe book, we will not cover a lot of theory, but it is recommended that you build a background on the things we are going to do in this chapter, rather than simply changing values.

As stated initially, performance may vary from one system to another, and in many cases it is simply a matter of context. When someone says that the system is slow, what does that mean? Slower than what...

Tuning the operating system


In Hadoop, we mostly use Linux-based operating systems, so the settings we discuss are restricted to Linux-based systems.

The first thing to consider is making sure that the hardware is optimal: the latest drivers for motherboard components, and the right kind of memory modules with matching bus speed. The BIOS settings should be tuned to be optimal: power saving mode disabled, the VT flag enabled, a 64-bit architecture, and the right cabling for disk enclosures (Just a bunch of disks (JBOD)). Use multiple CPUs with at least a quad core per CPU socket, high-bandwidth bonded interface cards, and racks with support for 1U or 2U servers, with top-of-rack switches that can handle the network traffic of a large Hadoop cluster.

The hardware configuration will vary according to the Hadoop component, such as whether it is a Namenode, Datanode, HBase master, or region server, and whether the workload is I/O intensive or CPU intensive. There will always be a race between right...
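Along the same lines, a few operating system settings are routinely applied on Hadoop nodes. The fragment below is an illustrative sketch, not a mandated configuration; validate the paths and values for your distribution:

```
# /etc/sysctl.conf (illustrative addition): keep swapping to a minimum,
# as Hadoop daemons degrade badly once swapped out
vm.swappiness = 1

# Disable transparent huge pages, a known source of high system CPU on
# Hadoop nodes (the sysfs path varies by distribution)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```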

Tuning the disk


In this recipe, we will tune the disk drives to give optimal performance. For I/O-bound workloads such as sorting, indexing, and data movement, disks and the network play an important role and need to be addressed in the right manner.

The workload conditions on a Datanode will be different from that of a Namenode or that of a database running a MySQL metastore. The changes mentioned in the following recipe are valid for all nodes, unless explicitly mentioned otherwise.

Getting ready

To step through the recipe in this section, we need at least one node on which to test and make the configuration changes first; the same changes can then be applied to other nodes in the same category (master nodes or Datanodes). It is recommended to read Chapter 10, Cluster Planning, to get an idea about the cluster layout.

How to do it...

  1. Connect to a node which at a later stage will be used to install Hadoop. We are using the node master1.cyrus.com.

  2. Switch to root user or have sudo privileges.

  3. Make sure that you have different...
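For the data disks themselves, two settings commonly checked are the mount options and the I/O scheduler. The lines below are an illustrative sketch; device and mount-point names are examples only:

```
# /etc/fstab (illustrative): mount data disks with noatime so reads do
# not trigger an extra metadata write
/dev/sdb1  /data1  ext4  defaults,noatime  0 0

# Inspect and set the I/O scheduler for a data disk (device name is an example)
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler
```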

Tuning the network


In this recipe, we will look at tuning the network for better performance. This recipe is very much limited to the operating system parameters and not the optimization of routers or switches.

Getting ready

To step through the recipe in this section, we need at least one node to test and to make the configuration changes, and the same can be applied to all the nodes in the cluster.

How to do it...

  1. Connect to a node which at a later stage will be used to install Hadoop. We are using the node master1.cyrus.com.

  2. Switch to the root user or have sudo privileges.

  3. Edit the /etc/sysctl.conf file to tune parameters which affect the network performance. The parameters shown in the next steps need to be changed in this file.

  4. Change the port range by adding the following line:

    net.ipv4.ip_local_port_range = 1024 65535
  5. Enable TCP socket reuse and recycle by using the following line:

    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_tw_reuse = 1
  6. Tune the SYN backlog queue by adjusting the following values....
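The values below sketch a typical SYN backlog adjustment; they are illustrative starting points, not universal recommendations. Note also that net.ipv4.tcp_tw_recycle is known to break clients behind NAT and was removed entirely in Linux kernel 4.12, so apply it only after checking your kernel and network topology:

```
# /etc/sysctl.conf (illustrative values)
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 4000
```

Run sysctl -p afterwards so the changes take effect without a reboot.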

Tuning HDFS


In the previous few recipes, we tuned the operating system, disks, and network setting for the installation of Hadoop.

In this recipe, we will tune HDFS for the best performance. As stated initially, HDFS read/write performance on a node with slow disks and resource constraints will be lower than on a node with faster disks, CPU, and RAM. Tuning is a layered approach: each layer is tuned in conjunction with the others to arrive at the final result.

Getting ready

To complete the recipe, the user must have a running cluster with HDFS and YARN setup. Users can refer to Chapter 1, Hadoop Architecture and Deployment, for installation details.

The assumption here is that the user is well familiar with HDFS concepts and knows its layout. Please read the Tuning the disk recipe, as HDFS is layered on top of a native EXT4 or XFS filesystem.

How to do it...

  1. Connect to the Namenode master1.cyrus.com and switch to user hadoop.

  2. Edit the file hdfs-site.xml and change the HDFS block size to be...
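In Hadoop 2.x the default block size is 128 MB; larger blocks reduce the number of blocks the Namenode must track for big files, at the cost of coarser parallelism. An illustrative hdfs-site.xml entry for a 256 MB block size looks like this (the value is in bytes and is an example, not a recommendation):

```
<property>
  <name>dfs.blocksize</name>
  <!-- 268435456 bytes = 256 MB -->
  <value>268435456</value>
</property>
```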

Tuning Namenode


In this recipe, we will look at tuning the Namenode by making some important configuration changes. The Namenode is more CPU and memory bound and must run on hardware with multi-core CPUs and enough memory to accommodate the entire namespace.

We will look at parameters only for the Namenode, which in production will come into effect in conjunction with HDFS and Datanode parameters, discussed in this chapter.

Getting ready

To complete the recipe, the user must have a running cluster with HDFS and YARN setup. Users can refer to Chapter 1, Hadoop Architecture and Deployment, for installation details.

The assumption here is that the user is well familiar with Namenode functionality and can edit and restart services for changes to be effective.

Note

It is recommended that users explore the load characteristics of Namenode and understand its memory usage, thread count, and GC cycle.

How to do it...

  1. Connect to the master node master1.cyrus.com and switch to the hadoop user.

    The first thing to make...
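A widely used heuristic for sizing dfs.namenode.handler.count is roughly 20 times the natural logarithm of the number of Datanodes. The sketch below applies that heuristic; the node count is an example:

```shell
# Heuristic: dfs.namenode.handler.count ~ 20 * ln(number of Datanodes),
# with a floor of 10 for small clusters
nodes=100
handlers=$(awk -v n="$nodes" 'BEGIN { h = int(20 * log(n)); if (h < 10) h = 10; print h }')
echo "dfs.namenode.handler.count = $handlers"
```

For a 100-node cluster this suggests a handler count of around 92; treat the result as a starting point to refine against observed RPC queue times.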

Tuning Datanode


In this recipe, we will look at tuning the Datanode by making some important configuration changes. Datanodes are mostly I/O bound, but can have a varied workload when hosting HBase region servers. Both the network and disk throughput must be tuned for optimal performance.

We will look at parameters only for the Datanode, which in production will come into effect in conjunction with HDFS and Namenode parameters, discussed earlier in this chapter.

Getting ready

For this recipe, you will again need a running cluster and have at least the HDFS daemons running in the cluster.

How to do it...

  1. Connect to the master node master1.cyrus.com and switch to user hadoop.

  2. The hdfs-site.xml file will remain the same across the cluster. The Namenode and Datanode daemons each read their respective parameters, ignoring the others.

  3. Tune the Datanode handler count by using the following configuration in the hdfs-site.xml file:

    <property>
    <name>dfs.datanode.handler.count</name>
    <value>40...
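A related parameter often raised together with the handler count is the number of threads a Datanode may use to move block data in and out, which is commonly too low by default for HBase or other I/O-heavy workloads. The value below is illustrative:

```
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
```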

Configuring YARN for performance


Another important component to tune is the YARN framework. Until now, we have concentrated on the HDFS/storage layer, but we need to tune the scheduler and compute layer as well.

In this recipe, we will see which important properties to take care of and how they can be optimized. To get a picture of the YARN layout and to correlate things better, please refer to the following diagram:

Getting ready

Make sure that the user has a running cluster with HDFS and YARN configured. The user must be able to execute HDFS and YARN commands. Please refer to Chapter 1, Hadoop Architecture and Deployment, for Hadoop installation and configuration.

How to do it...

  1. Connect to the Namenode master1.cyrus.com and switch to the hadoop user.

  2. The important file for this recipe is yarn-site.xml and all the parameters in the following steps will be part of it.

  3. The memory on the system after accounting for the operating system, any daemons like Namenode or Datanodes, and HBase regions...
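The memory budgeting described above can be sketched as simple arithmetic: start from total RAM and subtract reservations for the operating system and any co-located daemons. All figures below are examples, not recommendations:

```shell
# Illustrative budget for yarn.nodemanager.resource.memory-mb (GB figures)
total_gb=64        # physical RAM on the node
os_gb=4            # reserved for the operating system
datanode_gb=1      # Datanode daemon heap
nodemanager_gb=1   # NodeManager daemon heap
hbase_gb=0         # HBase region server, if co-located
container_mb=$(( (total_gb - os_gb - datanode_gb - nodemanager_gb - hbase_gb) * 1024 ))
echo "yarn.nodemanager.resource.memory-mb = $container_mb"
```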

Configuring MapReduce for performance


In this recipe, we will touch upon MapReduce parameters and see how we can optimize them.

Getting ready

For this recipe, you will again need a running cluster with HDFS and YARN. Users must have completed the Configuring YARN for performance recipe.

How to do it...

  1. Connect to the master node master1.cyrus.com and switch to the hadoop user.

  2. The file where these changes will be made is mapred-site.xml.

  3. The first thing to adjust is the sort buffer, sized according to the HDFS block size. It must always be greater than the value of dfs.blocksize. This can be configured as follows:

    <property>
      <name>mapreduce.task.io.sort.mb</name>
      <value>200</value>
    </property>
  4. The next value to tune is the number of streams to merge while sorting. This many file handles will be open per mapper:

    <property>
      <name>mapreduce.task.io.sort.factor</name>
      <value>24</value>
    </property>
  5. Another important thing to take...
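The interaction between the two parameters above can be sketched numerically: a mapper spills to disk each time the sort buffer fills to its spill threshold, and mapreduce.task.io.sort.factor bounds how many spill files are merged per pass. All figures below are examples:

```shell
# Rough spill estimate per mapper (illustrative figures)
map_output_mb=1024   # map output produced by one task
sort_mb=200          # mapreduce.task.io.sort.mb
spill_pct=0.8        # mapreduce.map.sort.spill.percent
spills=$(awk -v o="$map_output_mb" -v s="$sort_mb" -v p="$spill_pct" \
  'BEGIN { n = int(o / (s * p)); if (o > n * s * p) n++; if (n < 1) n = 1; print n }')
echo "estimated spills per mapper = $spills"
```

With seven spills and a sort factor of 24, all spill files merge in a single pass; keeping the spill count at or below the sort factor avoids extra merge rounds.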

Hive performance tuning


In this recipe, we will cover Hive tuning by touching upon some important parameters. Hive is a data warehousing solution which runs on top of Hadoop, as discussed in Chapter 7, Data Ingestion and Workflow. Please refer to it for installation and configuration of Hive.

Getting ready

Make sure that the user has a running cluster with Hive installed and configured to run with the ZooKeeper ensemble. Users can refer to Chapter 7, Data Ingestion and Workflow, for configuring Hive.

How to do it...

  1. Connect to the Edge node client1.cyrus.com and switch to the hadoop user.

  2. If you have followed the previous recipes, Hive is installed at /opt/cluster/hive on the Edge node.

  3. The first thing is to tune the JVM heap used when Hive is started by the shell, as shown in the following screenshot; this is set in the hive-env.sh file:

  4. Configure the local Hive scratch space on a separate disk by using the following configuration:

    <property>
    <name>hive.exec.local.scratchdir</name...
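The heap mentioned in step 3 is typically exported in hive-env.sh; the fragment below is illustrative, and the heap size is an example to be adjusted to your queries and available memory:

```
# hive-env.sh (illustrative): heap, in MB, for the Hive client/shell
export HADOOP_HEAPSIZE=2048
```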

Benchmarking Hadoop cluster


It is important to benchmark so as to have a baseline for comparisons after making changes. In this recipe, we will look at some of the benchmarks which can help to profile the changes committed.

Before running any tests for the changed parameters, make sure to enable verbose logging and also enable GC logs for all the components by using -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:${LOG_DIR}/gc-{component}.log-$(date +'%Y%m%d%H%M').

Getting ready

Make sure that the user has a running cluster with HDFS and YARN fully functional in a multi-node cluster.

All these tests must first be run without making any changes to the cluster; then, after optimizing the parameters discussed in the preceding recipes, run the benchmarks again.

How to do it...

Connect to the Edge node client1.cyrus.com or master node and change to the Hadoop user.

All test output will be written to the location /benchmarks on HDFS, under respective test...
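Typical benchmarks shipped with Hadoop include TestDFSIO for raw HDFS throughput and the TeraGen/TeraSort/TeraValidate suite for end-to-end MapReduce performance. The commands below are a sketch; jar file names, versions, and row counts vary by installation:

```
# HDFS write and read throughput (jar version varies)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -write -nrFiles 10 -size 1GB
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -read -nrFiles 10 -size 1GB

# TeraSort suite: generate, sort, validate (row count is an example)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 100000000 /benchmarks/teragen
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort /benchmarks/teragen /benchmarks/terasort
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teravalidate /benchmarks/terasort /benchmarks/teravalidate
```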
