HBase Administration Cookbook

By Yifeng Jiang
Published in August 2012 by Packt
ISBN-13: 9781849517140 | 332 pages | 1st Edition

Table of Contents

HBase Administration Cookbook
Credits
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Preface
1. Setting Up HBase Cluster
2. Data Migration
3. Using Administration Tools
4. Backing Up and Restoring HBase Data
5. Monitoring and Diagnosis
6. Maintenance and Security
7. Troubleshooting
8. Basic Performance Tuning
9. Advanced Configurations and Tuning

Chapter 8. Basic Performance Tuning

In this chapter, we will cover:

  • Setting up Hadoop to spread disk I/O

  • Using a network topology script to make the Hadoop rack-aware

  • Mounting disks with noatime and nodiratime

  • Setting vm.swappiness to 0 to avoid swap

  • Java GC and HBase heap settings

  • Using compression

  • Managing compactions

  • Managing a region split

Introduction


Performance is one of the most interesting characteristics of an HBase cluster's behavior. Tuning it is challenging for administrators, because it requires a deep understanding not only of HBase, but also of Hadoop, Java Virtual Machine Garbage Collection (JVM GC), and the relevant operating system tuning parameters.

The structure of a typical HBase cluster is shown in the following diagram:

There are several components in the cluster: the ZooKeeper cluster, the HBase master node, the region servers, the Hadoop Distributed File System (HDFS), and the HBase client.

The ZooKeeper cluster acts as a coordination service for the entire HBase cluster, handling master selection, root region server lookup, node registration, and so on. The master node does not do heavy tasks. Its job includes region allocation and failover, log splitting, and load balancing. Region servers hold the actual regions; they handle I/O requests to the hosting regions, flush the in-memory...

Setting up Hadoop to spread disk I/O


Modern servers usually have multiple disk devices to provide large storage capacity. These disks are usually configured as RAID arrays as their factory default. This is good for many cases, but not for Hadoop.

The Hadoop slave node stores HDFS data blocks and MapReduce temporary files on its local disks. These local disk operations benefit from using multiple independent disks to spread disk I/O.

In this recipe, we will describe how to set up Hadoop to use multiple disks to spread its disk I/O.

Getting ready

We assume you have multiple disks for each DataNode node. These disks are in a JBOD (Just a Bunch Of Disks) or RAID0 configuration. Assume that the disks are mounted at /mnt/d0, /mnt/d1, ..., /mnt/dn, and the user who starts HDFS has write permission on each mount point.

How to do it...

In order to set up Hadoop to spread disk I/O, follow these instructions:

  1. On each DataNode, create directories on each disk for HDFS to store its data blocks...
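The remaining steps are not shown in this excerpt. As a minimal sketch of where they lead, assuming the /mnt/d0 ... /mnt/dn mount points from the Getting ready section and a hypothetical dfs/data subdirectory name, the per-disk directories are listed, comma-separated, in the dfs.data.dir property of hdfs-site.xml (and, for MapReduce temporary files, in mapred.local.dir), so that HDFS spreads its I/O across the disks. Host names here are illustrative:

    hadoop@slave1$ mkdir -p /mnt/d0/dfs/data /mnt/d1/dfs/data
    hadoop@slave1$ vi $HADOOP_HOME/conf/hdfs-site.xml
    <property>
      <name>dfs.data.dir</name>
      <!-- one directory per physical disk; HDFS round-robins new blocks across them -->
      <value>/mnt/d0/dfs/data,/mnt/d1/dfs/data</value>
    </property>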

Using a network topology script to make Hadoop rack-aware


Hadoop has the concept of "Rack Awareness". Administrators can define the rack of each DataNode in the cluster. Making Hadoop rack-aware is extremely important because:

  • Rack awareness prevents data loss: HDFS places block replicas on more than one rack, so even the failure of an entire rack does not lose every copy of a block

  • Rack awareness improves network performance: bandwidth within a rack is greater than bandwidth between racks, and a rack-aware Hadoop prefers in-rack transfers

In this recipe, we will describe how to make Hadoop rack-aware and why it is important.

Getting ready

You will need to know the rack to which each of your slave nodes belongs. Log in to the master node as the user who started Hadoop.

How to do it...

The following steps describe how to make Hadoop rack-aware:

  1. Create a topology.sh script and store it under the Hadoop configuration directory. Change the path to topology.data, in line 3, to fit your environment:

    hadoop@master1$ vi $HADOOP_HOME/conf/topology.sh
    while [ $# -gt 0 ] ; do
      nodeArg=$1
      exec< /usr/local/hadoop/current/conf/topology.data
      result=""
      while read line ; do
        ar=( $line )
        if [ "${ar[0]}" = "$nodeArg" ] ; then
          result="${ar[1]}"
        fi
      done
      shift
      if [ -z "$result" ] ; then
        echo -n "/default-rack "
      else
        echo -n "$result "
      fi
    done
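The script is driven by a topology.data file that maps each node to a rack. A minimal illustrative example follows, with hypothetical host names, IP addresses, and rack names; Hadoop 1.x is then pointed at the script through the topology.script.file.name property in core-site.xml:

    hadoop@master1$ vi /usr/local/hadoop/current/conf/topology.data
    slave1        /dc1/rack1
    10.160.30.108 /dc1/rack1
    slave2        /dc1/rack2
    10.160.30.109 /dc1/rack2

    hadoop@master1$ vi $HADOOP_HOME/conf/core-site.xml
    <property>
      <name>topology.script.file.name</name>
      <value>/usr/local/hadoop/current/conf/topology.sh</value>
    </property>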

Mounting disks with noatime and nodiratime


If you are mounting disks purely for Hadoop and you use the ext3, ext4, or XFS filesystem, we recommend mounting the disks with the noatime and nodiratime attributes.

If you mount the disks with noatime, file access timestamps are not updated when a file is read. Likewise, with nodiratime, directory inode access times are not updated when a directory is read. As no extra disk I/O is spent updating access timestamps, this speeds up filesystem reads.

In this recipe, we will describe why the noatime and nodiratime options are recommended for Hadoop, and how to mount disks with noatime and nodiratime.

Getting ready

You will need root privileges on your slave nodes. We assume you have two disks used only by Hadoop: /dev/xvdc and /dev/xvdd. The two disks are mounted at /mnt/is1 and /mnt/is2, respectively. Also, we assume you are using the ext3 filesystem.

How to do it...

To mount disks with noatime and...
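The steps are truncated in this excerpt, but the usual approach, using the devices and mount points from the Getting ready section, is to add the attributes to the mount options in /etc/fstab and then remount; a sketch:

    root# vi /etc/fstab
    /dev/xvdc /mnt/is1 ext3 defaults,noatime,nodiratime 0 0
    /dev/xvdd /mnt/is2 ext3 defaults,noatime,nodiratime 0 0

    root# mount -o remount /mnt/is1
    root# mount -o remount /mnt/is2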

Setting vm.swappiness to 0 to avoid swap


Linux moves memory pages that have not been accessed for some time to the swap space, even if there is enough free memory available. This is called swapping out. Conversely, reading swapped data back from the swap space into memory is called swapping in. Swapping is necessary in many situations, but because the Java Virtual Machine (JVM) does not behave well under swapping, HBase may run into trouble if its pages are swapped out. An expired ZooKeeper session is a typical problem that swapping can introduce.

In this recipe, we will describe how to tune the Linux vm.swappiness parameter to avoid swap.

Getting ready

Make sure you have root privileges on your nodes in the cluster.

How to do it...

To tune the Linux parameter to avoid swapping, invoke the following on each node in the cluster:

  1. Execute the following command to set vm.swappiness to 0:

    root# sysctl -w vm.swappiness=0
    vm.swappiness = 0
    
    • This change will last only until the next reboot of the server.

  2. Add the...
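Step 2 is truncated above; the conventional way to make the setting survive a reboot is to add it to /etc/sysctl.conf, sketched as follows:

    root# echo "vm.swappiness = 0" >> /etc/sysctl.conf
    root# sysctl -p
    vm.swappiness = 0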

Java GC and HBase heap settings


As HBase runs within the JVM, the JVM Garbage Collection (GC) settings are very important for HBase to run smoothly and with high performance. In addition to the general guidelines for configuring the HBase heap, it is also important to have the HBase processes output their GC logs, and then tune the JVM settings based on those logs.

We will describe the most important HBase JVM heap settings as well as how to enable and understand GC logging, in this recipe. We will also cover some general guidelines to tune Java GC settings for HBase.

Getting ready

Log in to your HBase region server.

How to do it...

The following are the recommended Java GC and HBase heap settings:

  1. Give HBase enough heap size by editing the hbase-env.sh file. For example, the following snippet configures an 8000 MB heap for HBase:

    $ vi $HBASE_HOME/conf/hbase-env.sh
    export HBASE_HEAPSIZE=8000
    
  2. Enable GC logging by using the following command:

    $ vi $HBASE_HOME/conf/hbase-env.sh...
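The listing above is truncated in this excerpt. As an illustration only, a typical JDK 6/7-era way to enable GC logging in hbase-env.sh looks like the following; the exact flag set and log path here are assumptions, not the book's listing:

    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"

After restarting the region server, the GC log shows each collection's pause time, which is the number to watch when tuning the heap.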

Using compression


One of the most important features of HBase is the use of data compression. It's important because:

  • It reduces the number of bytes written to and read from HDFS

  • It saves disk space

  • It reduces the network bandwidth needed when getting data from a remote server

HBase supports the GZip and LZO codecs. Our suggestion is to use the LZO compression algorithm, because of its fast decompression and low CPU usage. If a better compression ratio is preferred, you should consider GZip instead.

Unfortunately, HBase cannot ship with LZO because of a license issue. HBase is Apache-licensed, whereas LZO is GPL-licensed. Therefore, we need to install LZO ourselves. We will use the hadoop-lzo library, which brings splittable LZO compression to Hadoop.

In this recipe, we will describe how to install LZO and how to configure HBase to use LZO compression.

Getting ready

Make sure Java is installed on the machine on which hadoop-lzo is to be built.

Apache Ant is required to build...
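The build and installation steps are not shown in this excerpt. Once the codec is installed, compression is enabled per column family; for example, creating a table with an LZO-compressed family from the HBase shell looks like this (the table and family names are illustrative):

    hbase> create 't1', {NAME => 'f1', COMPRESSION => 'LZO'}
    hbase> describe 't1'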

Managing compactions


An HBase table has the following physical storage structure:

It consists of multiple regions. A region may have several Stores, each holding a single column family. An edit is first written to the hosting region's in-memory space, which is called the MemStore. When the size of the MemStore reaches a threshold, it is flushed to StoreFiles on HDFS.

As data grows, there may be many StoreFiles on HDFS, which hurts read performance. Therefore, HBase automatically picks up a couple of the smaller StoreFiles and rewrites them into a bigger one. This process is called minor compaction. In certain situations, or when triggered at a configured interval (once a day by default), major compaction runs automatically. Major compaction drops the deleted or expired cells and rewrites all the StoreFiles in the Store into a single StoreFile; this usually improves read performance.

However, as major compaction rewrites all of the Stores' data, lots of disk I/O and network...
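The rest of this recipe is not shown here, but the core idea of managing compactions can be sketched: turn off the daily automatic major compaction by setting its interval to zero in hbase-site.xml, then trigger major compactions from the HBase shell at a time of low load (the table name is illustrative):

    $ vi $HBASE_HOME/conf/hbase-site.xml
    <property>
      <name>hbase.hregion.majorcompaction</name>
      <!-- 0 disables time-based major compactions; trigger them manually instead -->
      <value>0</value>
    </property>

    hbase> major_compact 't1'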

Managing a region split


Usually an HBase table starts with a single region. However, as data keeps growing and the region reaches its configured maximum size, it is automatically split into two halves, so that they can handle more data. The following diagram shows an HBase region splitting:

This is the default behavior of HBase region splitting. This mechanism works well in many cases, but in some situations it causes problems, such as the split/compaction storm issue.

With a roughly uniform data distribution and growth, eventually all the regions in the table will need to be split at the same time. Immediately following a split, compactions will run on the daughter regions to rewrite their data into separate files. This causes a large amount of disk I/O and network traffic.

In order to avoid this situation, you can turn off automatic splitting and manually invoke it. As you can control at what time to invoke the splitting, it helps spread the I/O load. Another advantage...
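The remainder of this recipe is not shown here, but the approach can be sketched: raise the region size limit high enough that automatic splits effectively never fire, then split manually from the HBase shell when it suits you (the 100 GB value and table name are illustrative):

    $ vi $HBASE_HOME/conf/hbase-site.xml
    <property>
      <name>hbase.hregion.max.filesize</name>
      <!-- ~100 GB; regions will practically never reach this, so splits become manual -->
      <value>107374182400</value>
    </property>

    hbase> split 't1'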
