Chapter 9. Advanced Configurations and Tuning

In this chapter, we will cover:

  • Benchmarking HBase cluster with YCSB

  • Increasing region server handler count

  • Precreating regions using your own algorithm

  • Avoiding update blocking on write-heavy clusters

  • Tuning memory size for MemStores

  • Client side tuning for low latency systems

  • Configuring block cache for column families

  • Increasing block cache size on read-heavy clusters

  • Client side scanner setting

  • Tuning block size to improve seek performance

  • Enabling Bloom Filter to improve the overall throughput

Introduction


This is another chapter about performance tuning. In Chapter 8, Basic Performance Tuning, we described some recipes to tune Hadoop, OS settings, Java, and HBase itself to improve the overall performance of the HBase cluster. Those are general improvements for many use cases. In this chapter, we will describe more "specific" recipes; some of them are for write-heavy clusters, while others aim to improve the read performance of the cluster.

Before tuning an HBase cluster, you will need to know how it performs. Therefore, we will start by introducing how to use the Yahoo! Cloud Serving Benchmark (YCSB) to measure (benchmark) the performance of an HBase cluster.

In the recipe Precreating regions before moving data into HBase in Chapter 2, we introduced how to use HBase's RegionSplitter utility to create a table with precreated regions to improve data loading speed. While RegionSplitter by default precreates regions with MD5 number boundaries, for situations where row keys cannot be represented...

Benchmarking HBase cluster with YCSB


Measuring the performance of an HBase cluster, or benchmarking the cluster, is as important as tuning the cluster itself. The performance characteristics of an HBase cluster that we should measure include at least the following:

  • Overall throughput (operations per second) of the cluster

  • Average latency (average time per operation) of the cluster

  • Minimum latency

  • Maximum latency

  • Distribution of operation latencies

YCSB is a great tool to benchmark the performance of HBase clusters. YCSB supports running variable load tests in parallel, to evaluate the insert, update, delete, and read performance of the system. Therefore, you can use YCSB to benchmark both write-heavy and read-heavy HBase clusters. The record count to load, the operations to perform, the proportion of reads and writes, and many other properties are configurable for each test, so it is easy to use YCSB to test different load scenarios for the cluster.

YCSB can also be used to evaluate the performance of many...

Increasing region server handler count


The region server keeps a number of running threads to answer incoming requests to user tables. To prevent the region server from running out of memory, this number is set very low by default. In many situations, especially when you have lots of concurrent clients, you will need to increase this number to handle more requests.

We will describe how to tune the region server handler count in this recipe.

Getting ready

Log in to the master node as the user who starts HBase.

How to do it...

The following steps need to be followed to increase region server handler count:

  1. On the master node, add the following to your hbase-site.xml file:

    hadoop@master1$ vi $HBASE_HOME/conf/hbase-site.xml
    <property>
    <name>hbase.regionserver.handler.count</name>
    <value>40</value>
    </property>
    
  2. Sync the changes across the cluster:

    hadoop@master1$ for slave in `cat $HBASE_HOME/conf/regionservers`
    do
    rsync -avz $HBASE_HOME/conf/ $slave:$HBASE_HOME/conf...

Precreating regions using your own algorithm


When we create a table in HBase, the table starts with a single region. All data inserted into that table goes to the single region. As data keeps growing, when the size of the region reaches a threshold, Region Splitting happens. The single region is split into two halves so that the table can handle more data.

In a write-heavy HBase cluster, this approach has several issues that need to be fixed:

  • The split/compaction storm issue.

    As data grows uniformly, most of the regions are split at the same time, which causes huge disk I/O and network traffic.

  • Load is not well balanced until enough regions have been split.

    Especially right after the table is created, all requests go to the same region server where the first region is deployed.

The split/compaction storm issue was discussed in the Managing region split recipe in Chapter 8, Basic Performance Tuning, where we handled it by using a manual splitting approach. For the second issue, we introduced how to avoid it...
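
If you compute your own region boundaries, for example from a sample of your real row keys, you can also pass them directly to the Java admin API when creating the table. The following is a minimal sketch rather than the book's own utility; the table name, column family, and the simple hex-prefix split algorithm are only placeholders for your own logic:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PresplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table layout; replace with your own table and family names.
        HTableDescriptor desc = new HTableDescriptor("access_log");
        desc.addFamily(new HColumnDescriptor("f1"));

        // Split keys produced by your own algorithm; here simply the 15
        // single-character boundaries '1'..'f', which assumes row keys
        // start with a hex character and yields 16 regions.
        String hex = "123456789abcdef";
        byte[][] splits = new byte[hex.length()][];
        for (int i = 0; i < hex.length(); i++) {
          splits[i] = Bytes.toBytes(String.valueOf(hex.charAt(i)));
        }

        // Each split key becomes the start key of a region.
        admin.createTable(desc, splits);
      }
    }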

Avoiding update blocking on write-heavy clusters


On a write-heavy HBase cluster, you may observe an unstable write speed. Most of the writes are very fast, while some are slow. For an online system, this unstable write speed is not acceptable even when the average speed is very fast.

This situation is probably caused by the following two reasons:

  • Split/compaction puts the cluster under a very high load

  • Updates are blocked by the region server

As we described in Chapter 8, Basic Performance Tuning, you can avoid the split/compaction issues by disabling automatic split/compaction and invoking them manually at a low load time.

Grep your region server logs; if you find many messages saying "Blocking updates for ...", it is likely that many updates were blocked, and those updates might have had a poor response time.

To fix this issue, we need to tune both server-side and client-side configurations to gain a stable write speed. We will describe the most important server-side tuning to avoid update blocking in this recipe.

Getting...

Tuning memory size for MemStores


As we described in the Avoiding update blocking on write-heavy clusters recipe, HBase write operations are applied to the hosting region's MemStore first, and then flushed to HDFS to save memory space when the MemStore size reaches a threshold. The MemStore flush runs on background threads using a snapshot of the MemStore, so HBase keeps handling writes even while the MemStores are being flushed. This makes HBase writes very fast. However, if a write spike is so high that MemStore flushing cannot catch up with the speed at which writes fill the MemStores, the memory used by MemStores will keep growing. If the total size of all MemStores in a region server reaches a configurable threshold, updates are blocked and flushes are forced.

We will describe how to tune this total MemStore memory size to avoid update blocking in this recipe.

Getting ready

Log in to your master node as the user who starts HBase.

How to do it...

The following steps need to be carried out to tune memory size for MemStores:

  1. ...

Client-side tuning for low latency systems


We have introduced several recipes to avoid server-side blocking. Those recipes should help the cluster run stably and with high performance. Cluster throughput and average latency will be improved significantly by server-side tuning.

However, for low latency and real-time systems, server-side tuning alone is not enough. Even if they occur only occasionally, long pauses are not acceptable in low latency systems.

There are client-side configurations we can tune to avoid such long pauses. In this recipe, we will describe how to tune those configurations and how they work.

Getting ready

Log in to your HBase client node as the user who accesses HBase.

How to do it...

Follow these instructions to perform client-side tuning for write-heavy clusters:

  1. Reduce the hbase.client.pause property value in the hbase-site.xml file:

    $ vi $HBASE_HOME/conf/hbase-site.xml
    <property>
    <name>hbase.client.pause</name>
    <value>20</value>
    </property...
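
If editing hbase-site.xml on every client node is inconvenient, the same client properties can also be set programmatically on the client's Configuration object before any table is opened. This is only a sketch; the property values and the table name are examples, not recommendations from this recipe:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class LowLatencyClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Shorter pause between client retries (milliseconds); the same
        // effect as setting hbase.client.pause in hbase-site.xml.
        conf.setInt("hbase.client.pause", 20);

        // With a shorter pause, a higher retry count keeps the total retry
        // window reasonable; the value here is only an example.
        conf.setInt("hbase.client.retries.number", 11);

        // Tables opened with this configuration pick up the tuned values.
        HTable table = new HTable(conf, "mytable"); // "mytable" is hypothetical
        // ... perform reads and writes ...
        table.close();
      }
    }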

Configuring block cache for column families


HBase supports block cache to improve read performance. When performing a scan, if the block cache is enabled and there is room remaining, data blocks read from StoreFiles on HDFS are cached in the region server's Java heap space, so that the next time data in the same block is accessed, it can be served from the cached block. The block cache helps reduce the disk I/O needed to retrieve data.

The block cache is configurable at the table's column family level. Different column families can have different cache priorities, or even disable the block cache entirely. Applications can leverage this cache mechanism to fit different data sizes and access patterns.

In this recipe, we will describe how to configure the block cache for column families and give tips on leveraging the HBase block cache.

Getting ready

Log in to your HBase client node.

How to do it...

The following steps need to be carried out to configure block cache at column family level:

  1. Start HBase Shell:

    $ $HBASE_HOME/bin/hbase shell
    HBase Shell; enter...
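
As an alternative to HBase Shell, the same column family attributes can be set through the Java admin API when creating a table. The following is a minimal sketch with hypothetical table and family names; it disables the block cache for a rarely read family and marks a small, hot family as in-memory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class BlockCacheSchema {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("metrics"); // hypothetical table

        // Rarely read, sequentially scanned data: disable the block cache
        // so scans do not evict more valuable blocks.
        HColumnDescriptor raw = new HColumnDescriptor("raw");
        raw.setBlockCacheEnabled(false);
        desc.addFamily(raw);

        // Small, frequently read data: give it in-memory priority in the cache.
        HColumnDescriptor meta = new HColumnDescriptor("meta");
        meta.setInMemory(true);
        desc.addFamily(meta);

        admin.createTable(desc);
      }
    }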

Client side scanner setting


To achieve better read performance, besides server-side tuning, the scanner settings on the client application side are also important. Good client scanner settings make the scan process much more efficient. By contrast, a badly configured scanner will not only slow down the scan itself, but also have a negative effect on the region server. So we need to configure the client-side scanner settings carefully.

The most important scanner settings include scan caching, scan attribute selection, and scan block caching. We will describe how to configure these settings properly in this recipe.

Getting ready

Log in to your HBase client node as the user who accesses HBase.

How to do it...

The following steps need to be followed to change client side scanner settings:

  1. To fetch more rows when calling the next() method on a scanner, increase the hbase.client.scanner.caching property value in the hbase-site.xml file:

    $ vi $HBASE_HOME/conf/hbase-site.xml
    <property>
    <name...
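
Besides the hbase.client.scanner.caching property, the same three settings (scan caching, attribute selection, and scan block caching) can be applied per scan in your client code, which is often more flexible. The following is a minimal sketch; the table, family, and qualifier names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScannerTuning {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "access_log"); // hypothetical table

        Scan scan = new Scan();
        // Scan caching: fetch 500 rows per RPC instead of the default.
        scan.setCaching(500);
        // Attribute selection: only transfer the column we actually need.
        scan.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("url"));
        // Scan block caching: do not pollute the block cache with a one-off
        // full scan, for example a MapReduce job.
        scan.setCacheBlocks(false);

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result result : scanner) {
            // process each row
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }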

Tuning block size to improve seek performance


HBase data is stored in StoreFiles in the HFile format. StoreFiles are composed of HFile blocks. An HFile block is the smallest unit of data that HBase reads from its StoreFiles. It is also the basic unit that the region server caches in the block cache.

The size of the HFile block is an important tuning parameter. To achieve better performance, we should select a block size based on the average Key/Value size and disk I/O speed. Like the block cache and Bloom Filter, the HFile block size is also configurable at the column family level.

We will describe how to show the average Key/Value size and tune block size to improve seek performance in this recipe.

Getting ready

Log in to your HBase client node.

How to do it...

The following steps need to be carried out to tune block size to improve seek performance:

  1. Use the following command to show the average Key/Value size in an HFile. Change the file path to fit your environment. HFiles for a particular...
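
Once you know the average Key/Value size, the block size can be set per column family. As a rough guideline, smaller blocks make random reads (seeks) cheaper at the cost of a larger block index, while larger blocks favor sequential scans. The following is a minimal sketch using the Java admin API; the table name, family name, and the 16 KB value are only examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class BlockSizeSchema {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("user_profile"); // hypothetical table

        HColumnDescriptor family = new HColumnDescriptor("f1");
        // The default HFile block size is 64 KB; a smaller block such as
        // 16 KB can improve seek performance for small Key/Values, at the
        // cost of a larger block index held in memory.
        family.setBlocksize(16 * 1024);
        desc.addFamily(family);

        admin.createTable(desc);
      }
    }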

Enabling Bloom Filter to improve the overall throughput


HBase supports Bloom Filters to improve the overall throughput of the cluster. An HBase Bloom Filter is a space-efficient mechanism to test whether a StoreFile contains a specific row or row-col cell. For details of the Bloom Filter data structure, see http://en.wikipedia.org/wiki/Bloom_filter.

Without a Bloom Filter, the only way to decide whether a row key is contained in a StoreFile is to check the StoreFile's block index, which stores the start row key of each block in the StoreFile. It is very likely that the row key we are looking for will fall between two block start keys; if it does, then HBase has to load the block and scan from the block's start key to figure out whether that row key actually exists.

The problem here is that a number of StoreFiles will exist before a major compaction aggregates them into a single one. Thus, several StoreFiles may hold some cells of the requested row key.

Think about the following example; it is an image showing...
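
Bloom Filters are enabled per column family. The following is a minimal sketch using the Java API of the HBase versions this book targets, where the type is expressed as StoreFile.BloomType (later releases moved this enum to org.apache.hadoop.hbase.regionserver.BloomType); the table and family names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class BloomFilterSchema {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("weblog"); // hypothetical table

        HColumnDescriptor family = new HColumnDescriptor("f1");
        // A ROW Bloom Filter lets a Get skip StoreFiles that definitely do
        // not contain the requested row key; use ROWCOL when reads usually
        // ask for specific columns of rows spread across many StoreFiles.
        family.setBloomFilterType(StoreFile.BloomType.ROW);
        desc.addFamily(family);

        admin.createTable(desc);
      }
    }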
