Chapter 9. Advanced Configurations and Tuning

In this chapter, we will cover:

  • Benchmarking HBase cluster with YCSB

  • Increasing region server handler count

  • Precreating regions using your own algorithm

  • Avoiding update blocking on write-heavy clusters

  • Tuning memory size for MemStores

  • Client side tuning for low latency systems

  • Configuring block cache for column families

  • Increasing block cache size on read-heavy clusters

  • Client side scanner setting

  • Tuning block size to improve seek performance

  • Enabling Bloom Filter to improve the overall throughput

Introduction


This is another chapter about performance tuning. In Chapter 8, Basic Performance Tuning, we described some recipes to tune Hadoop, OS settings, Java, and HBase itself to improve the overall performance of the HBase cluster. Those are general improvements for many use cases. In this chapter, we will describe more "specific" recipes; some of them are for write-heavy clusters, while others aim to improve the read performance of the cluster.

Before tuning an HBase cluster, you will need to know how it performs. Therefore, we will start by introducing how to use the Yahoo! Cloud Serving Benchmark (YCSB) to measure (benchmark) the performance of an HBase cluster.

In the recipe Precreating regions before moving data into HBase in Chapter 2, we introduced how to use HBase's RegionSplitter utility to create a table with precreated regions to improve data loading speed. While RegionSplitter by default precreates regions with MD5 number boundaries, for situations where row keys cannot be represented...

Benchmarking HBase cluster with YCSB


Measuring the performance of an HBase cluster, or benchmarking the cluster, is as important as tuning the cluster itself. The performance characteristics of an HBase cluster that we should measure include at least the following:

  • Overall throughput (operations per second) of the cluster

  • Average latency (average time per operation) of the cluster

  • Minimum latency

  • Maximum latency

  • Distribution of operation latencies

YCSB is a great tool to benchmark the performance of HBase clusters. YCSB supports running variable load tests in parallel, to evaluate the insert, update, delete, and read performance of the system. Therefore, you can use YCSB to benchmark both write-heavy and read-heavy HBase clusters. The record count to load, the operations to perform, the proportion of reads and writes, and many other properties are configurable for each test, so it is easy to use YCSB to test different load scenarios for the cluster.

YCSB can also be used to evaluate the performance of many...

Increasing region server handler count


The region server keeps a number of running threads to answer incoming requests to user tables. To prevent the region server from running out of memory, this number is set very low by default. In many situations, especially when you have lots of concurrent clients, you will need to increase this number to handle more requests.

We will describe how to tune the region server handler count in this recipe.

Getting ready

Log in to the master node as the user who starts HBase.

How to do it...

The following steps need to be followed to increase region server handler count:

  1. On the master node, add the following to your hbase-site.xml file:

    hadoop@master1$ vi $HBASE_HOME/conf/hbase-site.xml
    <property>
    <name>hbase.regionserver.handler.count</name>
    <value>40</value>
    </property>
    
  2. Sync the changes across the cluster:

    hadoop@master1$ for slave in `cat $HBASE_HOME/conf/regionservers`
    do
    rsync -avz $HBASE_HOME/conf/ $slave:$HBASE_HOME/conf...

Precreating regions using your own algorithm


When we create a table in HBase, the table starts with a single region. All data inserted into that table goes to the single region. As data keeps growing, when the size of the region reaches a threshold, Region Splitting happens. The single region is split into two halves so that the table can handle more data.

In a write-heavy HBase cluster, this approach has several issues that need to be fixed:

  • The split/compaction storm issue.

    As data grows uniformly, most of the regions are split at the same time, which causes huge disk I/O and network traffic.

  • Load is not well balanced until enough regions have been split.

    Especially right after the table is created, all requests go to the same region server where the first region is deployed.

The split/compaction storm issue was discussed in the Managing region split recipe in Chapter 8, Basic Performance Tuning, where we handled it by using a manual splitting approach. For the second issue, we introduced how to avoid it...
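
If you compute your own region boundaries, for example from a sample of your real row keys, you can also pass them directly to the Java admin API when creating the table. The following is a minimal sketch rather than the book's own utility; the table name, column family, and the simple hex-prefix split algorithm are only placeholders for your own logic:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PresplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table layout; replace with your own table and family names.
        HTableDescriptor desc = new HTableDescriptor("access_log");
        desc.addFamily(new HColumnDescriptor("f1"));

        // Split keys produced by your own algorithm; here simply the 15
        // single-character boundaries '1'..'f', which assumes row keys
        // start with a hex character and yields 16 regions.
        String hex = "123456789abcdef";
        byte[][] splits = new byte[hex.length()][];
        for (int i = 0; i < hex.length(); i++) {
          splits[i] = Bytes.toBytes(String.valueOf(hex.charAt(i)));
        }

        // Each split key becomes the start key of a region.
        admin.createTable(desc, splits);
      }
    }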

Avoiding update blocking on write-heavy clusters


On a write-heavy HBase cluster, you may observe an unstable write speed. Most of the writes are very fast, while some are slow. For an online system, this unstable write speed is not acceptable even when the average speed is very fast.

This situation is probably caused by the following two reasons:

  • Split/compaction puts the cluster under a very high load

  • Updates are blocked by the region server

As we described in Chapter 8, Basic Performance Tuning, you can avoid the split/compaction issues by disabling automatic split/compaction and invoking them manually at a low load time.

Grep your region server logs; if you find many messages saying "Blocking updates for ...", it is likely that many updates were blocked, and those updates might have had a poor response time.

To fix this issue, we need to tune both server-side and client-side configurations to gain a stable write speed. We will describe the most important server-side tuning to avoid update blocking in this recipe.

Getting...

Tuning memory size for MemStores


As we described in the Avoiding update blocking on write-heavy clusters recipe, HBase write operations are applied to the hosting region's MemStore first, and then flushed to HDFS to save memory space when the MemStore size reaches a threshold. The MemStore flush runs on background threads using a snapshot of the MemStore, so HBase keeps handling writes even while the MemStores are being flushed. This makes HBase writes very fast. However, if a write spike is so high that MemStore flushing cannot catch up with the speed at which writes fill the MemStores, the memory used by MemStores will keep growing. If the total size of all MemStores in a region server reaches a configurable threshold, updates are blocked and flushes are forced.

We will describe how to tune this total MemStore memory size to avoid update blocking in this recipe.

Getting ready

Log in to your master node as the user who starts HBase.

How to do it...

The following steps need to be carried out to tune memory size for MemStores:

  1. ...

Client-side tuning for low latency systems


We have introduced several recipes to avoid server-side blocking. Those recipes should help the cluster run stably and with high performance. Cluster throughput and average latency will be improved significantly by server-side tuning.

However, for low latency and real-time systems, server-side tuning alone is not enough. Even if they occur only occasionally, long pauses are not acceptable in low latency systems.

There are client-side configurations we can tune to avoid such long pauses. In this recipe, we will describe how to tune those configurations and how they work.

Getting ready

Log in to your HBase client node as the user who accesses HBase.

How to do it...

Follow these instructions to perform client-side tuning for write-heavy clusters:

  1. Reduce the hbase.client.pause property value in the hbase-site.xml file:

    $ vi $HBASE_HOME/conf/hbase-site.xml
    <property>
    <name>hbase.client.pause</name>
    <value>20</value>
    </property...
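
If editing hbase-site.xml on every client node is inconvenient, the same client properties can also be set programmatically on the client's Configuration object before any table is opened. This is only a sketch; the property values and the table name are examples, not recommendations from this recipe:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class LowLatencyClient {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Shorter pause between client retries (milliseconds); the same
        // effect as setting hbase.client.pause in hbase-site.xml.
        conf.setInt("hbase.client.pause", 20);

        // With a shorter pause, a higher retry count keeps the total retry
        // window reasonable; the value here is only an example.
        conf.setInt("hbase.client.retries.number", 11);

        // Tables opened with this configuration pick up the tuned values.
        HTable table = new HTable(conf, "mytable"); // "mytable" is hypothetical
        // ... perform reads and writes ...
        table.close();
      }
    }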

Configuring block cache for column families


HBase supports block cache to improve read performance. When performing a scan, if the block cache is enabled and there is room remaining, data blocks read from StoreFiles on HDFS are cached in the region server's Java heap space, so that the next time data in the same block is accessed, it can be served from the cached block. The block cache helps reduce the disk I/O needed to retrieve data.

The block cache is configurable at the table's column family level. Different column families can have different cache priorities, or even disable the block cache entirely. Applications can leverage this cache mechanism to fit different data sizes and access patterns.

In this recipe, we will describe how to configure the block cache for column families and give tips on leveraging the HBase block cache.

Getting ready

Log in to your HBase client node.

How to do it...

The following steps need to be carried out to configure block cache at column family level:

  1. Start HBase Shell:

    $ $HBASE_HOME/bin/hbase shell
    HBase Shell; enter...
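
As an alternative to HBase Shell, the same column family attributes can be set through the Java admin API when creating a table. The following is a minimal sketch with hypothetical table and family names; it disables the block cache for a rarely read family and marks a small, hot family as in-memory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class BlockCacheSchema {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("metrics"); // hypothetical table

        // Rarely read, sequentially scanned data: disable the block cache
        // so scans do not evict more valuable blocks.
        HColumnDescriptor raw = new HColumnDescriptor("raw");
        raw.setBlockCacheEnabled(false);
        desc.addFamily(raw);

        // Small, frequently read data: give it in-memory priority in the cache.
        HColumnDescriptor meta = new HColumnDescriptor("meta");
        meta.setInMemory(true);
        desc.addFamily(meta);

        admin.createTable(desc);
      }
    }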

Client side scanner setting


To achieve better read performance, besides server-side tuning, the scanner settings on the client application side are also important. Good client scanner settings make the scan process much more efficient. By contrast, a badly configured scanner will not only slow down the scan itself, but also have a negative effect on the region server. So we need to configure the client-side scanner settings carefully.

The most important scanner settings include scan caching, scan attribute selection, and scan block caching. We will describe how to configure these settings properly in this recipe.

Getting ready

Log in to your HBase client node as the user who accesses HBase.

How to do it...

The following steps need to be followed to change client side scanner settings:

  1. To fetch more rows when calling the next() method on a scanner, increase the hbase.client.scanner.caching property value in the hbase-site.xml file:

    $ vi $HBASE_HOME/conf/hbase-site.xml
    <property>
    <name...
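
Besides the hbase.client.scanner.caching property, the same three settings (scan caching, attribute selection, and scan block caching) can be applied per scan in your client code, which is often more flexible. The following is a minimal sketch; the table, family, and qualifier names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScannerTuning {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "access_log"); // hypothetical table

        Scan scan = new Scan();
        // Scan caching: fetch 500 rows per RPC instead of the default.
        scan.setCaching(500);
        // Attribute selection: only transfer the column we actually need.
        scan.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("url"));
        // Scan block caching: do not pollute the block cache with a one-off
        // full scan, for example a MapReduce job.
        scan.setCacheBlocks(false);

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result result : scanner) {
            // process each row
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }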

Tuning block size to improve seek performance


HBase data is stored in StoreFiles in the HFile format. StoreFiles are composed of HFile blocks. An HFile block is the smallest unit of data that HBase reads from its StoreFiles. It is also the basic unit that the region server caches in the block cache.

The size of the HFile block is an important tuning parameter. To achieve better performance, we should select a block size based on the average Key/Value size and disk I/O speed. Like the block cache and Bloom Filter, the HFile block size is also configurable at the column family level.

We will describe how to show the average Key/Value size and tune block size to improve seek performance in this recipe.

Getting ready

Log in to your HBase client node.

How to do it...

The following steps need to be carried out to tune block size to improve seek performance:

  1. Use the following command to show the average Key/Value size in an HFile. Change the file path to fit your environment. HFiles for a particular...
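
Once you know the average Key/Value size, the block size can be set per column family. As a rough guideline, smaller blocks make random reads (seeks) cheaper at the cost of a larger block index, while larger blocks favor sequential scans. The following is a minimal sketch using the Java admin API; the table name, family name, and the 16 KB value are only examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class BlockSizeSchema {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("user_profile"); // hypothetical table

        HColumnDescriptor family = new HColumnDescriptor("f1");
        // The default HFile block size is 64 KB; a smaller block such as
        // 16 KB can improve seek performance for small Key/Values, at the
        // cost of a larger block index held in memory.
        family.setBlocksize(16 * 1024);
        desc.addFamily(family);

        admin.createTable(desc);
      }
    }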

Enabling Bloom Filter to improve the overall throughput


HBase supports Bloom Filters to improve the overall throughput of the cluster. An HBase Bloom Filter is a space-efficient mechanism to test whether a StoreFile contains a specific row or row-col cell. For details of the Bloom Filter data structure, see http://en.wikipedia.org/wiki/Bloom_filter.

Without a Bloom Filter, the only way to decide whether a row key is contained in a StoreFile is to check the StoreFile's block index, which stores the start row key of each block in the StoreFile. It is very likely that the row key we are looking for will fall between two block start keys; if it does, then HBase has to load the block and scan from the block's start key to figure out whether that row key actually exists.

The problem here is that a number of StoreFiles will exist before a major compaction aggregates them into a single one. Thus, several StoreFiles may hold some cells of the requested row key.

Think about the following example; it is an image showing...
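
Bloom Filters are enabled per column family. The following is a minimal sketch using the Java API of the HBase versions this book targets, where the type is expressed as StoreFile.BloomType (later releases moved this enum to org.apache.hadoop.hbase.regionserver.BloomType); the table and family names are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class BloomFilterSchema {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("weblog"); // hypothetical table

        HColumnDescriptor family = new HColumnDescriptor("f1");
        // A ROW Bloom Filter lets a Get skip StoreFiles that definitely do
        // not contain the requested row key; use ROWCOL when reads usually
        // ask for specific columns of rows spread across many StoreFiles.
        family.setBloomFilterType(StoreFile.BloomType.ROW);
        desc.addFamily(family);

        admin.createTable(desc);
      }
    }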
