Chapter 2. Data Migration

In this chapter, we will cover:

  • Importing data from MySQL using a single client

  • Importing data from TSV files using the bulk load tool

  • Writing your own MapReduce job to import data

  • Precreating regions before moving data into HBase

Introduction


There are several ways to move data into HBase:

  • Using the HBase Put API

  • Using the HBase bulk load tool

  • Using a customized MapReduce job

The HBase Put API is the most straightforward method, and its usage is not difficult to learn. However, it is not always the most efficient method, especially when a large amount of data needs to be transferred into HBase within a limited time period. The volume of data to handle is usually huge, which is probably why you chose HBase rather than another database in the first place. You have to think carefully about how to move all that data into HBase at the beginning of your HBase project; otherwise, you might run into serious performance problems.
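
To illustrate the first approach, here is a minimal sketch of a single put via the client API, using the API as it stood when this book was written (HTable and Put.add; later HBase releases renamed some of these classes). The table name (mytable), column family (f), qualifier (q), row key, and value are placeholders for this sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate the cluster
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    Put put = new Put(Bytes.toBytes("row1"));
    // Write value "v1" to column f:q of row "row1"
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
    table.put(put);

    table.close();
  }
}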

HBase provides a bulk load feature that supports loading huge volumes of data into HBase efficiently. The bulk load feature uses a MapReduce job to load data into a specific HBase table, by generating files in HBase's internal HFile data format and then loading the data files...

Importing data from MySQL using a single client


The most common data migration task is probably importing data from an existing RDBMS into HBase. For this kind of task, the simplest and most straightforward way is to fetch the data from a single client and then put it into HBase using the HBase Put API. It works well when there is not too much data to transfer.

This recipe describes importing data from MySQL into HBase using the HBase Put API. All the operations will be executed on a single client; MapReduce is not involved in this recipe. The recipe leads you through creating an HBase table via HBase Shell, connecting to the cluster from Java, and then putting data into HBase.
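
As a sketch of what this single-client approach looks like end to end, the following reads rows from MySQL over JDBC and writes them to HBase with the Put API. The database URL, credentials, table names, column names, and batch size are all placeholders, and the MySQL Connector/J driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class Mysql2HBase {
  public static void main(String[] args) throws Exception {
    // Fetch the source data from MySQL on this single client
    Connection db = DriverManager.getConnection(
        "jdbc:mysql://localhost/sourcedb", "user", "password");
    Statement stmt = db.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT id, value FROM source_table");

    Configuration conf = HBaseConfiguration.create();
    HTable htable = new HTable(conf, "target_table");
    htable.setAutoFlush(false); // buffer puts on the client for throughput

    List<Put> puts = new ArrayList<Put>();
    while (rs.next()) {
      Put put = new Put(Bytes.toBytes(rs.getString("id")));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("value"),
          Bytes.toBytes(rs.getString("value")));
      puts.add(put);
      if (puts.size() >= 1000) { // send puts to HBase in batches
        htable.put(puts);
        puts.clear();
      }
    }
    if (!puts.isEmpty()) {
      htable.put(puts);
    }
    htable.close(); // flushes any remaining buffered puts

    rs.close();
    stmt.close();
    db.close();
  }
}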

Getting ready

Public data sets are an ideal data source for practicing HBase data migration. There are many public data sets available on the Internet. We will use NOAA's 1981-2010 Climate Normals public data set in this book. You can access it at http://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/.
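
For example, assuming the wget utility is available on your client node, you can fetch the data set with a command such as the following (the file layout under this directory may change over time):

$ wget -r http://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/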

This is climate statistics...

Importing data from TSV files using the bulk load tool


HBase ships with an importtsv tool that supports importing data from TSV files into HBase. Using this tool to load text data into HBase is very efficient, because it runs a MapReduce job to perform the import. Even if you are going to load data from an existing RDBMS, you can dump the data into text files and then use importtsv to import the dumped data into HBase. This approach works well when importing a huge amount of data, because dumping data to text files is much faster than executing SQL against the RDBMS.

The importtsv tool not only loads data directly into an HBase table; it can also generate files in HBase's internal (HFile) format, so that you can use the HBase bulk load tool to load the generated files directly into a running HBase cluster. This way, you reduce both the network traffic generated by the data transfer and the load on your HBase cluster during the migration.
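
As a quick preview of the two-step flow this recipe walks through, the invocations look roughly like the following. The table name (target_table), column mapping (f:v), and HDFS paths are placeholders for this sketch:

$ $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.columns=HBASE_ROW_KEY,f:v \
-Dimporttsv.bulk.output=/user/hadoop/output \
target_table /user/hadoop/input

$ $HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
/user/hadoop/output target_table

If you omit the importtsv.bulk.output option, importtsv writes directly into the table using the Put API instead of generating HFiles.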

This recipe describes usage of the importtsv and bulk load tools. We first demonstrate loading...

Writing your own MapReduce job to import data


Although the importtsv tool is very useful for loading text files into HBase, in many cases you may want to write your own MapReduce job to gain full control of the loading process. For example, the importtsv tool does not help if you need to load files in other formats.

HBase provides TableOutputFormat for writing data into an HBase table from a MapReduce job. You can also generate HBase's internal HFile format files in your MapReduce job, by using the HFileOutputFormat class, and then load the generated files into a running HBase cluster using the completebulkload tool we described in the previous recipe.
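
The following is a minimal sketch of a map-only job built on TableOutputFormat, assuming a hypothetical input format of one "rowkey<TAB>value" pair per line; the table name (target_table), column family (f), and qualifier (v) are placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ImportToHBase {

  // Map-only job: each input line becomes one Put against the table
  static class ImportMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) {
        return; // skip malformed lines
      }
      byte[] rowKey = Bytes.toBytes(fields[0]);
      Put put = new Put(rowKey);
      put.add(Bytes.toBytes("f"), Bytes.toBytes("v"),
          Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Tell TableOutputFormat which table to write to
    conf.set(TableOutputFormat.OUTPUT_TABLE, "target_table");

    Job job = new Job(conf, "import-to-hbase");
    job.setJarByClass(ImportToHBase.class);
    job.setMapperClass(ImportMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0); // map-only; puts go straight to the table

    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}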

In this recipe, we will explain the steps for loading data using your own MapReduce job. We will first describe how to use TableOutputFormat. In the There's more... section, we will explain how to generate HFile format files in a MapReduce job.

Getting ready

We will use the raw NOAA hly-temp-normal.txt file in this...

Precreating regions before moving data into HBase


Each HBase row belongs to a particular region. A region holds a range of sorted HBase rows. Regions are deployed and managed by a region server.

When we create a table in HBase, the table starts with a single region, and all data inserted into the table goes to that single region at first. As data keeps being inserted, the region grows, and once it reaches a threshold it is split into two halves. This is called region splitting. Split regions are distributed to other region servers, so that the load can be balanced across the cluster.

As you can imagine, if we initialize the table with precreated regions, using an appropriate splitting algorithm, the load of the data migration will be balanced over the entire cluster, which increases the data loading speed significantly.

We will describe how to create a table with precreated regions in this recipe.
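
Besides the shell command used in the following recipe, you can also create a presplit table from the Java client API. In this sketch the table name, column family, and split keys are placeholders; real split points should match the row key distribution of the data you are about to load:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("target_table");
    desc.addFamily(new HColumnDescriptor("f"));

    // Hypothetical split points for numeric row keys; the table will
    // start with five regions instead of one
    byte[][] splits = new byte[][] {
        Bytes.toBytes("2000000"),
        Bytes.toBytes("4000000"),
        Bytes.toBytes("6000000"),
        Bytes.toBytes("8000000"),
    };
    admin.createTable(desc, splits);
    admin.close();
  }
}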

Getting ready

Log in to your HBase client node.

How to do it...

Execute the following command on the client node:

$ $HBASE_HOME...