You're reading from HBase Administration Cookbook
There are several ways to move data into HBase:
Using the HBase Put API
Using the HBase bulk load tool
Using a customized MapReduce job
The HBase Put API is the most straightforward method, and its usage is easy to learn. For most situations, however, it is not the most efficient method. This is especially true when a large amount of data needs to be transferred into HBase within a limited time period. The volume of data to handle is usually huge, which is probably why you are using HBase rather than another database. You have to think carefully about how to move all that data into HBase at the beginning of your HBase project; otherwise, you might run into serious performance problems.
HBase has the bulk load feature to support loading of huge volumes of data efficiently into HBase. The bulk load feature uses a MapReduce job to load data into a specific HBase table by generating HBase's internal HFile data format files and then loading the data files...
The most common case of data migration is probably importing data from an existing RDBMS into HBase. For this kind of task, the simplest and most straightforward way is to fetch the data from a single client and then put it into HBase, using the HBase Put API. It works well as long as there is not too much data to transfer.
This recipe describes importing data from MySQL into HBase using its Put API. All the operations will be executed on a single client. MapReduce is not included in this recipe. This recipe leads you through creating an HBase table via HBase Shell, connecting to the cluster from Java, and then putting data into HBase.
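The single-client import loop can be sketched in Java as follows. The table name (hly_temp), the column family and qualifier (n:v01), and the comma-separated input layout are assumptions for illustration; adapt them to your own schema. The code uses the classic hbase-client API of this book's era (HTable, Put) and requires a running cluster, so treat it as a sketch rather than a drop-in program:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutImporter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "hly_temp");      // hypothetical table name
    table.setAutoFlush(false);                        // buffer puts client-side for better throughput
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split(",");              // assumed layout: rowkey,value
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.add(Bytes.toBytes("n"), Bytes.toBytes("v01"), Bytes.toBytes(fields[1]));
      table.put(put);
    }
    table.flushCommits();                             // flush any buffered puts
    table.close();
    in.close();
  }
}
```

Disabling auto-flush and flushing at the end batches many puts into fewer RPCs, which matters most in exactly this single-client scenario.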
Public data sets are an ideal data source to practice HBase data migration. There are many public data sets available on the internet. We will use NOAA's 1981-2010 Climate Normals public data set in this book. You can access it at http://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/.
This is climate statistics...
HBase has an importtsv tool to support importing data from TSV files into HBase. Using this tool to load text data into HBase is very efficient, because it runs a MapReduce job to perform the import. Even if you are going to load data from an existing RDBMS, you can dump the data into a text file somehow and then use importtsv to import the dumped data into HBase. This approach works well when importing a huge amount of data, as dumping data is much faster than executing SQL against the RDBMS.
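A typical importtsv invocation looks like the following. The table name (hly_temp), column mapping, and HDFS input path are placeholders for illustration; the -Dimporttsv.columns property maps TSV fields to the row key and to column family:qualifier pairs. The command assumes HBASE_HOME is set and the cluster is running:

```sh
$ hadoop jar ${HBASE_HOME}/hbase-*.jar importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,n:v01 \
  hly_temp \
  /user/hac/input/tsv-files
```

By default the tool expects tab-separated fields; if your dump uses a different delimiter, you can override it with the -Dimporttsv.separator property.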
The importtsv tool not only loads data directly into an HBase table, but also supports generating HBase internal format (HFile) files, so that you can use the HBase bulk load tool to load the generated files directly into a running HBase cluster. This way, you reduce both the network traffic generated by the data transfers and the load on your HBase cluster during the migration.
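The two-step bulk load variant can be sketched as follows; the table name, column mapping, and HDFS paths are again placeholders. Adding -Dimporttsv.bulk.output makes importtsv write HFiles to the given directory instead of putting data into the table, and completebulkload then moves those files into the running cluster:

```sh
# Step 1: generate HFiles rather than writing to the table directly
$ hadoop jar ${HBASE_HOME}/hbase-*.jar importtsv \
  -Dimporttsv.bulk.output=/user/hac/output/hfiles \
  -Dimporttsv.columns=HBASE_ROW_KEY,n:v01 \
  hly_temp \
  /user/hac/input/tsv-files

# Step 2: load the generated HFiles into the target table
$ hadoop jar ${HBASE_HOME}/hbase-*.jar completebulkload \
  /user/hac/output/hfiles \
  hly_temp
```

The target table must already exist before both steps, and step 2 is fast because HFiles are moved into place rather than re-written through the normal write path.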
This recipe describes usage of the importtsv and bulk load tools. We first demonstrate loading...
Although the importtsv tool is very useful for loading text files into HBase, in many cases you may want to write your own MapReduce job to import data into HBase, for full control of the loading process. For example, the importtsv tool does not work if you are going to load files in other formats.
HBase provides TableOutputFormat for writing data into an HBase table from a MapReduce job. You can also generate HBase's internal HFile format files in your MapReduce job by using the HFileOutputFormat class, and then load the generated files into a running HBase cluster, using the completebulkload tool we described in the previous recipe.
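A map-only job writing to HBase via TableOutputFormat can be sketched as follows. The table name (hly_temp), family/qualifier (n:v01), and the comma-separated input layout are assumptions for illustration; the mapper emits Put objects, which TableOutputFormat writes to the configured table. This requires the HBase and Hadoop libraries and a running cluster:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ImportJob {
  static class ImportMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");   // assumed layout: rowkey,value
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.add(Bytes.toBytes("n"), Bytes.toBytes("v01"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(Bytes.toBytes(fields[0])), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "hly_temp"); // hypothetical target table
    Job job = Job.getInstance(conf, "import-to-hbase");
    job.setJarByClass(ImportJob.class);
    job.setMapperClass(ImportMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setNumReduceTasks(0);                             // map-only: no reduce phase needed
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Setting the reduce task count to zero is the key design choice here: each mapper writes its puts directly, so no shuffle or sort is paid for.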
In this recipe, we will explain the steps for loading data using your own MapReduce job. We will first describe how to use TableOutputFormat
. In the There's more... section, we will explain how to generate HFile format files in a MapReduce job.
Each HBase row belongs to a particular region. A region holds a range of sorted HBase rows. Regions are deployed and managed by a region server.
When we create a table in HBase, the table starts with a single region, and all data inserted into the table goes to that single region at first. As data keeps being inserted and the region reaches a threshold size, the region is split into two halves. This is called region splitting. Split regions are distributed to other region servers, so that the load can be balanced across the cluster.
As you can imagine, if we initialize the table with precreated regions, using an appropriate splitting algorithm, the load of the data migration will be balanced over the entire cluster, which increases the data load speed significantly.
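One common approach, similar in spirit to the HexStringSplit algorithm of HBase's RegionSplitter utility (this sketch is a simplified stand-in, not HBase's exact implementation), is to divide a fixed hexadecimal keyspace evenly. The following plain-Java sketch computes the split keys you would pass to table creation:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class SplitKeys {
    // Compute (numRegions - 1) evenly spaced 8-character hex split keys
    // over the keyspace 00000000..ffffffff, so each precreated region
    // covers an equal slice of the row-key space.
    public static List<String> hexSplits(int numRegions) {
        BigInteger range = BigInteger.ONE.shiftLeft(32); // 2^32 possible keys
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            BigInteger key = range.multiply(BigInteger.valueOf(i))
                                  .divide(BigInteger.valueOf(numRegions));
            splits.add(String.format("%08x", key)); // zero-padded 8-char hex
        }
        return splits;
    }

    public static void main(String[] args) {
        // For 4 regions we need 3 boundaries
        System.out.println(hexSplits(4)); // → [40000000, 80000000, c0000000]
    }
}
```

The resulting boundaries can then be supplied when creating the table, for example via the SPLITS option in HBase Shell or the createTable overload that accepts split keys; this works well only if your row keys are in fact spread over that hex keyspace, for instance because they are hashed.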
We will describe how to create a table with precreated regions in this recipe.