Chapter 4. Backing Up and Restoring HBase Data

In this chapter, we will cover:

  • Full shutdown backup using distcp

  • Using CopyTable to copy data from one table to another

  • Exporting an HBase table to dump files on HDFS

  • Restoring HBase data by importing dump files from HDFS

  • Backing up NameNode metadata

  • Backing up region starting keys

  • Cluster replication

Introduction


If you are thinking about using HBase in production, you will probably want to understand the backup options and practices of HBase. The challenge is that the dataset you need to back up might be huge, so the backup solution must be efficient. It is expected to scale to hundreds of terabytes of storage, and to restore the data within a reasonable time frame.

There are two strategies for backing up HBase:

  • Backing it up with a full cluster shutdown

  • Backing it up on a live cluster

A full shutdown backup stops HBase (or disables all tables) first, and then uses Hadoop's distcp command to copy the contents of the HBase directory either to another directory on the same HDFS, or to a different HDFS. To restore from a full shutdown backup, just copy the backed-up files back to the HBase directory using distcp.

There are several approaches for a live cluster backup:

  • Using the CopyTable utility to copy data from one table to another

  • Exporting an HBase table to HDFS files...

Full shutdown backup using distcp


distcp (distributed copy) is a tool provided by Hadoop for copying a large dataset within the same HDFS cluster, or between different HDFS clusters. It uses MapReduce to copy files in parallel, handle errors and recovery, and report the job status.

As HBase stores all of its files, including system files, on HDFS, we can simply use distcp to copy the HBase directory either to another directory on the same HDFS, or to a different HDFS, in order to back up the source HBase cluster.

Note that this is a full shutdown backup solution. The distcp tool works only because the HBase cluster is shut down (or all tables are disabled) and there are no edits to files during the process. Do not use distcp on a live HBase cluster. This solution is therefore for environments that can tolerate a periodic full shutdown of their HBase cluster, for example, a cluster that is used for backend batch processing and does not serve frontend requests.

We will describe how to use distcp to back up a fully shut down HBase...
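
As a quick orientation before the detailed steps, here is a minimal sketch of the backup command, assuming HBase has already been stopped, the HBase root directory is /hbase, and hdfs://master1:8020 and hdfs://backup1:8020 are hypothetical source and backup NameNodes (the host names, ports, and backup path are illustrative, not taken from this recipe):

    $ hadoop distcp hdfs://master1:8020/hbase hdfs://backup1:8020/backup/hbase-bak

To restore, stop HBase again and run the same command with the source and destination reversed, so that the backed-up directory is copied back over the HBase root directory.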

Using CopyTable to copy data from one table to another


CopyTable is a utility to copy the data of one table to another table, either on the same cluster or on a different HBase cluster. You can copy to a table on the same cluster; however, if you have another cluster that you want to treat as a backup, you might want to use CopyTable as a live backup option to copy the data of a table to the backup cluster.

CopyTable is configurable with a start and an end timestamp. If specified, only the data with a timestamp within the specified time frame will be copied. This feature makes incremental backup of an HBase table possible in some situations.

Note

"Incremental backup" is a method to only back up the data that has been changed during the last backup.

Note

Since the cluster keeps running, there is a risk that edits could be missed during the copy process.

In this recipe, we will describe how to use CopyTable to copy the data of a table to another table on a different HBase cluster...
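
As a quick sketch of what the command looks like (the ZooKeeper quorum of the backup cluster is a placeholder; hly_temp is the table used elsewhere in this chapter):

    $ hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
        --peer.adr=backup-zk1,backup-zk2,backup-zk3:2181:/hbase hly_temp

Adding --starttime=<ms> and --endtime=<ms> restricts the copy to the given time frame, which is how the incremental behavior described previously is achieved.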

Exporting an HBase table to dump files on HDFS


The HBase Export utility dumps the contents of a table to dump files on the same HDFS cluster. The dump files are in the Hadoop sequence file format. Exporting data to Hadoop sequence files has merits for data backup, because the Hadoop sequence file format supports several compression types and algorithms. With it, we can choose the compression options that best fit our environment.

Like the CopyTable utility we mentioned in the previous recipe, Export is configurable with a start and an end timestamp, so that only the data within the specified time frame will be dumped. This feature enables Export to incrementally export an HBase table to HDFS.

HBase Export is also a live backup option. As the cluster keeps running, there is a risk that edits could be missed during the export process. In this recipe, we will describe how to use the Export utility to export a table to HDFS on the same cluster. We will introduce the Import utility in the next recipe, which is used to...
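
A minimal sketch of the Export command, assuming the same hly_temp table and the /backup/hly_temp output directory referenced in the next recipe:

    $ hbase org.apache.hadoop.hbase.mapreduce.Export hly_temp /backup/hly_temp

Optional positional arguments (<versions>, <starttime>, and <endtime>) can be appended to dump only a given number of versions within a given time frame.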

Restoring HBase data by importing dump files from HDFS


The HBase Import utility is used to load data that has been exported by the Export utility into an existing HBase table. It is the process used to restore data from a backup made with the Export utility.

We will look at the usage of the Import utility in this recipe.

Getting ready

First, start your HDFS and HBase cluster.

We will import the files that we exported in the previous recipe into our hly_temp table. If you do not have those dump files, refer to the Exporting an HBase table to dump files on HDFS recipe to generate the dump files in advance. We assume the dump files are saved in the /backup/hly_temp directory.

The Import utility uses MapReduce to import data. Add the HBase configuration file (hbase-site.xml) and the dependency JAR files to the Hadoop classpath on your client node.

How to do it...

To import dump files into the hly_temp table:

  1. Connect to your HBase cluster via HBase Shell and create the target table if it does not exist:

    hbase>...
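
As a minimal sketch of how the import is then typically run (the table name and dump directory are those assumed in the Getting ready section), the import itself is a single MapReduce job driven by the Import class:

    $ hbase org.apache.hadoop.hbase.mapreduce.Import hly_temp /backup/hly_temp

Run it from the client node where hbase-site.xml and the dependency JARs are on the Hadoop classpath, and verify the result afterwards with a count or scan from HBase Shell.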

Backing up NameNode metadata


As HBase runs within HDFS, in addition to taking care of the HBase cluster, it is also important to keep your HDFS running in a healthy state. NameNode is the most important component of an HDFS cluster; a NameNode crash makes the entire HDFS cluster inaccessible. The metadata of an HDFS cluster, including the filesystem image and edit log, is managed by NameNode.

We need to protect our NameNode metadata for two situations:

  • Metadata loss in the event of a crash

  • Metadata corruption for any reason

For the first situation, we can set up NameNode to write its metadata to its local disk, along with an NFS mount. As described in the Setting up multiple, highly available (HA) masters recipe in Chapter 1, Setting Up HBase Cluster, we can even set up multiple NameNode nodes to achieve high availability.

Our solution for the second situation is to back up the metadata frequently, so that we can restore the NameNode state in case of metadata corruption.

We will describe...
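
As a minimal sketch of one common approach on Hadoop 1.x-era clusters (the host name and the default HTTP port 50070 are assumptions), the filesystem image and edit log can be fetched from the NameNode's HTTP interface and copied to a safe location:

    $ curl -o fsimage.backup 'http://namenode-host:50070/getimage?getimage=1'
    $ curl -o edits.backup 'http://namenode-host:50070/getimage?getedit=1'

Scheduling such a fetch from cron gives you frequent, restorable copies of the NameNode metadata.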

Backing up region starting keys


Besides the table data in HBase, we should also back up the region starting keys of each table. Region starting keys determine the data distribution in a table, as regions are split by their starting keys. A region is the basic unit for load balancing and metrics gathering in HBase.

There is no need to back up the region starting keys if you are performing full shutdown backups using distcp, because distcp also copies region boundaries to the backup cluster.

But for the live backup options, backing up region starting keys is as important as backing up the table data. This is especially true if your data distribution is difficult to calculate in advance, or if your regions are manually split. It is important because the live backup options, including the CopyTable and Export utilities, use the normal HBase client API to restore data in a MapReduce job. The restore speed can be improved dramatically if we precreate well-split regions before running the restore MapReduce job.

We will...
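
As a minimal sketch of one way to capture region starting keys (not necessarily the script used in this recipe), the region information, including each region's STARTKEY, can be dumped from the .META. table via HBase Shell and saved to a file:

    $ echo "scan '.META.', {COLUMNS => 'info:regioninfo'}" | hbase shell > regioninfo.txt

When restoring, the saved start keys can be used to precreate a well-split table, for example with the SPLITS option of the create command (the column family name f and the split keys shown are placeholders):

    hbase> create 'hly_temp', {NAME => 'f'}, {SPLITS => ['a1', 'b1', 'c1']}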

Cluster replication


HBase supports cluster replication, which is a way to copy data between HBase clusters. For example, it can be used as a way to easily ship edits from a real-time frontend cluster to a batch-processing cluster on the backend.

The basic architecture of HBase replication is very practical. The master cluster captures the write-ahead log (WAL), and puts replicable KeyValues (edits of the column families with replication support) from the log into the replication queue. The replication messages are then sent to the peer cluster and replayed there using its normal HBase client API. The master cluster also keeps the current position of the WAL being replicated in ZooKeeper, for failure recovery.

Because HBase replication is done asynchronously, the clusters participating in the replication can be geographically distant. It is not a problem if the connections between them are offline for some time, as the master cluster will track the replication, and recover...
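
As a minimal sketch of what setting up replication typically involves in this generation of HBase (the peer ZooKeeper quorum and the column family name f are placeholders), hbase.replication must be set to true in hbase-site.xml on both clusters, the peer is added in HBase Shell, and replication is enabled per column family by setting REPLICATION_SCOPE:

    hbase> add_peer '1', 'peer-zk1,peer-zk2,peer-zk3:2181:/hbase'
    hbase> disable 'hly_temp'
    hbase> alter 'hly_temp', {NAME => 'f', REPLICATION_SCOPE => '1'}
    hbase> enable 'hly_temp'

Only edits to column families with REPLICATION_SCOPE set to 1 are shipped to the peer cluster.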
