Chapter 4. Backing Up and Restoring HBase Data
In this chapter, we will cover:
Full shutdown backup using distcp
Using CopyTable to copy data from one table to another
Exporting an HBase table to dump files on HDFS
Restoring HBase data by importing dump files from HDFS
Backing up NameNode metadata
Backing up region starting keys
Cluster replication
If you are thinking about using HBase in production, you will probably want to understand its backup options and practices. The challenge is that the dataset you need to back up might be huge, so the backup solution must be efficient: it is expected to scale to hundreds of terabytes of storage, and to finish restoring the data in a reasonable time frame.
There are two strategies for backing up HBase: a full shutdown backup and a live cluster backup.
A full shutdown backup stops HBase (or disables all tables) first, and then uses Hadoop's distcp command to copy the contents of the HBase directory to either another directory on the same HDFS, or to a different HDFS. To restore from a full shutdown backup, simply copy the backed up files back to the HBase directory, again using distcp.
There are several approaches for a live cluster backup:
Using the CopyTable utility to copy data from one table to another
Exporting an HBase table to dump files on HDFS, and importing the files afterward
HBase cluster replication
Full shutdown backup using distcp
distcp (distributed copy) is a tool provided by Hadoop for copying a large dataset within the same HDFS cluster, or between different clusters. It uses MapReduce to copy files in parallel, handle errors and recovery, and report the job status.
As HBase stores all of its files, including system files, on HDFS, we can simply use distcp to copy the HBase directory to either another directory on the same HDFS, or to a different HDFS, in order to back up the source HBase cluster.
Note that this is a full shutdown backup solution. The distcp tool works only because the HBase cluster has been shut down (or all tables have been disabled), so there are no edits to the files during the copy process. Do not use distcp on a live HBase cluster. This solution is therefore for environments that can tolerate a periodic full shutdown of their HBase cluster; for example, a cluster that is used for backend batch processing and does not serve frontend requests.
We will describe how to use distcp to back up a fully shut down HBase...
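As a sketch of the idea (the hostnames, ports, and paths below are example values, not values from this recipe), a full shutdown backup with distcp might look like the following, run only after HBase has been cleanly stopped:

```shell
#!/bin/sh
# Sketch only: hostnames, ports, and paths are example values.
# Run on a client node after HBase has been fully shut down
# (for example, with $HBASE_HOME/bin/stop-hbase.sh).
SRC="hdfs://namenode1:8020/hbase"                          # HBase root directory
DST="hdfs://namenode2:8020/backup/hbase-$(date +%Y%m%d)"   # backup destination

# Compose the distcp invocation; it is printed here for review --
# on a real cluster you would execute it directly.
CMD="hadoop distcp $SRC $DST"
echo "$CMD"
```

Restoring is the same operation with the source and destination swapped, again with HBase stopped.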
Using CopyTable to copy data from one table to another
CopyTable is a utility to copy the data of one table to another table, either on the same cluster or on a different HBase cluster. You can copy to a table on the same cluster; however, if you have another cluster that you want to treat as a backup, you might want to use CopyTable as a live backup option to copy a table's data to the backup cluster.
CopyTable is configurable with a start and an end timestamp. If specified, only the data with a timestamp within that time frame is copied. This feature makes incremental backup of an HBase table possible in some situations.
Note
"Incremental backup" is a method of backing up only the data that has changed since the last backup.
Note
Since the cluster keeps running, there is a risk that edits could be missed during the copy process.
In this recipe, we will describe how to use CopyTable to copy the data of a table to another one on a different HBase cluster...
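For reference, CopyTable takes the destination cluster's ZooKeeper address through its --peer.adr option; the quorum address and table name below are examples, not values from this recipe:

```shell
#!/bin/sh
# Sketch only: the ZooKeeper quorum, port, znode, and table name are examples.
PEER="backup-zk1,backup-zk2,backup-zk3:2181:/hbase"   # destination cluster
TABLE="hly_temp"

# Copy the table to the peer cluster. --starttime and --endtime
# (milliseconds) can be added to restrict the copy to a time frame,
# which is what makes incremental backups possible.
CMD="hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=$PEER $TABLE"
echo "$CMD"
```

The command is printed here for review; on a real cluster you would execute it directly from a client node with HBase on its classpath.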
Exporting an HBase table to dump files on HDFS
The HBase export utility dumps the contents of a table to dump files on the same HDFS cluster. The dump files are in the Hadoop SequenceFile format. Exporting data to SequenceFiles has merits for data backup, because the format supports several compression types and algorithms, so we can choose the compression options that best fit our environment.
Like the CopyTable utility we mentioned in the previous recipe, export is configurable with a start and an end timestamp, so that only the data within the specified time frame is dumped. This feature enables export to incrementally export an HBase table to HDFS.
HBase export is also a live backup option. As the cluster keeps running, there is a risk that edits could be missed during the export process. In this recipe, we will describe how to use the export utility to export a table to HDFS on the same cluster. We will introduce the import utility in the next recipe, which is used to...
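As a sketch (the table name, output directory, and version count below are examples), an export invocation takes the table name, an HDFS output directory, and optional versions/start/end timestamp arguments:

```shell
#!/bin/sh
# Sketch only: table name, output directory, and version count are examples.
TABLE="hly_temp"
OUTPUT="/backup/hly_temp"   # HDFS directory that will receive the dump files
VERSIONS=1                  # number of cell versions to export

# Optional start and end timestamps (milliseconds) can follow VERSIONS
# to make the export incremental.
CMD="hbase org.apache.hadoop.hbase.mapreduce.Export $TABLE $OUTPUT $VERSIONS"
echo "$CMD"
```

The command is printed for review; run it directly on a client node against a live cluster.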
Restoring HBase data by importing dump files from HDFS
The HBase Import utility is used to load data that has been exported by the Export utility into an existing HBase table. It is the restore counterpart of the Export backup solution.
We will look at the usage of the Import utility in this recipe.
First, start your HDFS and HBase cluster.
We will import the files that we exported in the previous recipe into our hly_temp table. If you do not have those dump files, refer to the Exporting an HBase table to dump files on HDFS recipe to generate the dump files in advance. We assume the dump files are saved in the /backup/hly_temp directory.
The Import utility uses MapReduce to import data. Add the HBase configuration file (hbase-site.xml) and the dependency JAR files to the Hadoop classpath on your client node.
To import dump files into the hly_temp table:
1. Connect to your HBase cluster via HBase Shell and create the target table if it does not exist:
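A sketch of the create and import steps follows. The column family name 'n' is an assumption for illustration (use the families of the table you originally exported), and the dump directory matches the path assumed above:

```shell
#!/bin/sh
# Sketch only: the column family 'n' and the paths are assumptions.
TABLE="hly_temp"
DUMP_DIR="/backup/hly_temp"

# Step 1: create the target table from HBase Shell if it does not exist.
CREATE_CMD="echo \"create '$TABLE', 'n'\" | hbase shell"
# Step 2: run the Import MapReduce job against the dump files.
IMPORT_CMD="hbase org.apache.hadoop.hbase.mapreduce.Import $TABLE $DUMP_DIR"

# Both commands are printed for review; run them on a client node.
echo "$CREATE_CMD"
echo "$IMPORT_CMD"
```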
Backing up NameNode metadata
As HBase runs on HDFS, in addition to taking care of the HBase cluster, it is also important to keep HDFS running in a healthy state. NameNode is the most important component of an HDFS cluster; a NameNode crash makes the entire HDFS cluster inaccessible. The metadata of an HDFS cluster, including the filesystem image and the edit log, is managed by NameNode.
We need to protect our NameNode metadata in two situations:
A failure of the NameNode machine or its disks
Corruption of the metadata itself
For the first situation, we can set up NameNode to write its metadata to its local disk and to an NFS mount at the same time. As described in the Setting up multiple, highly available (HA) masters recipe, in Chapter 1, Setting Up HBase Cluster, we can even set up multiple NameNode nodes to achieve high availability.
Our solution for the second situation is to back up the metadata frequently, so that we can restore the NameNode state in case of metadata corruption.
We will describe...
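As a hedged sketch of the frequent-backup idea: on older Hadoop releases the NameNode web interface (port 50070 by default) serves the current image and edit log through its getimage servlet, so they can be pulled with curl. The hostname, port, servlet parameters, and backup directory below are assumptions; newer Hadoop releases provide hdfs dfsadmin -fetchImage instead, so verify the equivalents for your version:

```shell
#!/bin/sh
# Sketch only: the hostname, port, and getimage servlet parameters are
# assumptions based on older Hadoop releases; newer releases offer
# 'hdfs dfsadmin -fetchImage' instead.
NN="master1:50070"                 # NameNode web interface
BACKUP_DIR="/backup/namenode-meta" # local backup directory

FSIMAGE_CMD="curl -o $BACKUP_DIR/fsimage http://$NN/getimage?getimage=1"
EDITS_CMD="curl -o $BACKUP_DIR/edits http://$NN/getimage?getedit=1"

# Printed for review; schedule the real commands from cron for
# frequent, automated backups.
echo "$FSIMAGE_CMD"
echo "$EDITS_CMD"
```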
Backing up region starting keys
Besides the tables in HBase, we should back up the region starting keys for each table. Region starting keys determine the data distribution in a table, as regions are split by region starting keys. A region is the basic unit for load balancing and metrics gathering in HBase.
There is no need to back up the region starting keys if you are performing full shutdown backups using distcp, because distcp also copies the region boundaries to the backup cluster.
But for the live backup options, backing up region starting keys is as important as backing up the table data. This is especially true if your data distribution is difficult to calculate in advance, or if your regions were split manually. It is important because the live backup options, including the CopyTable and Export utilities, use the normal HBase client API to restore data in a MapReduce job. The restore speed can be improved dramatically if we precreate well-split regions before running the restore MapReduce job.
We will...
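One way to capture the start keys is to scan the catalog table, whose info:regioninfo column contains each region's boundaries. This is a sketch: the table name is an example, and the catalog table is named '.META.' on older HBase releases and 'hbase:meta' on newer ones, so adjust for your version:

```shell
#!/bin/sh
# Sketch only: the table name and catalog table name are assumptions
# ('.META.' on older HBase releases, 'hbase:meta' on newer ones).
TABLE="hly_temp"
META=".META."

# Scan the catalog table's info:regioninfo column, which includes each
# region's start key, and save the output for use before a restore.
CMD="echo \"scan '$META', {COLUMNS => 'info:regioninfo'}\" | hbase shell > /backup/${TABLE}-startkeys.txt"
echo "$CMD"
```

The saved start keys can later be fed to a pre-splitting create statement before running the restore MapReduce job.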
Cluster replication
HBase supports cluster replication, which is a way to copy data between HBase clusters. For example, it can be used to easily ship edits from a real-time frontend cluster to a batch-purpose cluster on the backend.
The basic architecture of HBase replication is very practical. The master cluster captures write-ahead log (WAL) edits, and puts the replicable Key/Values (edits of the column families with replication support) from the log into a replication queue. The replication messages are then sent to the peer cluster, and replayed on that cluster using its normal HBase client API. The master cluster also keeps the current position of the WAL being replicated in ZooKeeper, for failure recovery.
Because the HBase replication is done asynchronously, the clusters participating in the replication can be geographically distant. It is not a problem if the connections between them are offline for some time, as the master cluster will track the replication, and recover...
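As a sketch of how replication is switched on (the property name, peer ID, and ZooKeeper address below are assumptions for the HBase releases that use the hbase.replication flag; check the replication documentation for your version):

```shell
#!/bin/sh
# Sketch only: the property name, peer id, and ZooKeeper quorum address
# are assumptions; verify them against your HBase release.
#
# 1. In hbase-site.xml on BOTH clusters, enable replication:
#      <property>
#        <name>hbase.replication</name>
#        <value>true</value>
#      </property>
#
# 2. Set REPLICATION_SCOPE => 1 on the column families to replicate
#    (via 'alter' in HBase Shell), then register the peer cluster:
PEER_ZK="backup-zk1,backup-zk2,backup-zk3:2181:/hbase"
CMD="echo \"add_peer '1', '$PEER_ZK'\" | hbase shell"
echo "$CMD"
```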