Chapter 4. Backing Up and Restoring HBase Data
In this chapter, we will cover:
Full shutdown backup using distcp
Using CopyTable to copy data from one table to another
Exporting an HBase table to dump files on HDFS
Restoring HBase data by importing dump files from HDFS
Backing up NameNode metadata
Backing up region starting keys
Cluster replication
If you are thinking about using HBase in production, you will probably want to understand its backup options and practices. The challenge is that the dataset you need to back up might be huge, so the backup solution must be efficient: it is expected to scale to hundreds of terabytes of storage, and to finish restoring the data in a reasonable time frame.
There are two strategies for backing up HBase: a full shutdown backup and a live cluster backup.
A full shutdown backup stops HBase (or disables all tables) first, and then uses Hadoop's distcp command to copy the contents of the HBase directory to either another directory on the same HDFS, or to a different HDFS. To restore from a full shutdown backup, simply copy the backed up files back to the HBase directory, again using distcp.
There are several approaches for a live cluster backup:
Using the CopyTable utility to copy data from one table to another
Exporting an HBase table to dump files on HDFS, and importing the files afterward
HBase cluster replication
Full shutdown backup using distcp
distcp (distributed copy) is a tool provided by Hadoop for copying a large dataset within the same HDFS cluster, or between different clusters. It uses MapReduce to copy files in parallel, handle errors and recovery, and report the job status.
As HBase stores all of its files, including system files, on HDFS, we can simply use distcp to copy the HBase directory to either another directory on the same HDFS, or to a different HDFS, in order to back up the source HBase cluster.
Note that this is a full shutdown backup solution. The distcp tool works only because the HBase cluster has been shut down (or all tables have been disabled), so there are no edits to the files during the copy process. Do not use distcp on a live HBase cluster. This solution is therefore for environments that can tolerate a periodic full shutdown of their HBase cluster; for example, a cluster that is used for backend batch processing and does not serve frontend requests.
We will describe how to use distcp to back up a fully shut down HBase...
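As a sketch of the idea (the hostnames, ports, and paths below are example values, not values from this recipe), a full shutdown backup with distcp might look like the following, run only after HBase has been cleanly stopped:

```shell
#!/bin/sh
# Sketch only: hostnames, ports, and paths are example values.
# Run on a client node after HBase has been fully shut down
# (for example, with $HBASE_HOME/bin/stop-hbase.sh).
SRC="hdfs://namenode1:8020/hbase"                          # HBase root directory
DST="hdfs://namenode2:8020/backup/hbase-$(date +%Y%m%d)"   # backup destination

# Compose the distcp invocation; it is printed here for review --
# on a real cluster you would execute it directly.
CMD="hadoop distcp $SRC $DST"
echo "$CMD"
```

Restoring is the same operation with the source and destination swapped, again with HBase stopped.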
Using CopyTable to copy data from one table to another
CopyTable is a utility to copy the data of one table to another table, either on the same cluster or on a different HBase cluster. You can copy to a table on the same cluster; however, if you have another cluster that you want to treat as a backup, you might want to use CopyTable as a live backup option to copy a table's data to the backup cluster.
CopyTable is configurable with a start and an end timestamp. If specified, only the data with a timestamp within that time frame is copied. This feature makes incremental backup of an HBase table possible in some situations.
Note
"Incremental backup" is a method of backing up only the data that has changed since the last backup.
Note
Since the cluster keeps running, there is a risk that edits could be missed during the copy process.
In this recipe, we will describe how to use CopyTable to copy the data of a table to another one on a different HBase cluster...
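For reference, CopyTable takes the destination cluster's ZooKeeper address through its --peer.adr option; the quorum address and table name below are examples, not values from this recipe:

```shell
#!/bin/sh
# Sketch only: the ZooKeeper quorum, port, znode, and table name are examples.
PEER="backup-zk1,backup-zk2,backup-zk3:2181:/hbase"   # destination cluster
TABLE="hly_temp"

# Copy the table to the peer cluster. --starttime and --endtime
# (milliseconds) can be added to restrict the copy to a time frame,
# which is what makes incremental backups possible.
CMD="hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=$PEER $TABLE"
echo "$CMD"
```

The command is printed here for review; on a real cluster you would execute it directly from a client node with HBase on its classpath.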
Exporting an HBase table to dump files on HDFS
The HBase export utility dumps the contents of a table to dump files on the same HDFS cluster. The dump files are in the Hadoop SequenceFile format. Exporting data to SequenceFiles has merits for data backup, because the format supports several compression types and algorithms, so we can choose the compression options that best fit our environment.
Like the CopyTable utility we mentioned in the previous recipe, export is configurable with a start and an end timestamp, so that only the data within the specified time frame is dumped. This feature enables export to incrementally export an HBase table to HDFS.
HBase export is also a live backup option. As the cluster keeps running, there is a risk that edits could be missed during the export process. In this recipe, we will describe how to use the export utility to export a table to HDFS on the same cluster. We will introduce the import utility in the next recipe, which is used to...
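As a sketch (the table name, output directory, and version count below are examples), an export invocation takes the table name, an HDFS output directory, and optional versions/start/end timestamp arguments:

```shell
#!/bin/sh
# Sketch only: table name, output directory, and version count are examples.
TABLE="hly_temp"
OUTPUT="/backup/hly_temp"   # HDFS directory that will receive the dump files
VERSIONS=1                  # number of cell versions to export

# Optional start and end timestamps (milliseconds) can follow VERSIONS
# to make the export incremental.
CMD="hbase org.apache.hadoop.hbase.mapreduce.Export $TABLE $OUTPUT $VERSIONS"
echo "$CMD"
```

The command is printed for review; run it directly on a client node against a live cluster.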
Restoring HBase data by importing dump files from HDFS
The HBase Import utility is used to load data that has been exported by the Export utility into an existing HBase table. It is the restore counterpart of the Export backup solution.
We will look at the usage of the Import utility in this recipe.
First, start your HDFS and HBase cluster.
We will import the files that we exported in the previous recipe into our hly_temp table. If you do not have those dump files, refer to the Exporting an HBase table to dump files on HDFS recipe to generate the dump files in advance. We assume the dump files are saved in the /backup/hly_temp directory.
The Import utility uses MapReduce to import data. Add the HBase configuration file (hbase-site.xml) and the dependency JAR files to the Hadoop classpath on your client node.
To import dump files into the hly_temp table:
1. Connect to your HBase cluster via HBase Shell and create the target table if it does not exist:
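A sketch of the create and import steps follows. The column family name 'n' is an assumption for illustration (use the families of the table you originally exported), and the dump directory matches the path assumed above:

```shell
#!/bin/sh
# Sketch only: the column family 'n' and the paths are assumptions.
TABLE="hly_temp"
DUMP_DIR="/backup/hly_temp"

# Step 1: create the target table from HBase Shell if it does not exist.
CREATE_CMD="echo \"create '$TABLE', 'n'\" | hbase shell"
# Step 2: run the Import MapReduce job against the dump files.
IMPORT_CMD="hbase org.apache.hadoop.hbase.mapreduce.Import $TABLE $DUMP_DIR"

# Both commands are printed for review; run them on a client node.
echo "$CREATE_CMD"
echo "$IMPORT_CMD"
```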
Backing up NameNode metadata
As HBase runs on HDFS, in addition to taking care of the HBase cluster, it is also important to keep HDFS running in a healthy state. NameNode is the most important component of an HDFS cluster; a NameNode crash makes the entire HDFS cluster inaccessible. The metadata of an HDFS cluster, including the filesystem image and the edit log, is managed by NameNode.
We need to protect our NameNode metadata in two situations:
A failure of the NameNode machine or its disks
Corruption of the metadata itself
For the first situation, we can set up NameNode to write its metadata to its local disk and to an NFS mount at the same time. As described in the Setting up multiple, highly available (HA) masters recipe, in Chapter 1, Setting Up HBase Cluster, we can even set up multiple NameNode nodes to achieve high availability.
Our solution for the second situation is to back up the metadata frequently, so that we can restore the NameNode state in case of metadata corruption.
We will describe...
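As a hedged sketch of the frequent-backup idea: on older Hadoop releases the NameNode web interface (port 50070 by default) serves the current image and edit log through its getimage servlet, so they can be pulled with curl. The hostname, port, servlet parameters, and backup directory below are assumptions; newer Hadoop releases provide hdfs dfsadmin -fetchImage instead, so verify the equivalents for your version:

```shell
#!/bin/sh
# Sketch only: the hostname, port, and getimage servlet parameters are
# assumptions based on older Hadoop releases; newer releases offer
# 'hdfs dfsadmin -fetchImage' instead.
NN="master1:50070"                 # NameNode web interface
BACKUP_DIR="/backup/namenode-meta" # local backup directory

FSIMAGE_CMD="curl -o $BACKUP_DIR/fsimage http://$NN/getimage?getimage=1"
EDITS_CMD="curl -o $BACKUP_DIR/edits http://$NN/getimage?getedit=1"

# Printed for review; schedule the real commands from cron for
# frequent, automated backups.
echo "$FSIMAGE_CMD"
echo "$EDITS_CMD"
```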
Backing up region starting keys
Besides the tables in HBase, we should back up the region starting keys for each table. Region starting keys determine the data distribution in a table, as regions are split by region starting keys. A region is the basic unit for load balancing and metrics gathering in HBase.
There is no need to back up the region starting keys if you are performing full shutdown backups using distcp, because distcp also copies the region boundaries to the backup cluster.
But for the live backup options, backing up region starting keys is as important as backing up the table data. This is especially true if your data distribution is difficult to calculate in advance, or if your regions were split manually. It is important because the live backup options, including the CopyTable and Export utilities, use the normal HBase client API to restore data in a MapReduce job. The restore speed can be improved dramatically if we precreate well-split regions before running the restore MapReduce job.
We will...
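One way to capture the start keys is to scan the catalog table, whose info:regioninfo column contains each region's boundaries. This is a sketch: the table name is an example, and the catalog table is named '.META.' on older HBase releases and 'hbase:meta' on newer ones, so adjust for your version:

```shell
#!/bin/sh
# Sketch only: the table name and catalog table name are assumptions
# ('.META.' on older HBase releases, 'hbase:meta' on newer ones).
TABLE="hly_temp"
META=".META."

# Scan the catalog table's info:regioninfo column, which includes each
# region's start key, and save the output for use before a restore.
CMD="echo \"scan '$META', {COLUMNS => 'info:regioninfo'}\" | hbase shell > /backup/${TABLE}-startkeys.txt"
echo "$CMD"
```

The saved start keys can later be fed to a pre-splitting create statement before running the restore MapReduce job.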
Cluster replication
HBase supports cluster replication, which is a way to copy data between HBase clusters. For example, it can be used to easily ship edits from a real-time frontend cluster to a batch-purpose cluster on the backend.
The basic architecture of HBase replication is very practical. The master cluster captures write-ahead log (WAL) edits, and puts the replicable Key/Values (edits of the column families with replication support) from the log into a replication queue. The replication messages are then sent to the peer cluster, and replayed on that cluster using its normal HBase client API. The master cluster also keeps the current position of the WAL being replicated in ZooKeeper, for failure recovery.
Because the HBase replication is done asynchronously, the clusters participating in the replication can be geographically distant. It is not a problem if the connections between them are offline for some time, as the master cluster will track the replication, and recover...
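As a sketch of how replication is switched on (the property name, peer ID, and ZooKeeper address below are assumptions for the HBase releases that use the hbase.replication flag; check the replication documentation for your version):

```shell
#!/bin/sh
# Sketch only: the property name, peer id, and ZooKeeper quorum address
# are assumptions; verify them against your HBase release.
#
# 1. In hbase-site.xml on BOTH clusters, enable replication:
#      <property>
#        <name>hbase.replication</name>
#        <value>true</value>
#      </property>
#
# 2. Set REPLICATION_SCOPE => 1 on the column families to replicate
#    (via 'alter' in HBase Shell), then register the peer cluster:
PEER_ZK="backup-zk1,backup-zk2,backup-zk3:2181:/hbase"
CMD="echo \"add_peer '1', '$PEER_ZK'\" | hbase shell"
echo "$CMD"
```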