Chapter 9. HBase Administration
In this chapter, we will cover the following recipes:
Setting up single node HBase cluster
Setting up multi-node HBase cluster
Inserting data into HBase
Integration with Hive
HBase administration commands
HBase backup and restore
Tuning HBase
HBase upgrade
Migrating data from MySQL to HBase using Sqoop
Apache HBase is a non-relational, distributed, scalable key-value data store. It provides random, real-time read/write access to data stored on HDFS.
In this chapter, we will configure the various modes of the HBase cluster. In simple terms, HBase is a Hadoop database built on column families and designed for massive scale. The important thing to note is that having column families does not make it column oriented. There is a common misconception where many refer to HBase as a column-oriented database even though it is not; furthermore, a column-oriented database is not necessarily a NoSQL database.
In this chapter, we will cover the HBase cluster configuration, backup, restore, and upgrade processes.
Setting up single node HBase cluster
In this recipe, we will see how to set up an HBase single node cluster and its components. Apache HBase is based on a master-slave architecture, with an HBase master and slaves known as region servers. The HBase master can be co-located with the Namenode, but it is recommended to run it on a dedicated node. The region servers run on the Datanodes.
In this recipe, we are just setting up a single node HBase cluster with the HBase master, a region server running on a single node with Namenode, and Datanode daemons.
Before going through the recipes in this chapter, make sure you have gone through the steps to install the Hadoop cluster with HDFS and YARN enabled. We need a single node for this recipe, so make sure you choose a node with decent configuration.
We are using a standalone ZooKeeper in this recipe, or you can point HBase to the ZooKeeper ensemble already configured in the Hive with ZooKeeper recipe in Chapter 7, Data Ingestion...
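A minimal hbase-site.xml for a single node setup might look like the following sketch; the Namenode port and the exact property values are assumptions and should match your environment:

```xml
<configuration>
  <!-- Store HBase data on the local HDFS instance; the port is an assumption -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master1.cyrus.com:9000/hbase</value>
  </property>
  <!-- Single node, but still run the master and region server as separate daemons -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Standalone ZooKeeper running on the same node -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master1.cyrus.com</value>
  </property>
</configuration>
```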
Setting up multi-node HBase cluster
In this recipe, we will configure an HBase fully distributed cluster with ZooKeeper quorum formed by three nodes. This is the recommended configuration for the production environment.
The user is expected to complete the previous recipe and must have completed the recipes for setting up Hive with ZooKeeper. In this recipe, we will be using the already configured ZooKeeper ensemble.
Connect to the master1.cyrus.com master node in the cluster and change to the user hadoop.
Stop any daemons running from the previous HBase recipe.
Make sure the HBase package is downloaded, extracted, and the environment variables are set up as discussed in the previous recipe.
To confirm the setup, execute the commands as shown in the following screenshot:
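The screenshot is not reproduced here; a sketch of typical confirmation commands, assuming the environment variables from the earlier recipe, is:

```shell
echo $HBASE_HOME    # should print the HBase install directory
hbase version       # prints the HBase version banner
jps                 # Namenode and Datanode daemons should be listed
```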
Edit the hbase-site.xml file, as shown here:
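A fully distributed hbase-site.xml typically carries at least the following properties. The hostnames follow the naming convention used in this book, but the Namenode port and the exact quorum membership are assumptions; use the three nodes of your configured ensemble:

```xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master1.cyrus.com:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- The existing three-node ZooKeeper ensemble from the Hive recipe -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master1.cyrus.com,master2.cyrus.com,edge1.cyrus.com</value>
  </property>
</configuration>
```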
Inserting data into HBase
In this recipe, we will insert data into HBase and see how it is stored. The syntax for loading data is not similar to SQL; there are no insert or select statements. To insert data, we use put, and to read data, we use scan or get.
Before going through the recipe in this section, make sure you have completed the previous recipe, Setting up multi-node HBase cluster.
Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop.
Connect to the HBase shell using the hbase shell command. You can use the shell in interactive mode or script it.
Create a table as shown in the following screenshot:
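The create statement was shown as a screenshot; the table named test is used later in this chapter, while the column family name cf1 is an assumption:

```
create 'test', 'cf1'
```

In HBase, only the table name and column families are declared up front; individual columns are created on the fly with each put.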
Insert data using the commands shown in the following screenshot:
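Inserts along these lines match the screenshot's intent; the row keys, qualifiers, and values are hypothetical:

```
put 'test', 'row1', 'cf1:name', 'alice'
put 'test', 'row1', 'cf1:city', 'london'
put 'test', 'row2', 'cf1:name', 'bob'
```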
You can list the tables and scan a table, as shown in the following screenshot:
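A sketch of the listing and scanning commands, assuming the hypothetical row key row1:

```
list                 # show all user tables
scan 'test'          # full table scan, prints every cell with its timestamp
get 'test', 'row1'   # fetch a single row by key
```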
Commands can be passed in non-interactive mode, as shown in the following screenshot:
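Non-interactive use can take either of the following forms; the script path is hypothetical:

```shell
# Pipe a command into the shell...
echo "list" | hbase shell

# ...or keep commands in a file and pass it as an argument
hbase shell /tmp/hbase_commands.txt
```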
We can use the...
Integration with Hive
In this recipe, we will look at how we can integrate Hive with HBase and use Hive to perform all the data operations.
You will have realized from the previous recipe that it gets cumbersome to perform queries using just the native HBase commands.
Before going through the recipe, you must have completed the Hive metastore using MySQL recipe in Chapter 7, Data Ingestion and Workflow, and the Setting up multi-node HBase cluster recipe.
Connect to the edge1.cyrus.com edge node in the cluster and switch to the user hadoop.
We will create an external Hive table and point it to HBase using the ZooKeeper ensemble.
Create a table in HBase if it is not there already, as shown next:
Connect either using a hive or beeline client and map by creating a table, as shown next...
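A mapping table can be created with the HBase storage handler, along the lines of the following sketch; the Hive table and column names are illustrative, and the column family cf1 is an assumption:

```sql
CREATE EXTERNAL TABLE hbase_test (rowkey STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:name')
TBLPROPERTIES ('hbase.table.name' = 'test');
```

Once mapped, ordinary HiveQL SELECT statements read through to the underlying HBase table.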
HBase administration commands
In this recipe, we will look at HBase administration commands, which are very useful for troubleshooting and managing the cluster.
As an HBase administrator, one needs to perform backup, recovery, troubleshooting, tuning, and many other complex tasks. It is good to know these commands in order to make informed decisions.
To complete the recipe, you must have a running HBase cluster and must have completed the Setting up multi-node HBase cluster recipe.
Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop. Note that we can connect from any node in the cluster or use an HBase client for connections.
Connect to the HBase shell prompt and execute the commands shown in the next few steps to get familiar with HBase:
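The commands shown in the screenshots are not reproduced here; a representative set of administration commands, run against the hypothetical table test, is:

```
status               # summary of live/dead servers and regions
status 'detailed'    # per-region-server detail
version
whoami
list                 # all user tables
describe 'test'      # schema and column family settings
balancer             # trigger a region rebalance
flush 'test'         # flush the memstore to HFiles
major_compact 'test' # force a major compaction
```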
We can drop a column family or disable a table as shown next. Do not execute these on a production database, unless...
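These operations are destructive, so treat them with care; a sketch against the hypothetical table test and column family cf1:

```
disable 'test'
alter 'test', 'delete' => 'cf1'   # remove a column family and its data
drop 'test'                       # only works on a disabled table
```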
HBase backup and restore
In this recipe, we will look at HBase backup and restore. We discussed the importance of backups, despite having HA, in the Namenode high availability section. We will look at ways to take snapshots and restore them.
For this recipe, you must have completed the Setting up multi-node HBase cluster recipe and have a basic understanding of the backup principles.
Connect to the master1.cyrus.com HBase master node and switch to the user hadoop.
Execute the following command to take a snapshot of a particular table. In this case, the table name is test:
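From the HBase shell, the snapshot command takes the table name and a snapshot name; the snapshot name test_snap1 here is an assumption:

```
snapshot 'test', 'test_snap1'
```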
To list the snapshots, connect to the HBase shell as shown in the following screenshot:
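The listing command itself is:

```
list_snapshots
```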
We can restore the snapshot using the following command; note that the table must be disabled for the restore:
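A restore sequence looks like the following sketch; the snapshot name test_snap1 is an assumption:

```
disable 'test'
restore_snapshot 'test_snap1'
enable 'test'
```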
To clone the table, we can restore it to a new table. This is a good way of testing...
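Cloning restores a snapshot into a brand new table without touching the original; the snapshot and clone names here are assumptions:

```
clone_snapshot 'test_snap1', 'test_clone'
```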
Tuning HBase
In this recipe, we will look at HBase tuning and some things to keep in mind. This is not an exhaustive list, but it is a good starting point.
You are recommended to read through Chapter 8, Performance Tuning, before going through the tuning aspects of the HBase cluster. It will give you a better insight into tuning the operating system, network, disk, and so on.
Make sure that you have completed the Setting up multi-node HBase cluster recipe for this section and understand the basic Linux commands.
Connect to the master1.cyrus.com master node and switch to the user hadoop.
Edit the hbase-env.sh file and add the following lines to it to tune the heap as per the workload:
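The heap lines typically look like the following; the sizes are placeholders and must be derived from your workload and available memory:

```shell
# Heap sizing is workload dependent; the figures below are placeholders
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xms2g -Xmx2g"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms4g -Xmx4g"
```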
We can tune the Java GC algorithm. The default is mark-and-sweep; if GC times are high, this should be changed to Parallel GC:
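In hbase-env.sh, that change could be expressed along these lines:

```shell
# Switch the HBase JVMs to the parallel collector
export HBASE_OPTS="$HBASE_OPTS -XX:+UseParallelGC"
```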
If we are using Java 8, then the PermSize option...
HBase upgrade
In this recipe, we will cover how to upgrade to the latest stable release, which at the time of writing is version 1.2.5.
In any organization, it is important to keep the cluster patched and updated to the latest release to address bug fixes and issues.
But upgrading is not always that easy, as it may involve downtime, and one version might not support the old metadata structure. An example of this is the old HFile v1 format.
For this recipe, make sure you have an HBase cluster running and that you understand the regions and its communication.
Connect to the master1.cyrus.com master node and switch to the user hadoop.
Before performing any upgrades, it is important to verify that we have a backup in place and that the cluster is in a consistent state.
Download the latest HBase release, as we did initially, and update the symlink to point to the new version, as shown here:
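A sketch of the download and symlink update, assuming the install directory is /opt and the symlink is /opt/hbase:

```shell
cd /opt
# Download and unpack the new release from the Apache archive
wget https://archive.apache.org/dist/hbase/1.2.5/hbase-1.2.5-bin.tar.gz
tar -xzf hbase-1.2.5-bin.tar.gz
# Repoint the symlink so $HBASE_HOME picks up the new version
ln -sfn /opt/hbase-1.2.5 /opt/hbase
```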
Do the same on all the nodes in the...
Migrating data from MySQL to HBase using Sqoop
In this recipe, we will cover how to migrate data from MySQL to HBase. This could be a very common use case in any organization that has been using an RDBMS and wants to move to HBase.
An important thing to understand here is that the intent of migration is not a replacement for the traditional RDBMS system, but to complement it.
To do this operation, we will be using Sqoop to import data from the RDBMS to Hadoop. The destination could be the HDFS filesystem, Hive, or HBase.
Before going through the recipe, you must have completed the Hive metastore using MySQL recipe in Chapter 7, Data Ingestion and Workflow, and the Setting up multi-node HBase cluster recipe. Make sure the services for HBase, YARN, and HDFS are running as shown in the following screenshot:
Connect to the edge1.cyrus.com edge node and switch to the user hadoop.
Although you can do these steps on any node, all clients are installed on the edge nodes.
Firstly, we...
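A Sqoop import into HBase typically looks like the following; the database, table, user, and key column names are hypothetical:

```shell
sqoop import \
  --connect jdbc:mysql://master1.cyrus.com/testdb \
  --username sqoop -P \
  --table employees \
  --hbase-table employees \
  --column-family cf1 \
  --hbase-row-key id \
  --hbase-create-table \
  -m 1
```

The --hbase-create-table flag asks Sqoop to create the target HBase table if it does not already exist, and -m controls the number of parallel map tasks.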