Chapter 9. HBase Administration
In this chapter, we will cover the following recipes:
Setting up single node HBase cluster
Setting up multi-node HBase cluster
Inserting data into HBase
Integration with Hive
HBase administration commands
HBase backup and restore
Tuning HBase
HBase upgrade
Migrating data from MySQL to HBase using Sqoop
Apache HBase is a non-relational, distributed, scalable key-value data store. It provides random, real-time read/write access to data stored on HDFS.
In this chapter, we will configure the various modes of the HBase cluster. In simple terms, HBase is a Hadoop database built on column families and designed for massive scale. The important thing to note is that having column families does not make it column oriented. There is a common misconception where many refer to HBase as a column-oriented database even though it is not; furthermore, a column-oriented database is not necessarily a NoSQL database.
In this chapter, we will cover the HBase cluster configuration, backup, restore, and upgrade processes.
Setting up single node HBase cluster
In this recipe, we will see how to set up an HBase single node cluster and its components. Apache HBase is based on a master-slave architecture, with an HBase master and slaves known as region servers. The HBase master can be co-located with the Namenode, but it is recommended to run it on a dedicated node. The region servers run on the Datanodes.
In this recipe, we are just setting up a single node HBase cluster with the HBase master, a region server running on a single node with Namenode, and Datanode daemons.
Before going through the recipes in this chapter, make sure you have gone through the steps to install the Hadoop cluster with HDFS and YARN enabled. We need a single node for this recipe, so make sure you choose a node with decent configuration.
We are using a standalone ZooKeeper in this recipe, or you can point HBase to the ZooKeeper ensemble already configured in the Hive with ZooKeeper recipe in Chapter 7, Data Ingestion...
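A minimal hbase-site.xml for a single node setup might look like the following sketch; the Namenode port and the exact property values are assumptions and should match your environment:

```xml
<configuration>
  <!-- Store HBase data on the local HDFS instance; the port is an assumption -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master1.cyrus.com:9000/hbase</value>
  </property>
  <!-- Single node, but still run the master and region server as separate daemons -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Standalone ZooKeeper running on the same node -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master1.cyrus.com</value>
  </property>
</configuration>
```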
Setting up multi-node HBase cluster
In this recipe, we will configure an HBase fully distributed cluster with ZooKeeper quorum formed by three nodes. This is the recommended configuration for the production environment.
The user is expected to complete the previous recipe and must have completed the recipes for setting up Hive with ZooKeeper. In this recipe, we will be using the already configured ZooKeeper ensemble.
Connect to the master1.cyrus.com master node in the cluster and change to the user hadoop.
Stop any daemons running from the previous HBase recipe.
Make sure the HBase package is downloaded, extracted, and the environment variables are set up as discussed in the previous recipe.
To confirm the setup, execute the commands as shown in the following screenshot:
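The screenshot is not reproduced here; a sketch of typical confirmation commands, assuming the environment variables from the earlier recipe, is:

```shell
echo $HBASE_HOME    # should print the HBase install directory
hbase version       # prints the HBase version banner
jps                 # Namenode and Datanode daemons should be listed
```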
Edit the hbase-site.xml file, as shown here:
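A fully distributed hbase-site.xml typically carries at least the following properties. The hostnames follow the naming convention used in this book, but the Namenode port and the exact quorum membership are assumptions; use the three nodes of your configured ensemble:

```xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master1.cyrus.com:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- The existing three-node ZooKeeper ensemble from the Hive recipe -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master1.cyrus.com,master2.cyrus.com,edge1.cyrus.com</value>
  </property>
</configuration>
```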
Inserting data into HBase
In this recipe, we will insert data into HBase and see how it is stored. The syntax for loading data is not similar to SQL; there are no insert or select statements. To insert data, we use put, and to read data, we use scan or get.
Before going through the recipe in this section, make sure you have completed the previous recipe, Setting up multi-node HBase cluster.
Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop.
Connect to the HBase shell using the hbase shell command. You can use the shell in interactive mode or script it.
Create a table as shown in the following screenshot:
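The create statement was shown as a screenshot; the table named test is used later in this chapter, while the column family name cf1 is an assumption:

```
create 'test', 'cf1'
```

In HBase, only the table name and column families are declared up front; individual columns are created on the fly with each put.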
Insert data using the commands shown in the following screenshot:
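Inserts along these lines match the screenshot's intent; the row keys, qualifiers, and values are hypothetical:

```
put 'test', 'row1', 'cf1:name', 'alice'
put 'test', 'row1', 'cf1:city', 'london'
put 'test', 'row2', 'cf1:name', 'bob'
```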
You can list the tables and scan a table, as shown in the following screenshot:
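A sketch of the listing and scanning commands, assuming the hypothetical row key row1:

```
list                 # show all user tables
scan 'test'          # full table scan, prints every cell with its timestamp
get 'test', 'row1'   # fetch a single row by key
```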
Commands can be passed in non-interactive mode, as shown in the following screenshot:
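Non-interactive use can take either of the following forms; the script path is hypothetical:

```shell
# Pipe a command into the shell...
echo "list" | hbase shell

# ...or keep commands in a file and pass it as an argument
hbase shell /tmp/hbase_commands.txt
```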
We can use the...
Integration with Hive
In this recipe, we will look at how we can integrate Hive with HBase and use Hive to perform all the data operations.
You will have realized from the previous recipe that it gets cumbersome to perform queries using just the native HBase commands.
Before going through the recipe, you must have completed the Hive metastore using MySQL recipe in Chapter 7, Data Ingestion and Workflow, and the Setting up multi-node HBase cluster recipe.
Connect to the edge1.cyrus.com edge node in the cluster and switch to the user hadoop.
We will create an external Hive table and point it to HBase using the ZooKeeper ensemble.
Create a table in HBase if it is not there already, as shown next:
Connect either using a hive or beeline client and map by creating a table, as shown next...
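A mapping table can be created with the HBase storage handler, along the lines of the following sketch; the Hive table and column names are illustrative, and the column family cf1 is an assumption:

```sql
CREATE EXTERNAL TABLE hbase_test (rowkey STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf1:name')
TBLPROPERTIES ('hbase.table.name' = 'test');
```

Once mapped, ordinary HiveQL SELECT statements read through to the underlying HBase table.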
HBase administration commands
In this recipe, we will look at HBase administration commands, which are very useful for troubleshooting and managing the cluster.
As an HBase administrator, one needs to perform backup, recovery, troubleshooting, tuning, and many other complex tasks. It is good to know these commands in order to make informed decisions.
To complete the recipe, you must have a running HBase cluster and must have completed the Setting up multi-node HBase cluster recipe.
Connect to the master1.cyrus.com master node in the cluster and switch to the user hadoop. Note that we can connect from any node in the cluster or use an HBase client for connections.
Connect to the HBase shell prompt and execute the commands shown in the next few steps to get familiar with HBase:
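The commands shown in the screenshots are not reproduced here; a representative set of administration commands, run against the hypothetical table test, is:

```
status               # summary of live/dead servers and regions
status 'detailed'    # per-region-server detail
version
whoami
list                 # all user tables
describe 'test'      # schema and column family settings
balancer             # trigger a region rebalance
flush 'test'         # flush the memstore to HFiles
major_compact 'test' # force a major compaction
```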
We can drop a column family or disable a table as shown next. Do not execute these on a production database, unless...
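These operations are destructive, so treat them with care; a sketch against the hypothetical table test and column family cf1:

```
disable 'test'
alter 'test', 'delete' => 'cf1'   # remove a column family and its data
drop 'test'                       # only works on a disabled table
```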
HBase backup and restore
In this recipe, we will look at HBase backup and restore. We discussed the importance of backups, despite having HA, in the Namenode high availability section. We will look at ways to take snapshots and restore them.
For this recipe, you must have completed the Setting up multi-node HBase cluster recipe and have a basic understanding of the backup principles.
Connect to the master1.cyrus.com HBase master node and switch to the user hadoop.
Execute the following command to take a snapshot of a particular table. In this case, the table name is test:
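From the HBase shell, the snapshot command takes the table name and a snapshot name; the snapshot name test_snap1 here is an assumption:

```
snapshot 'test', 'test_snap1'
```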
To list the snapshots, connect to the HBase shell as shown in the following screenshot:
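The listing command itself is:

```
list_snapshots
```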
We can restore the snapshot using the following command; note that the table must be disabled for the restore:
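A restore sequence looks like the following sketch; the snapshot name test_snap1 is an assumption:

```
disable 'test'
restore_snapshot 'test_snap1'
enable 'test'
```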
To clone the table, we can restore it to a new table. This is a good way of testing...
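Cloning restores a snapshot into a brand new table without touching the original; the snapshot and clone names here are assumptions:

```
clone_snapshot 'test_snap1', 'test_clone'
```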
Tuning HBase
In this recipe, we will look at HBase tuning and some things to keep in mind. This is not an exhaustive list, but it is a good starting point.
You are recommended to read through Chapter 8, Performance Tuning, before going through the tuning aspects of the HBase cluster. It will give you a better insight into tuning the operating system, network, disk, and so on.
Make sure that you have completed the Setting up multi-node HBase cluster recipe for this section and understand the basic Linux commands.
Connect to the master1.cyrus.com master node and switch to the user hadoop.
Edit the hbase-env.sh file and add the following lines to it to tune the heap as per the workload:
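The heap lines typically look like the following; the sizes are placeholders and must be derived from your workload and available memory:

```shell
# Heap sizing is workload dependent; the figures below are placeholders
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xms2g -Xmx2g"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms4g -Xmx4g"
```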
We can tune the Java GC algorithm. The default is mark-and-sweep; if GC times are high, this should be changed to Parallel GC:
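In hbase-env.sh, that change could be expressed along these lines:

```shell
# Switch the HBase JVMs to the parallel collector
export HBASE_OPTS="$HBASE_OPTS -XX:+UseParallelGC"
```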
If we are using Java 8, then the PermSize option...
HBase upgrade
In this recipe, we will cover how to upgrade to the latest stable release, which at the time of writing is version 1.2.5.
In any organization, it is important to keep the cluster patched and updated to the latest release to address bug fixes and issues.
But upgrading is not always that easy, as it may involve downtime, and one version might not support the old metadata structure. An example of this is the old HFile v1 format.
For this recipe, make sure you have an HBase cluster running and that you understand the regions and its communication.
Connect to the master1.cyrus.com master node and switch to the user hadoop.
Before performing any upgrades, it is important to verify that we have a backup in place and that the cluster is in a consistent state.
Download the latest HBase release, as we did initially, and update the symlink to point to the new version, as shown here:
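A sketch of the download and symlink update, assuming the install directory is /opt and the symlink is /opt/hbase:

```shell
cd /opt
# Download and unpack the new release from the Apache archive
wget https://archive.apache.org/dist/hbase/1.2.5/hbase-1.2.5-bin.tar.gz
tar -xzf hbase-1.2.5-bin.tar.gz
# Repoint the symlink so $HBASE_HOME picks up the new version
ln -sfn /opt/hbase-1.2.5 /opt/hbase
```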
Do the same on all the nodes in the...
Migrating data from MySQL to HBase using Sqoop
In this recipe, we will cover how to migrate data from MySQL to HBase. This could be a very common use case in any organization that has been using an RDBMS and wants to move to HBase.
An important thing to understand here is that the intent of migration is not a replacement for the traditional RDBMS system, but to complement it.
To do this operation, we will be using Sqoop to import data from the RDBMS to Hadoop. The destination could be the HDFS filesystem, Hive, or HBase.
Before going through the recipe, you must have completed the Hive metastore using MySQL recipe in Chapter 7, Data Ingestion and Workflow, and the Setting up multi-node HBase cluster recipe. Make sure the services for HBase, YARN, and HDFS are running as shown in the following screenshot:
Connect to the edge1.cyrus.com edge node and switch to the user hadoop.
Although you can do these steps on any node, all clients are installed on the edge nodes.
Firstly, we...
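A Sqoop import into HBase typically looks like the following; the database, table, user, and key column names are hypothetical:

```shell
sqoop import \
  --connect jdbc:mysql://master1.cyrus.com/testdb \
  --username sqoop -P \
  --table employees \
  --hbase-table employees \
  --column-family cf1 \
  --hbase-row-key id \
  --hbase-create-table \
  -m 1
```

The --hbase-create-table flag asks Sqoop to create the target HBase table if it does not already exist, and -m controls the number of parallel map tasks.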