Configuring HBase

In this article by Ruchir Choudhry, the author of the book HBase High Performance Cookbook, we will cover the configuration and deployment of HBase.

Introduction

HBase is an open source, nonrelational, column-oriented distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of the Hadoop Distributed File System (HDFS), a fault-tolerant distributed filesystem, providing Bigtable-like capabilities for Hadoop. In addition, it provides advanced features such as auto sharding, load balancing, in-memory caching, replication, compression, near real-time lookups, strong consistency (using multiversions), block caches and bloom filters for real-time queries, and an array of client APIs.

Throughout this article, we will discuss how to effectively set up mid-size and large HBase clusters on top of the Hadoop and HDFS framework. It will help you set up HBase on a fully distributed cluster.

For the cluster setup, we will consider six nodes running Red Hat Enterprise Linux 6.2 (redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux).

Configuration and Deployment

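Before we start HBase in fully distributed mode, we will first set up Hadoop-2.4.0 in distributed mode and then set up HBase on top of the Hadoop cluster, because HBase stores its data in the Hadoop Distributed File System (HDFS). The paths in the listings that follow refer to both hadoop-2.2.0 and hadoop-2.4.0 directories; use the directory that matches the version you installed.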

Check the permissions of the users; the user that runs HBase must be able to create directories.

Let's create two directories in which the data for NameNode and DataNode will reside:

drwxrwxr-x 2 app app 4096 Jun 19 22:22 NameNodeData
drwxrwxr-x 2 app app 4096 Jun 19 22:22 DataNodeData
-bash-4.1$ pwd
/u/HbaseB/hadoop-2.4.0
-bash-4.1$ ls -lh
total 60K
drwxr-xr-x 2 app app 4.0K Mar 31 08:49 bin
drwxrwxr-x 2 app app 4.0K Jun 19 22:22 DataNodeData

drwxr-xr-x 3 app app 4.0K Mar 31 08:49 etc
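
The two directories in the listing could be created along the following lines. This is only a minimal sketch: BASE_DIR is a hypothetical variable (not part of the original setup) that defaults to a scratch directory, so the commands can be tried without touching a real installation. On the cluster, it would point to the Hadoop directory shown by pwd above.

```shell
#!/usr/bin/env bash
# Sketch: create the NameNode and DataNode data directories.
# BASE_DIR is an assumption for illustration; it defaults to a temporary
# directory so that nothing under /u is modified.
BASE_DIR="${BASE_DIR:-$(mktemp -d)}"

mkdir -p "$BASE_DIR/NameNodeData" "$BASE_DIR/DataNodeData"

# 775 corresponds to drwxrwxr-x, the mode shown in the listing.
chmod 775 "$BASE_DIR/NameNodeData" "$BASE_DIR/DataNodeData"

ls -ld "$BASE_DIR/NameNodeData" "$BASE_DIR/DataNodeData"
```

Running it with BASE_DIR=/u/HbaseB/hadoop-2.4.0 as the app user would target the directory used in this article.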

Getting Ready

The following are the steps to install and configure HBase:

  1. Choose a Hadoop cluster.
  2. Get the hardware details required for it.
  3. Get the software required to perform the setup.
  4. Get the OS required to do the setup.
  5. Perform the configuration steps.

We will require the following components for NameNode:

Components | Details | Type of systems
-----------|---------|----------------
Operating system | redhat-6.2 Linux 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 GNU/Linux, or another standard Linux kernel. |
Hardware/CPUs | 16 to 24 CPU cores. | NameNode/Secondary NameNode.
Hardware/RAM | 64 to 128 GB. In special cases, 128 GB to 512 GB RAM. | NameNode/Secondary NameNode.
Hardware/storage | Both NameNode servers should have highly reliable storage for their namespace storage and edit log journaling. Typically, hardware RAID and/or reliable network storage are justifiable options. Note that you should include an onsite disk replacement option in your support contract so that a failed RAID disk can be replaced quickly. | NameNode/Secondary NameNode.

RAID: RAID stands for Redundant Array of Inexpensive (or Independent) Disks. There are many RAID levels, but for the master or NameNode, RAID-1 (mirroring) will be enough.

JBOD: This stands for Just a Bunch of Disks. The design is to have multiple hard drives grouped together with no redundancy, so the calling software needs to take care of failure and redundancy. In essence, it works as a single logical volume.
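
To make the capacity trade-off concrete, here is a small arithmetic sketch. The function names and the disk sizes are made up for illustration. It assumes identical disks, that all disks in the RAID-1 set mirror one another, and that JBOD simply adds the disks up.

```shell
#!/usr/bin/env bash
# Usable capacity for a set of identical disks (illustrative helpers only).

# RAID-1: every disk holds a full copy, so usable capacity equals one disk.
raid1_usable_gb() { local disks=$1 size_gb=$2; echo "$size_gb"; }

# JBOD: no redundancy, so usable capacity is the sum of all disks.
jbod_usable_gb() { local disks=$1 size_gb=$2; echo $(( disks * size_gb )); }

echo "RAID-1, 2 x 1000 GB: $(raid1_usable_gb 2 1000) GB usable, tolerates 1 disk failure"
echo "JBOD,   2 x 1000 GB: $(jbod_usable_gb 2 1000) GB usable, no redundancy at the disk level"
```

This is why RAID-1 suits the NameNode, where losing the namespace is costly, while the extra capacity of JBOD relies on the software above it to handle disk failures.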

The following screenshot shows the working mechanism of RAID and JBOD:

Before we start the cluster setup, a quick recap of the Hadoop setup, with brief descriptions, is essential.

How to do it…

Let's create a directory that will hold all the software components to be downloaded:

  1. For simplicity, let's take this as /u/HbaseB.
  2. Create different users for different purposes. The format will be user/group; this is required to differentiate the roles used for specific purposes:
    • HDFS/Hadoop: This is for the handling of Hadoop-related setups
    • Yarn/Hadoop: This is for Yarn-related setups
    • HBase/Hadoop
    • Pig/Hadoop
    • Hive/Hadoop
    • Zookeeper/Hadoop
    • HCat/Hadoop
  3. Set up directories for the Hadoop cluster: let's assume /u as a shared mount point; we can create specific directories, which will be used for specific purposes:
    -bash-4.1$ ls -ltr
    total 32
    drwxr-xr-x 9 app app 4096 Oct 7 2013 hadoop-2.2.0
    drwxr-xr-x 10 app app 4096 Feb 20 10:58 zookeeper-3.4.6
    drwxr-xr-x 15 app app 4096 Apr 5 08:44 pig-0.12.1
    drwxrwxr-x 7 app app 4096 Jun 30 00:57 hbase-0.98.3-hadoop2
    drwxrwxr-x 8 app app 4096 Jun 30 00:59 apache-hive-0.13.1-bin
    drwxrwxr-x 7 app app 4096 Jun 30 01:04 mahout-distribution-0.9

    Make sure that you have adequate privileges in the folder to add, edit, and execute a command. Also, you must set up password-less communication between different machines, such as from the NameNode to the DataNodes and from the HBase Master to all the region server nodes. Refer to this webpage to learn how to do this: http://www.debian-administration.org/article/152/Password-less_logins_with_OpenSSH.
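
As a rough illustration of the password-less setup, the following dry run only prints the OpenSSH commands that would typically be used; it does not execute them. The host names are placeholders, and the app user is assumed from the directory listings above.

```shell
#!/usr/bin/env bash
# Dry run: print the commands that would set up key-based SSH from this
# node to each worker. SSH_USER and HOSTS are placeholders for illustration.
SSH_USER="app"
HOSTS="datanode1 datanode2 regionserver1"

print_ssh_setup() {
  # One key pair for the current user, with an empty passphrase (-N "").
  echo "ssh-keygen -t rsa -b 4096 -N \"\" -f \$HOME/.ssh/id_rsa"
  for host in $HOSTS; do
    # ssh-copy-id appends the public key to the remote authorized_keys file.
    echo "ssh-copy-id -i \$HOME/.ssh/id_rsa.pub ${SSH_USER}@${host}"
  done
}

print_ssh_setup
```

Once the printed commands have been run on the NameNode and the HBase Master, ssh to each listed host should no longer prompt for a password.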

  4. Let's assume there is a /u directory and you have downloaded the entire stack of software under it. Go to /u/HbaseB/hadoop-2.2.0/etc/hadoop/ and look for the core-site.xml file.
  5. Place the following lines in this file:
    <configuration>
    <property>
       <name>fs.default.name</name>
       <value>hdfs://mynamenode-hadoop:9001</value>
       <description>The name of the default file system.
       </description>
    </property>
    </configuration>
    You can specify any port that you want to use, as long as it does not clash with ports that are already in use by the system for other purposes. Complete detail on this topic is out of the scope of this book; for a quick overview, refer to http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers.
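
The same file can also be generated from a couple of variables, which keeps the host and port in one place. The sketch below is an assumption-laden illustration: CONF_DIR is a hypothetical variable that defaults to a scratch directory, and the port check only verifies that the number lies in the unprivileged range; it does not detect whether another process already uses the port.

```shell
#!/usr/bin/env bash
# Sketch: write core-site.xml from variables (values taken from this step).
NAMENODE_HOST="mynamenode-hadoop"
NAMENODE_PORT=9001
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"   # hypothetical; would be etc/hadoop on a real node

# Basic sanity check: stay within the unprivileged port range.
if [ "$NAMENODE_PORT" -lt 1024 ] || [ "$NAMENODE_PORT" -gt 65535 ]; then
  echo "port $NAMENODE_PORT is outside 1024-65535" >&2
  exit 1
fi

cat > "$CONF_DIR/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${NAMENODE_HOST}:${NAMENODE_PORT}</value>
    <description>The name of the default file system.</description>
  </property>
</configuration>
EOF

# Show the value that was written.
grep '<value>' "$CONF_DIR/core-site.xml"
```

Hadoop 2 still accepts fs.default.name but reports it as deprecated in favor of fs.defaultFS, so either name may appear in configurations you encounter.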
  6. Save the file. This setting identifies the master/NameNode for the cluster.
  7. Now let's move on to set up the secondary NameNode. In /u/HbaseB/hadoop-2.4.0/etc/hadoop/, look for the core-site.xml file and add the following:
    <configuration>
    <property>
       <name>fs.checkpoint.dir</name>
       <value>/u/dn001/hadoop/hdf/secdn,/u/dn002/hadoop/hdfs/secdn</value>
       <description>A comma separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR.
       For example, /u/dn001/hadoop/hdf/secdn,/u/dn002/hadoop/hdfs/secdn
       </description>
    </property>
    </configuration>
    Keeping these directories separate gives a clean separation of the HDFS block storage and keeps the configuration as simple as possible. It also makes maintenance easier.
  8. Now let's change the setup for HDFS; the file location will be /u/HbaseB/hadoop-2.4.0/etc/hadoop/hdfs-site.xml. For NameNode:
    <property>
       <name>dfs.name.dir</name>
       <value>/u/nn01/hadoop/hdfs/nn,/u/nn02/hadoop/hdfs/nn</value>
       <description>Comma separated list of paths. Use the list of directories.</description>
    </property>
    For DataNode:
    <property>
       <name>dfs.data.dir</name>
       <value>/u/dnn01/hadoop/hdfs/dn,/u/dnn02/hadoop/hdfs/dn</value>
       <description>Comma separated list of paths. Use the list of directories.</description>
    </property>
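
Because these values are comma-separated lists, every directory in the list has to exist on the node. The following sketch splits such a value and creates each entry. ROOT is a hypothetical prefix, introduced only so that the example writes into a scratch area instead of the real /u mount.

```shell
#!/usr/bin/env bash
# Sketch: split a comma-separated dfs.name.dir / dfs.data.dir value and
# create each directory. ROOT is an assumption for illustration.
ROOT="${ROOT:-$(mktemp -d)}"
DATA_DIRS="/u/dnn01/hadoop/hdfs/dn,/u/dnn02/hadoop/hdfs/dn"

create_dirs() {
  local list=$1 dir
  local IFS=','            # split the unquoted list on commas only
  for dir in $list; do
    mkdir -p "${ROOT}${dir}"
    echo "created ${ROOT}${dir}"
  done
}

create_dirs "$DATA_DIRS"
```

On a real DataNode, ROOT would be empty and the script would be run by the HDFS user so that the directories end up with the right owner.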
  9. Now let's set the address at which NameNode can be reached using the HTTP protocol:
    <property>
       <name>dfs.http.address</name>
       <value>namenode.full.hostname:50070</value>
       <description>Enter your NameNode hostname for HTTP access.</description>
    </property>
  10. The HTTP address for the secondary NameNode is as follows:
    <property>
       <name>dfs.secondary.http.address</name>
       <value>
           secondary.namenode.full.hostname:50090
       </value>
       <description>
           Enter your Secondary NameNode hostname.
       </description>
    </property>

    We can go for an HTTPS setup for NameNode as well, but let's keep this optional for now.

  11. Now let's look at the Yarn setup in the /u/HbaseB/hadoop-2.2.0/etc/hadoop/yarn-site.xml file:
    • For the resource tracker that's a part of the Yarn resource manager, add the following property:
      <property>
         <name>yarn.resourcemanager.resource-tracker.address</name>
         <value>yarnresourcemanager.full.hostname:8025</value>
         <description>Enter your Yarn ResourceManager hostname.</description>
      </property>
    • For the scheduler that's part of the Yarn resource manager, add the following property:
      <property>
         <name>yarn.resourcemanager.scheduler.address</name>
         <value>resourcemanager.full.hostname:8030</value>
         <description>Enter your ResourceManager hostname.</description>
      </property>
    • For the ResourceManager address, add the following property:
      <property>
         <name>yarn.resourcemanager.address</name>
         <value>resourcemanager.full.hostname:8050</value>
         <description>Enter your ResourceManager hostname.</description>
      </property>
    • For the ResourceManager admin address, add the following property:
      <property>
         <name>yarn.resourcemanager.admin.address</name>
         <value>resourcemanager.full.hostname:8041</value>
         <description>Enter your ResourceManager hostname.</description>
      </property>
    • To set up the local directory, add the following property:
      <property>
         <name>yarn.nodemanager.local-dirs</name>
         <value>/u/dnn01/hadoop/hdfs/yarn,/u/dnn02/hadoop/hdfs/yarn</value>
         <description>Comma separated list of paths.</description>
      </property>
    • To set up the log location, add the following property:
      <property>
         <name>yarn.nodemanager.log-dirs</name>
         <value>/u/var/log/hadoop/yarn</value>
         <description>Use the list of directories from $YARN_LOG_DIR.</description>
      </property>

      This completes the configuration changes required for Yarn.
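
Since several ResourceManager services share one host, it can be worth confirming that no port is assigned twice. The helper below is a hypothetical sketch, not part of Hadoop; it takes name:port pairs (here the ports from this step) and prints any port that appears more than once.

```shell
#!/usr/bin/env bash
# name:port pairs taken from the yarn-site.xml values in this step.
YARN_PORTS="resource-tracker:8025 scheduler:8030 resourcemanager:8050 admin:8041"

# Print every port number that occurs more than once in a list of
# space-separated name:port pairs (illustrative helper).
duplicate_ports() {
  printf '%s\n' $1 | cut -d: -f2 | sort | uniq -d
}

dups=$(duplicate_ports "$YARN_PORTS")
if [ -z "$dups" ]; then
  echo "no port is used twice"
else
  echo "ports used more than once: $dups"
fi
```

The check says nothing about ports taken by other software on the host; it only catches clashes within the list it is given.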

  12. Now let's make the changes for MapReduce. Open /u/HbaseB/hadoop-2.2.0/etc/hadoop/mapred-site.xml.
  13. Place the following configuration in mapred-site.xml, between <configuration> and </configuration>:
    <property>
    <name>mapreduce.jobhistory.address</name>
    <value>jobhistoryserver.full.hostname:10020</value>
    <description>Enter your JobHistoryServer hostname.</description>
    </property>
  14. Once we have configured MapReduce, we can move on to configuring HBase. Let's go to the /u/HbaseB/hbase-0.98.3-hadoop2/conf path and open the hbase-site.xml file.
  15. You will see a template that has <configuration></configuration>.
  16. We need to add the following lines between the starting and ending tags:
    <property>
       <name>hbase.rootdir</name>
       <value>hdfs://hbase.namenode.full.hostname:8020/apps/hbase/data</value>
       <description>Enter the HBase NameNode server hostname</description>
    </property>
    <property>
       <!-- this is for the binding address -->
       <name>hbase.master.info.bindAddress</name>
       <value>$hbase.master.full.hostname</value>
       <description>Enter the HBase Master server hostname</description>
    </property>

    The host and port in hbase.rootdir must match the NameNode address set through fs.default.name in core-site.xml. This completes the HBase changes.

  17. ZooKeeper: Now let's focus on the setup of ZooKeeper. In a distributed environment, go to the /u/HbaseB/zookeeper-3.4.6/conf location, rename zoo_sample.cfg to zoo.cfg, and list the ZooKeeper servers as follows (ZooKeeper expects each entry in the form server.N=host:port:port):
    server.1=zoo1:2888:3888
    server.2=zoo2:2888:3888
    If you want to test this setup locally, use different port combinations.

    ZooKeeper keeps its servers in sync through atomic broadcast, an atomic messaging system that provides reliable delivery, total order, causal order, and so on.
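
For orientation, a complete zoo.cfg for such an ensemble could be written as in the sketch below. The timing values and the client port are the defaults shipped in zoo_sample.cfg. ZK_CONF_DIR, ZK_DATA_DIR, and MY_ID are hypothetical variables that default to scratch locations. Each server additionally needs a myid file in its data directory containing the N from its own server.N line.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal zoo.cfg and the matching myid file.
# The directories default to temporary locations (assumption for illustration).
ZK_CONF_DIR="${ZK_CONF_DIR:-$(mktemp -d)}"
ZK_DATA_DIR="${ZK_DATA_DIR:-$(mktemp -d)}"
MY_ID=1   # must match the N of this host's server.N entry

cat > "$ZK_CONF_DIR/zoo.cfg" <<EOF
tickTime=2000
initLimit=10
syncLimit=5
dataDir=${ZK_DATA_DIR}
clientPort=2181
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
EOF

# The myid file tells this server which server.N line describes it.
echo "$MY_ID" > "$ZK_DATA_DIR/myid"

cat "$ZK_CONF_DIR/zoo.cfg"
```

In each server entry, the first port (2888) is used by followers to connect to the leader and the second (3888) for leader election. Production ensembles usually use an odd number of servers so that a majority survives a single failure.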

  18. Region servers: Before concluding, let's go through the region server setup. Go to the /u/HbaseB/hbase-0.98.3-hadoop2/conf folder and edit the regionservers file. Specify the region servers accordingly:
    RegionServer1
    RegionServer2
    RegionServer3
    RegionServer4
  19. Copy all the configuration files of HBase and ZooKeeper to the respective hosts dedicated to HBase and ZooKeeper.
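
One way to approach the copy is to loop over the hosts in the regionservers file. The following dry run only prints the scp commands rather than executing them. It assumes the app user from the earlier listings and the same installation path on every host, and LIST_FILE stands in for the real regionservers file.

```shell
#!/usr/bin/env bash
# Dry run: print one scp command per host listed in a regionservers file.
HBASE_HOME="/u/HbaseB/hbase-0.98.3-hadoop2"
SSH_USER="app"            # placeholder account
LIST_FILE="$(mktemp)"     # stands in for $HBASE_HOME/conf/regionservers
printf '%s\n' RegionServer1 RegionServer2 RegionServer3 RegionServer4 > "$LIST_FILE"

print_copy_commands() {
  while read -r host; do
    [ -n "$host" ] || continue   # skip blank lines
    echo "scp -r ${HBASE_HOME}/conf ${SSH_USER}@${host}:${HBASE_HOME}/"
  done < "$1"
}

print_copy_commands "$LIST_FILE"
```

With password-less SSH in place, piping the output to sh would perform the actual copy; the ZooKeeper conf directory can be distributed in the same way.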
  20. Let's quickly validate the setup that we worked on:
    sudo su $HDFS_USER
    /u/HbaseB/hadoop-2.2.0/bin/hadoop namenode -format
    /u/HbaseB/hadoop-2.4.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode
  21. Now let's go to the secondary NameNode:
    sudo su $HDFS_USER
    /u/HbaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode
  22. Now let's perform the same steps for DataNode:
    sudo su $HDFS_USER
    /u/HbaseB/hadoop-2.2.0/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode
    Test 01>
    See if you can reach the NameNode from your browser:
    http://namenode.full.hostname:50070
    
    Test 02> sudo su $HDFS_USER
    /u/HbaseB/hadoop-2.2.0/bin/hadoop dfs -copyFromLocal /tmp/hello.txt
    /u/HbaseB/hadoop-2.2.0/bin/hadoop dfs -ls
    You must see hello.txt once the command executes.
    
    Test 03> Browse
    http://datanode.full.hostname:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/&nnaddr=$datanode.full.hostname:8020
    You should see the details on the DataNode.
  23. Validate the Yarn and MapReduce setup by following these steps:
    1. Execute the command from Resource Manager:
      <login as $YARN_USER and source the directories.sh companion script>
      /u/HbaseB/hadoop-2.2.0/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
    2. Execute the command from Node Manager:
      <login as $YARN_USER and source the directories.sh companion script>
      /usr/lib/hadoop-yarn/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
    3. Execute the following commands:
      hadoop fs -mkdir /app-logs
      hadoop fs -chown $YARN_USER /app-logs
      hadoop fs -chmod 1777 /app-logs
      Execute MapReduce:
      sudo su $HDFS_USER
      /u/HbaseB/hadoop-2.2.0/bin/hadoop fs -mkdir -p /mapred/history/done_intermediate
      /u/HbaseB/hadoop-2.2.0/bin/hadoop fs -chmod -R 1777 /mapred/history/done_intermediate
      /u/HbaseB/hadoop-2.2.0/bin/hadoop fs -mkdir -p /mapred/history/done
      /u/HbaseB/hadoop-2.2.0/bin/hadoop fs -chmod -R 1777 /mapred/history/done
      /u/HbaseB/hadoop-2.2.0/bin/hadoop fs -chown -R mapred /mapred
      
      export HADOOP_LIBEXEC_DIR=/u/HbaseB/hadoop-2.2.0/libexec/
      export HADOOP_MAPRED_HOME=/u/HbaseB/hadoop-2.2.0/hadoop-mapreduce
      export HADOOP_MAPRED_LOG_DIR=/u/HbaseB/hadoop-2.2.0/mapred
    4. Start the jobhistory server:
      <login as $MAPRED_USER and source the directories.sh companion script>
      /u/HbaseB/hadoop-2.2.0/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
      
      Test 01: Use the following link from a browser or with curl:
      http://resourcemanager.full.hostname:8088/
      Test 02:
      sudo su $HDFS_USER
      /u/HbaseB/hadoop-2.2.0/bin/hadoop jar /u/HbaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar teragen 100 /test/10gsort/input
      /u/HbaseB/hadoop-2.2.0/bin/hadoop jar /u/HbaseB/hadoop-2.2.0/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.2.1-alpha.jar
  24. Validate the HBase setup:
    • Log in as $HDFS_USER:
      /u/HbaseB/hadoop-2.2.0/bin/hadoop fs -mkdir -p /apps/hbase
      /u/HbaseB/hadoop-2.2.0/bin/hadoop fs -chown -R $HBASE_USER /apps/hbase
      chown requires an owner argument; $HBASE_USER is assumed here.
    • Now log in as $HBASE_USER:
      /u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start master
      This will start the master node.
    • Now let's move to the HBase region server nodes:
      /u/HbaseB/hbase-0.98.3-hadoop2/bin/hbase-daemon.sh --config $HBASE_CONF_DIR start regionserver
      This will start the region server.

      On a single machine, sudo ./hbase master start can also be used directly. Please check the logs in case of any errors.

      Now let's log in using:

      sudo su - $HBASE_USER
      ./hbase shell

      This will connect us to the HBase master.

  25. Validate the ZooKeeper setup:
    -bash-4.1$ sudo ./zkServer.sh start
    JMX enabled by default
    Using config: /u/HbaseB/zookeeper-3.4.6/bin/../conf/zoo.cfg
    Starting zookeeper ... STARTED
    
    You can also redirect the output to the ZooKeeper logs:
    /u/logs//u/HbaseB/zookeeper-3.4.6/zoo.out 2>&1

Summary

In this article, we learned how to configure and set up HBase. We set up HBase to store data in Hadoop Distributed File System.

We explored the working structure of RAID and JBOD and the differences between the two storage configurations.
