HBase Design Patterns

By Mark Kerzner, Sujee Maniyam

About this book

With the increasing use of NoSQL in general and HBase in particular, knowing how to build practical applications depends on the application of design patterns. These patterns, distilled from extensive practical experience of multiple demanding projects, guarantee the correctness and scalability of the HBase application. They are also generally applicable to most NoSQL databases.

Starting with the basics, this book will show you how to install HBase in different node settings. You will then be introduced to key generation and management and the storage of large files in HBase. Moving on, this book will delve into the principles of using time-based data in HBase, and show you some cases on denormalization of data while working with HBase. Finally, you will learn how to translate the familiar SQL design practices into the NoSQL world. With this concise guide, you will get a better idea of typical storage patterns and application design templates, learn how to explore HBase in multiple scenarios with minimum effort, and see how to read data from multiple region servers.

Publication date: December 2014
Publisher: Packt
Pages: 150
ISBN: 9781783981045

 

Chapter 1. Starting Out with HBase

We have already explained the advantages of HBase in the preface of this book, where we also dealt with the history and motivation behind HBase. In this chapter, we will make you comfortable with HBase installation. We will teach you how to do the following:

  • Create the simplest possible, but functional, HBase environment in a minute

  • Build a more realistic one-machine cluster

  • Be in full control with clusters of any size

Now it is time for our first cartoon; we will have one for every chapter. This is you, the reader, eager to know the secrets of HBase. With this picture, we understand that learning is a never-ending process.

The simplest way to get HBase up and running takes only a few minutes, and it is so easy that it will almost feel like a cheat, but it's not—you will have a complete HBase development environment.

However, we cannot leave you there. If all you know is building toy clusters, even your most sophisticated HBase design will not command full authority. A pilot needs to know his plane inside out. That is why we will continue with larger and larger cluster installs until you are able to store the planet's data.

Now follow these steps:

  1. Download Kiji from http://www.kiji.org/.

  2. Source their environment as per the instructions.

  3. Start the cluster.

  4. Download Apache Phoenix from http://phoenix.apache.org/download.html, copy its JAR file into the HBase lib directory, and restart the cluster.

Presto! You now have a full HBase development environment that will last you until the end of the book. But remember, to be taken seriously, you need to read till the end and build your clusters.
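To make these steps concrete, here is a minimal sketch of such a session. The archive names, paths, and the bento commands are assumptions based on the Kiji BentoBox and Phoenix documentation, so check them against the versions you actually download:

tar xzf kiji-bento-*.tar.gz               # unpack the downloaded Kiji BentoBox
cd kiji-bento-*
source bin/kiji-env.sh                    # put the bundled tools on your PATH
bento start                               # start the bundled HBase mini-cluster

cp phoenix-*.jar <bento-hbase-lib-dir>/   # step 4: Phoenix JAR into the HBase lib directory
bento stop                                # restart so HBase picks up Phoenix
bento start

A quick way to confirm that the environment is alive is to open the HBase shell (hbase shell) and run status.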

 

Installing HBase


The first thing that you need to do when learning to deal with HBase is to install it. Moreover, if you are in any way serious about using HBase, you will want to run it not on a single machine, but on a cluster.

DevOps (development and operations) is a modern trend in software development that stresses communication, collaboration, and integration between software developers and information technology professionals. It blurs the distinction between a developer and an administrator.

You will have to live up to this movement in person, by doing what is called Build Your Own Clusters (BYOC). Developers will do well to learn what administrators know, at least to some degree, and the same is true of administrators learning what developers do.

Why is that? This is primarily because efficiency is the name of the game. The reason you choose a NoSQL solution in the first place is efficiency. Your design decisions will all revolve around efficiency: the design of the keys, the tables, and the column families. In fact, with NoSQL, you are, for the first time, given the means of reasoning about the efficiency of your solutions while doing the logical layout of your data, to quote from the Google paper on Bigtable.

So, yes, you will be working on single-node clusters, but you will always be checking your solutions on larger ones. Still, we need to start with baby steps.

Note

A warning

I will show you the work in every detail. I often teach courses, and I have noticed how often users get stuck on minor details. This is usually because of miscommunication. Once you have built your first hundred clusters, it all seems so easy that you forget how hard it was in the beginning, and you tend to skip the details.

Creating a single-node HBase cluster

The easiest way to install a single-node HBase cluster starts by going to the HBase Apache website at http://hbase.apache.org/.

Here is what you see there:

The following steps need to be followed:

  1. Click on Downloads.

  2. Choose the recommended mirror, or another one if you have a preference, and you will be taken to the following page:

  3. Choose the latest distribution and download it.

  4. Verify the MD5 hash (do not ignore good software development practices); a brief example follows these steps.

  5. Unzip the distribution and change your current directory to the location of the unzipped file as follows:

    user@example.org:~/Downloads$ gunzip hbase-0.96.1.1-hadoop1-bin.tar.gz
    user@example.org:~/Downloads$ tar xf hbase-0.96.1.1-hadoop1-bin.tar
    user@example.org:~/Downloads$ cd hbase-0.96.1.1-hadoop1/
    
  6. Now, start HBase as follows:

    $ ./bin/start-hbase.sh
    starting Master, logging to logs/hbase-user-master-example.org.out
    

That's it! You are good to go.
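As promised in step 4, here is what the checksum verification can look like. This is only a sketch: whether the checksum is published as a sibling .md5 file or listed on the download page varies, so take the expected value from the mirror you actually used:

md5sum hbase-0.96.1.1-hadoop1-bin.tar.gz      # compute the MD5 of the downloaded archive

Compare the printed value with the published one; they must match exactly. And when you are done experimenting with the local instance, ./bin/stop-hbase.sh shuts it down again.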

When starting HBase, you might get the following message:

+======================================================================+
|      Error: JAVA_HOME is not set and Java could not be found         |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|       > http://java.sun.com/javase/downloads/ <                      |
|                                                                      |
| HBase requires Java 1.6 or later.                                    |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
+======================================================================+

If this happens to you, set JAVA_HOME in the file conf/hbase-env.sh; for example, as follows:

export JAVA_HOME=/usr/lib/jvm/j2sdk1.6-oracle

Otherwise, you can set the JAVA_HOME variable in your environment.
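If you are not sure where your JDK lives, a common trick on Linux is to follow the java symlink back to its installation directory. This is only a sketch; it assumes a conventional layout with the java binary sitting under a bin/ directory:

readlink -f "$(which java)"                 # prints something like /usr/lib/jvm/<jdk>/bin/java
export JAVA_HOME=$(readlink -f "$(which java)" | sed 's:/bin/java::')
echo $JAVA_HOME                             # should now point at the JDK directory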

Verify your HBase install. Nothing much can go wrong at this stage, so running the HBase shell, as follows, should be enough:

user@example.org:~$ hbase shell
13/12/29 14:47:12 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.6

hbase(main):001:0> status
1 servers, 0 dead, 11.0000 average load

hbase(main):002:0> list
TABLE                                                                           
EntityGroup                                                                     
mark_users                                                                      
mytable                                                                         
wordcount                                                                       
4 row(s) in 0.0700 seconds

hbase(main):003:0> exit
user@example.org:~$ 

Unlike in this output, you should have no tables at this stage; I just showed you mine.
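If you want a slightly stronger check than status and list, a short smoke test such as the following creates a table, writes one cell, reads it back, and cleans up after itself. The table and column family names are just examples:

hbase(main):001:0> create 'smoke_test', 'cf'
hbase(main):002:0> put 'smoke_test', 'row1', 'cf:greeting', 'hello'
hbase(main):003:0> scan 'smoke_test'
hbase(main):004:0> disable 'smoke_test'
hbase(main):005:0> drop 'smoke_test'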

Now we are ready to go for the real stuff—building a cluster.

Creating a distributed HBase cluster

I am planning to use Cloudera Manager (CM) for this, and I will build it on Amazon Web Services (AWS) using the Elastic Compute Cloud (EC2) web service.

Granted, there are other distributions, and there is also your own hardware. However, let me tell you what your own hardware could cost: $26,243.70. This is how much it cost me.

That's just for three machines, which is barely enough for Hadoop work, and with HBase, you might as well double the memory requirements. In the long run, owning your hardware is better than renting, as is the case in most own-versus-rent scenarios. However, you might prefer to rent Amazon machines, at a fraction of a dollar per hour and with only a few minutes of provisioning time.

Now follow these steps:

  1. Go to the AWS website at http://aws.amazon.com/console/.

  2. Once you are there, log in to the AWS console.

  3. Navigate to EC2, the Amazon web service.

  4. Launch your first instance as follows:

  5. Choose Ubuntu Server 12.04.3 LTS with long-term support.

    Choosing Ubuntu Server 12.04.3 LTS with long-term support

Why choose the relatively old version? That is because with Hadoop and HBase, there is no shame in sticking to old versions. There is a good reason for that. The burn-in period for Hadoop is years, running on thousands of machines. So, although you, as a developer, will always prefer the latest and greatest, check yourself. As the wisdom goes, "Who is strong? He who controls his own desires".

There is also another good reason to choose the older version of Ubuntu Server. Most of the Hadoop testing is done on somewhat older versions of the servers. Put yourself in the Hadoop developers' shoes; would you test on a long-term support (five years) server first, or on the latest and greatest, which promises to put your data in the cloud and to connect you with every social network in the world?

That is why you will have less trouble with the older versions of the OS. I learnt this the hard way.

OK, so now you are convinced and are ready to start your first instance.

 

Selecting an instance


We start by choosing the m1.xlarge instance type.

Now, you may wonder how much this is going to cost you. First of all, let me give you a convenient table that summarizes AWS costs. Without it, you would have to browse the Amazon services for quite a while. Follow the link at http://www.ec2instances.info/; the table you find there is quite useful:

The AWS costs table

Secondly, in the words of a well-known Houston personality called Mac (or the Mattress Mac)—"I am going to save you money!" (to know more, follow this link at http://tinyurl.com/kf5vhcg). Do not click on the Start the instance button just yet.

Spot instances

In addition to an on-demand instance, Amazon features what are called spot instances. These are machine hours traded on the market, often at one-tenth the price. So, when you are ready to launch your instances, check the Request spot instances option, just as I have done in the following screenshot:

Here are my savings—the m1.xlarge instance costs 48 cents an hour when purchased on demand, but its current market price is about 5 cents, about 10 times cheaper. I am setting the maximum offered price at 20 cents, and that means I will be paying the fluctuating market price, starting from 5 cents and possibly up, but not more than 20 cents per hour.

I do not delude myself; I know that the big fish (that is, the big EC2-based companies) are hunting for those savings, and they apply sophisticated trading techniques, which sometimes result in strange pricing, exceeding the on-demand pricing. Be careful with the maximum limit you set. But for our limited purposes of practicing cluster construction, we should swim under the belly of the big fish and be just fine.
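By the way, you do not have to click through the console for this. Here is a sketch of the same spot request made with the AWS command-line interface; it assumes the AWS CLI is installed and configured, and the AMI ID and key name are placeholders you must replace with your own:

aws ec2 request-spot-instances \
    --spot-price "0.20" \
    --instance-count 1 \
    --launch-specification '{
        "ImageId": "<ubuntu-12.04-ami-id>",
        "InstanceType": "m1.xlarge",
        "KeyName": "<your-key-name>",
        "SecurityGroups": ["hadoop"]
    }'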

 

Adding storage


The next step is selecting the storage. The Ubuntu image comes with 8 GB of root drive, and that is too little for anything; choose 30 GB for now. Remember that each 1 GB costs 5 cents per month at current prices, so for hours, and even days, that is negligible.

By now, you might be asking yourself, how does Mark know all this? I will give you the two references now, and I will also repeat them at the end of the chapter (for the benefit of those who like to peek at the end). As I have already told you, I run Hadoop/HBase training, and our labs are all open source. Please have a look at https://github.com/hadoop-illuminated/HI-labs for more details. More specifically, in the admin labs, in the Managers section (and that means Hadoop, not human, managers), you will find the instructions in brief (https://github.com/hadoop-illuminated/HI-labs/tree/master/hadoop-admin/managers/cloudera-cm). In turn, it refers to the Cloudera blog post found at http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/. However, none of these instructions are as complete as this chapter, so save them for future use.

 

Security groups


Now it is time to set up a security group. Here is my hadoop security group. Please note that all servers within this group can communicate with each other on every port. For the outside ports, I have opened those that are required by Cloudera Manager, and by the Hadoop UI for HDFS and MapReduce. Here I am selecting this group for the Cloudera Manager instance that I will be using to install the rest of the cluster:

This is my hadoop security group:

Tip

Don't let Cloudera Manager create a group for you—it is better to create it yourself and keep using it.
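For reference, here is roughly what that security group amounts to when created from the command line instead of the console. This is a sketch with assumed names; the essential pieces are the rule that lets members of the group reach each other on every port and the externally visible Cloudera Manager port 7180:

aws ec2 create-security-group --group-name hadoop --description "Hadoop/HBase cluster"

# let members of the hadoop group reach each other on all protocols and ports
aws ec2 authorize-security-group-ingress --group-name hadoop --protocol -1 --source-group hadoop

# open the Cloudera Manager UI to the outside world
aws ec2 authorize-security-group-ingress --group-name hadoop --protocol tcp --port 7180 --cidr 0.0.0.0/0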

 

Starting the instance


Now, we add the final touch, as shown in the following screenshot:

Launching the instance

Choose the key pair. Again, don't let CM create the key for you. Ask Amazon to create one, and store the private key in a secure place.

Hold on, we are almost done. Now, let's start 10 more instances that will be used for cluster construction. There are two reasons why I start them myself rather than asking CM to start them for me. Firstly, it saves money: I will start spot instances, whereas CM can only start on-demand ones. Secondly, it gives me better control: if something does not work, I can see it much better than CM can.

You are familiar by now with most of the steps, except that this time, I am starting 10 instances at once, saving a lot of money in the process.


These 10 instances will be the workhorses of the cluster, so I will give them enough root space, that is, 100 GB. The CM is smart enough to get the ephemeral storage (about 5 TB) and make it a part of the HDFS. The result will approximately be a 5-TB cluster for one dollar per hour. Here are all of these pending requests:

A few minutes later, here they all are again, with spot requests fulfilled and servers running.

Fulfilled requests and running servers

Now comes your part: building the cluster. Remember, so far Amazon has been working for you; you just provided the right foundation.

Now, log in to the CM machine as follows:

ssh -i .ssh/<your-key-here.pem> ubuntu@<cm-url>

The key is what you saved when EC2 created the key pair for you, and <cm-url> is the URL of the server where you run the Cloudera Manager. Note that I carefully assign the servers their names. Soon, you will have many servers running, and if you don't mark them, it will get confusing. Now, start the install using the following command:

wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

CM will take you through a series of screens, and you will have to accept a couple of licenses. There are no choices and no gotchas here, so I am showing only one intermediate screen:

After this is done, give it a minute to start the web server. Then go to <cm-url>:7180. In my case, this looks as follows:

Log in with both Username and Password as admin. Accept the free license and continue to the Hosts screen. Now comes probably the most important selection: get the private DNS for every host in the cluster and put it into the Check Hosts window.
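Collecting the private DNS names by hand becomes tedious with 11 machines. If you have the AWS CLI configured, a sketch like the following pulls them all at once, and you can paste the output straight into the Check Hosts window:

# list the private DNS name of every running instance, one per line
aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].PrivateDnsName" \
    --output text | tr '\t' '\n'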

One last note, and then I will let you kids go play with your cluster and fly solo. Why is it so important to choose the internal IP, also called the private DNS? Firstly, because you won't be charged for every request. Normally, you get charged for every request and transfer, but for internal transfers, this charge is zero, that is, free, nada! Secondly, recall that in our security group, all servers are allowed to communicate with all other servers on all ports, so you won't have any problems setting up your clusters, regardless of which ports the Hadoop services decide to communicate on. If you don't use the private DNS names, the install will fail on the next step. However, if everything is correct, you will get this happy screen:

Give it the right username (in my case, it is ubuntu) and the right key on the next screen. I can rely on you to do it right, as these are the same username and key that you used to log in to the Cloudera Manager server. If you could do that, you can do this as well.

Tip

Don't leave your monitor unattended; keep clicking at the right times. If you don't, the CM session will time out and you won't be able to restart the install. All the work will be lost; you will have to shut all the servers down and restart them. You've been warned, so get your coffee ready before you start!

It is not uncommon for some servers to fail to start. This is normal in clusters and in a networked environment. CM will drop the servers that fail to start for any reason and continue with what it has.

Tip

As a wise man said, "Who is rich? One who is happy with what he has."

On one of the next screens, do not forget to request HBase as part of the real-time delivery. There is no deep meaning to this, just marketing, as it is you who will provide the actual real-time delivery with HBase and your code.

Finally, enjoy your new cluster, kick the tires, look around, try to look at every service that is installed, analyze each individual host, and so on. You can always come back home by clicking on the Home or Cloudera button at the top-left of the screen.

Log in to the cluster. Any of the 11 servers, including CM, is good for that, because the Gateway service is installed on each one of them. In my case, the login command looks as follows:

ubuntu@<cm-url>:~$ ssh -i .ssh/shmsoft-hadoop.pem ubuntu@<node-url>

Once there, I can look around HDFS as follows:

ubuntu@<node-url>:~$ hdfs dfs -ls /
Found 3 items
drwxr-xr-x   - hbase hbase               0 2014-12-30 03:41 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2014-12-30 03:45 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2014-12-30 03:43 /user

However, if you try to create your home directory, it won't work:

hdfs dfs -mkdir /user/ubuntu
mkdir: Permission denied: user=ubuntu, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

To fix this, you need to do the following (as described at https://github.com/hadoop-illuminated/HI-labs/tree/master/hadoop-admin/managers/cloudera-cm):

ubuntu@<node-url>:~$ sudo -u hdfs hdfs dfs -mkdir /user/ubuntu
ubuntu@<node-url>:~$ sudo -u hdfs hdfs dfs -chown ubuntu /user/ubuntu
ubuntu@<node-url>:~$ hdfs dfs -mkdir /user/ubuntu/mark

Now you have your home directory and, inside it, a directory of your own (mark in my case, so that I can see it):

ubuntu@<node-url>:~$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - ubuntu supergroup          0 2014-12-30 04:03 mark

Moreover, I can even put files there. For example, I can put my install file in mark/, as follows:

hdfs dfs -put cloudera-manager-installer.bin mark/

And, lo and behold, I can see that file:

hdfs dfs -ls mark
Found 1 items
-rw-r--r--   3 ubuntu supergroup     501703 2014-12-30 04:04 mark/cloudera-manager-installer.bin

Now, two last tricks of the trade. The first is to view the HDFS UI; you can open it in the browser or, with w3m, right on the command line:

w3m http://ec2-54-205-27-58.compute-1.amazonaws.com:50070

If you use the internal IP (which you can find on the AWS console), as follows, then you will not be blocked by the firewall and you will be able to browse at any level:

w3m http://10.180.188.14:50070

If you want to see the HBase Master UI, it is found here:

w3m http://10.180.188.14:60010
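Besides the web UIs, you can repeat the same check we used on the single-node install, this time from any node of the cluster; the 'simple' variant of the status command also lists every region server individually:

ubuntu@<node-url>:~$ hbase shell
hbase(main):001:0> status
hbase(main):002:0> status 'simple'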

If you have any questions, use the Hadoop illuminated forum found at http://hadoopilluminated.com/ to ask the authors or your peers.

You will also have many choices of Hadoop distribution and run environments. We have summarized them in the following cartoon:

 

Summary


In this chapter, I convinced you that you will be able to build your own HBase clusters, and then spent a large part of the chapter walking you through this process. Please follow the steps precisely! Many hints that are found here are very important, and without them, the cluster will not work.

Once you are at ease with the basic construction, you will be able to strike out on your own, change the ways in which you build those clusters, and eventually come up with something new and unexpected.

Please keep in mind that we used the Cloudera Hadoop distribution as the basis for all the instructions. You are not limited to this; you have a choice. The Apache Bigtop project is your independent alternative (http://bigtop.apache.org/). Hortonworks and MapR also offer distributions with their managers. All of them provide the same excellent Hadoop distribution. In this book, I wanted to give you a clear set of instructions that worked for me.

For the comparison of different Hadoop distributions, please refer to Chapter 11, Distributions, of our open source book Hadoop Illuminated (http://hadoopilluminated.com/hadoop_illuminated/Distributions.html). If you are interested in the precise installation instructions for other distributions, watch out for our Hadoop illuminated labs at https://github.com/hadoop-illuminated/HI-labs. Eventually, all the distributions will be described there, in the admin labs.

Note that there are exactly 33 pictures in this chapter. This, of course, is no coincidence.

Recall the poem by Omar Khayyám, which tells you that there are no coincidences in this world:

"The Moving Finger writes; and, having writ,

Moves on: nor all thy Piety nor Wit

Shall lure it back to cancel half a Line,

Nor all thy Tears wash out a Word of it."

One can argue whether tears can or cannot erase our mistakes, but in this world of clusters, we can always try by repeating the steps again and again.

In the next chapter, we will discuss using Java code to read from and write to HBase. We will also see how we can control HBase with the help of the HBase shell. The most important thing we will learn is to operate through SQL statements, in a manner familiar to all SQL database users.

About the Authors

  • Mark Kerzner

    Mark Kerzner holds degrees in law, math, and computer science. He has been designing software for many years and Hadoop-based systems since 2008. He is a cofounder of Elephant Scale LLC, a big data training and consulting firm, as well as the co-author of the open source book Hadoop Illuminated. He has authored other books and patents as well. He knows about 10 languages and is a Mensa member.

  • Sujee Maniyam

    Sujee Maniyam has been developing software for 15 years. He is a hands-on expert of Hadoop, NoSQL, and cloud technologies. He is a founder and the Principal at Elephant Scale (http://elephantscale.com/), where he consults and teaches big data technologies. He has authored a few open source projects and has contributed to the Hadoop project. He is an author of the open source book Hadoop Illuminated (http://hadoopilluminated.com/).

    He is the founder of the Big Data Gurus meetup in San Jose, CA. He has presented at various meetups and conferences.

    You can find him on LinkedIn at http://www.linkedin.com/in/sujeemaniyam or read more about him at http://sujee.net.
