We have already explained the advantages of HBase in the preface of this book, where we also dealt with the history and motivation behind HBase. In this chapter, we will make you comfortable with HBase installation. We will teach you how to do the following:
Create the simplest possible, but functional, HBase environment in a minute
Build a more realistic one-machine cluster
Be in full control with clusters of any size
Now it is time for our first cartoon; we will have one for every chapter. This is you, the reader, eager to know the secrets of HBase. With this picture, we understand that learning is a never-ending process.
However, we cannot leave you there. If all you know is building toy clusters, even your most sophisticated HBase design will not command full authority. A pilot needs to know his plane inside out. That is why we will continue with larger and larger cluster installs until you are able to store the planet's data.
Now follow these steps:
Presto! You now have a full HBase development environment that will last you until the end of the book. But remember, to be taken seriously, you need to read till the end and build your clusters.
The first thing that you need to do when learning to deal with HBase is to install it. Moreover, if you are in any way serious about using HBase, you will want to run it not on a single machine, but on a cluster.
DevOps (development and operations) is a modern trend in software development that stresses communication, collaboration, and integration between software developers and information technology professionals. It blurs the distinction between a developer and an administrator.
You will have to live up to this movement in person, by doing what is called Build Your Own Clusters (BYOC). Developers will do well by learning what administrators know, at least to some degree, and the same is true about administrators learning about what developers do.
Why is that? Primarily because efficiency is the name of the game. The reason you choose a NoSQL solution in the first place is efficiency, and your design decisions will all revolve around it: the design of the keys, the tables, and the column families. In fact, with NoSQL, you are, for the first time, given the means of "reasoning about the efficiency of your solutions" (to quote the Google BigTable paper) while doing the logical layout of your data.
True, you will mostly be working on single-node clusters, but you will always be checking your solutions on larger ones. Still, we need to start with baby steps.
I will show you the work in every detail. I often teach courses, and I have noticed how often users get stuck on minor details. This is a matter of miscommunication: once you have built your first hundred clusters, it all seems so easy that you forget how hard it was in the beginning, and you tend to skip the details.
The easiest way to install a single-node HBase cluster starts by going to the HBase Apache website at http://hbase.apache.org/.
Here is what you see there:
The following steps need to be followed:
Click on Downloads.
Choose the latest distribution, download, and unzip it.
Verify the MD5 hash (do not ignore the good software development practices).
Now, start HBase as follows:
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
That's it! You are good to go.
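The steps above can be sketched as a few shell commands. The version number and mirror URL below are assumptions; take the current ones from the Downloads page:

```shell
# A minimal sketch of the quick-start install.
VERSION=0.94.6
TARBALL="hbase-${VERSION}.tar.gz"

# Download the release from the Apache archive (pick a current
# mirror and version from the Downloads page instead).
wget "http://archive.apache.org/dist/hbase/hbase-${VERSION}/${TARBALL}"

# Print the local MD5 hash; compare it by eye with the one
# published next to the download link.
md5sum "${TARBALL}"

# Unpack and start the standalone Master.
tar xzf "${TARBALL}"
cd "hbase-${VERSION}"
./bin/start-hbase.sh
```

The same pattern works for any later release; only the version string changes.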
You might get the following message:
+======================================================================+
|      Error: JAVA_HOME is not set and Java could not be found         |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|       > http://java.sun.com/javase/downloads/ <                      |
|                                                                      |
| HBase requires Java 1.6 or later.                                    |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
+======================================================================+
If this happens to you, set JAVA_HOME in the file conf/hbase-env.sh. Alternatively, you can set the JAVA_HOME variable in your environment.
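As a sketch, either fix looks like this. The JDK path below is an assumption; find yours with `readlink -f "$(which java)"`:

```shell
# Hypothetical JDK location on an Ubuntu box -- substitute your own.
JDK_PATH=/usr/lib/jvm/java-6-openjdk-amd64

# Option 1: set JAVA_HOME for this shell session before starting HBase.
export JAVA_HOME="$JDK_PATH"

# Option 2: record it permanently in HBase's own environment file
# (run from the unpacked HBase directory, where conf/ exists).
[ -f conf/hbase-env.sh ] && echo "export JAVA_HOME=$JDK_PATH" >> conf/hbase-env.sh
```

Option 2 survives reboots and new shells, which is why it is the usual choice.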
$ hbase shell
13/12/29 14:47:12 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.6
hbase(main):001:0> status
1 servers, 0 dead, 11.0000 average load
hbase(main):002:0> list
TABLE
EntityGroup
mark_users
mytable
wordcount
4 row(s) in 0.0700 seconds
hbase(main):003:0> exit
$
Except for these tables (I just showed you mine), you should have no tables at this stage.
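Before moving on, you can give your fresh install a quick smoke test from the shell. The table and column family names below are arbitrary examples, and the sketch assumes `hbase` is on your PATH:

```shell
# Create a throwaway table, write a cell, read it back, and clean up.
hbase shell <<'EOF'
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:greeting', 'hello'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
exit
EOF
```

If the scan shows your cell, both the Master and the shell are working.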
Now we are ready to go for the real stuff: building a cluster.
Granted, there are other distributions, and there is also the option of your own hardware. However, let me tell you what your own hardware could cost: $26,243.70. This is how much it cost me.
That's just for three machines, which is barely enough for Hadoop work, and with HBase, you might as well double the memory. Although in the long run owning your hardware is better than renting, as in most own-versus-rent scenarios, you might still prefer to rent Amazon machines, at a fraction of a dollar per hour and with a few minutes of provisioning time.
Now follow these steps:
Go to the AWS website at http://aws.amazon.com/console/.
Launch your first instance as follows:
Why choose the relatively old version? Because with Hadoop and HBase, there is no shame in sticking to old versions, and there is a good reason for that: the burn-in period for Hadoop is years, running on thousands of machines. So, although you, as a developer, will always prefer the latest and greatest, check yourself. As the wisdom goes, "Who is strong? He who controls his own desires".
There is also another good reason to choose the older version of Ubuntu Server. Most of the Hadoop testing is done on somewhat older versions of the servers. Put yourself in the Hadoop developers' shoes; would you test on a long-term support (seven years) server first, or on the latest and greatest, which promises to put your data in the cloud and to connect you with every social network in the world?
That is why you will have less trouble with the older versions of the OS. I learnt this the hard way.
OK, so now you are convinced and are ready to start your first instance.
Now, you may wonder how much this is going to cost you. First of all, let me give you a convenient table that summarizes AWS costs. Without it, you would have to browse the Amazon services for quite a while. Follow the link at http://www.ec2instances.info/, and here is what you see there. The following table is quite useful:
Secondly, in the words of a well-known Houston personality called Mac (or the Mattress Mac), "I am going to save you money!" (to know more, follow this link at http://tinyurl.com/kf5vhcg). Do not click on the Start the instance button just yet.
In addition to an on-demand instance, Amazon features what are called spot instances. These are machine hours traded on the market, often at one-tenth the price. So, when you are ready to launch your instances, check the Request spot instances option, just as I have done in the following screenshot:
Here are my savings: the m1.xlarge instance costs 48 cents an hour when purchased on demand, but its current market price is about 5 cents, roughly 10 times cheaper. I am setting the maximum offered price at 20 cents, which means I will be paying the fluctuating market price, starting from 5 cents and possibly going up, but never more than 20 cents per hour.
I do not delude myself; I know that the big fish (that is, the big EC2-based companies) are hunting for those savings, and they apply sophisticated trading techniques, which sometimes result in strange pricing, exceeding the on-demand pricing. Be careful with the maximum limit you set. But for our limited purposes of practicing cluster construction, we should swim under the belly of the big fish and be just fine.
The next step is selecting the storage. The Ubuntu image comes with an 8 GB root drive, and that is too little for anything; choose 30 GB for now. Remember that each GB costs 5 cents per month at current prices, so for hours, or even days, of use the cost is negligible.
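The arithmetic behind "negligible" is easy to check with a one-liner, using the 5 cents per GB-month rate quoted above:

```shell
# Back-of-the-envelope EBS cost for the 30 GB root volume.
awk 'BEGIN { printf "30 GB x $0.05/GB-month = $%.2f/month\n", 30 * 0.05 }'
```

A dollar and a half a month, so a few days of practice costs pennies.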
By now, you might be asking yourself, how does Mark know all this? I will give you the two references now, and I will also repeat them at the end of the chapter (for the benefit of those who like to peek at the end). As I have already told you, I run Hadoop/HBase training, and our labs are all open source. Please have a look at https://github.com/hadoop-illuminated/HI-labs for more details. More specifically, in the admin labs, in the Managers section (and that means Hadoop, not human managers), you will find the instructions in brief (https://github.com/hadoop-illuminated/HI-labs/tree/master/hadoop-admin/managers/cloudera-cm). In turn, it refers to the Cloudera blog post found at http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/. However, none of these instructions are as complete as this chapter, so save them for future use.
Now it is time to set up a security group. Here is my hadoop security group. Please note that all servers within this group can communicate with each other on every port. For the outside ports, I have opened those that are required by Cloudera Manager, and by the Hadoop UI for HDFS and MapReduce. Here I am selecting this group for the Cloudera Manager instance that I will be using to install the rest of the cluster:
Hold on, we are almost done. Now, let's start 10 more instances that will be used for cluster construction. There are two reasons why I start them myself rather than asking CM to start them for me. Firstly, it saves money: I will start spot instances, whereas CM can only start on-demand ones. Secondly, it gives me better control: if something does not work, I can see it much better than CM can.
These 10 instances will be the workhorses of the cluster, so I will give them enough root space, that is, 100 GB. The CM is smart enough to get the ephemeral storage (about 5 TB) and make it a part of the HDFS. The result will approximately be a 5-TB cluster for one dollar per hour. Here are all of these pending requests:
Now comes your part: building the cluster. Remember, so far Amazon has been working for you; you just provided the right foundation.
Now, log in to the CM machine as follows:
ssh -i .ssh/<your-key-here.pem> ubuntu@<cm-url>
The key is the one you saved when EC2 created the key pair for you, and <cm-url> is the URL of the server where you run the Cloudera Manager. Note that I carefully assign the servers their names; soon you will have many servers running, and if you don't mark them, it will get confusing. Now, start the install using the following commands:
wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin
After this is done, give it a minute to start the web server. Then go to <cm-url>:7180. In my case, this looks as follows:
Log in with both Username and Password as admin. Accept the free license and continue to the Hosts screen. Now comes probably the most important selection: get the private DNS of every host in the cluster and put it into the Check Hosts window.
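One way to collect those private DNS names without clicking through the console for each instance is the AWS command-line interface. This sketch assumes the AWS CLI is installed and configured with your credentials:

```shell
# List the private DNS name of every running instance in the region,
# one per line, ready to paste into the Check Hosts window.
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PrivateDnsName" \
  --output text | tr '\t' '\n'
```

You can narrow the filter further (for example, by a Name tag) if you run other instances in the same region.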
One last note, and then I will let you kids go play with your cluster and fly solo. Why is it so important to choose the internal IP, also called the private DNS? Firstly, because you won't be charged for every request. Normally, you get charged for every request and transfer, but for internal transfers this charge is zero, that is, free, nada! Secondly, recall that in our security group, all servers are allowed to communicate with each other on all ports, so you won't have any problems setting up your clusters, regardless of which ports the Hadoop services decide to communicate on. If you don't use the private DNS, the install will fail on the next step. However, if everything is correct, you will get this happy screen:
Give it the right username (in my case, it is ubuntu) and the right key on the next screen. I can rely on you to do it right, as these are the same username and key that you used to log in to the Cloudera Manager server; if you could do that, you will be able to do this as well.
Don't leave your monitor unattended; keep clicking at the right times. If you don't, the CM session will time out, and you won't be able to restart the install. All the work will be lost, and you will have to shut all the servers down and restart them. You've been warned, so get your coffee ready before you start!
It is not uncommon for some servers to fail to start. This is normal in clusters and in a networked environment. CM will drop the servers that fail to start for any reason and continue with what it has.
On one of the next screens, do not forget to request HBase as part of the real-time delivery. There is no deep meaning to this, just marketing, as it is you who will provide the actual real-time delivery with HBase and your code.
Finally, enjoy your new cluster, kick the tires, look around, try to look at every service that is installed, analyze each individual host, and so on. You can always come back home by clicking on the Home or Cloudera button at the top-left of the screen.
Log in to the cluster. Any of the 11 servers, including CM, is good for that, because the Gateway service is installed on each one of them. In my case, the login command looks as follows:
ubuntu@<host>:~$ hdfs dfs -ls /
Found 3 items
drwxr-xr-x   - hbase hbase               0 2014-12-30 03:41 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2014-12-30 03:45 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2014-12-30 03:43 /user
However, if you try to create your home directory, it won't work:
hdfs dfs -mkdir /user/ubuntu
mkdir: Permission denied: user=ubuntu, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
To fix this, you need to do the following (as described at https://github.com/hadoop-illuminated/HI-labs/tree/master/hadoop-admin/managers/cloudera-cm):
ubuntu@<host>:~$ sudo -u hdfs hdfs dfs -mkdir /user/ubuntu
ubuntu@<host>:~$ sudo -u hdfs hdfs dfs -chown ubuntu /user/ubuntu
ubuntu@<host>:~$ hdfs dfs -mkdir /user/ubuntu/mark
Now you have your home directory and, in fact, a directory of your own inside it (mark in my case, so that I can see it is mine):
ubuntu@<host>:~$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - ubuntu supergroup          0 2014-12-30 04:03 mark
Moreover, I can even put files there. For example, I can put my install file in mark/, as follows:
hdfs dfs -put cloudera-manager-installer.bin mark/
And, lo and behold, I can see that file:
hdfs dfs -ls mark
Found 1 items
-rw-r--r--   3 ubuntu supergroup     501703 2014-12-30 04:04 mark/cloudera-manager-installer.bin
Now, two last tricks of the trade: how to view the HDFS UI, and how to reach it in the browser or on the command line:
If you use the internal IP (which you can find on the AWS console), as follows, then you will not be blocked by the firewall and you will be able to browse at any level:
If you want to see the HBase UI, it is found here:
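As a sketch, with the defaults of that era, the UI addresses look like the ones below. The IP is a placeholder and the ports are assumptions based on the stock defaults (50070 for the HDFS NameNode, 60010 for the HBase Master); check your Cloudera Manager service pages if they differ:

```shell
# Hypothetical internal IP of the master node -- take yours
# from the AWS console.
NAMENODE_INTERNAL_IP=10.0.0.10

echo "HDFS NameNode UI:  http://${NAMENODE_INTERNAL_IP}:50070"
echo "HBase Master UI:   http://${NAMENODE_INTERNAL_IP}:60010"
```

Open either address in a browser from inside the security group, or fetch it on the command line with curl.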
If you have any questions, use the Hadoop illuminated forum found at http://hadoopilluminated.com/ to ask the authors or your peers.
In this chapter, I convinced you that you will be able to build your own HBase clusters, and then spent a large part of the chapter walking you through this process. Please follow the steps precisely! Many hints that are found here are very important, and without them, the cluster will not work.
Once you are at ease with the basic construction, you will be able to strike on your own, change the ways in which you build those clusters, and eventually come up with something new and unexpected.
Please keep in mind that we used the Cloudera Hadoop distribution as a basis for all the instructions. You are not limited to this; you have a choice. The Apache BigTop project is your independent alternative (http://bigtop.apache.org/). HortonWorks and MapR also offer distributions with their managers. All of them provide the same excellent Hadoop distribution. In this book, I wanted to give you a clear set of instructions that worked for me.
For the comparison of different Hadoop distributions, please refer to Chapter 11, Distributions, of our open source book Hadoop Illuminated (http://hadoopilluminated.com/hadoop_illuminated/Distributions.html). If you are interested in the precise installation instructions for other distributions, watch out for our Hadoop illuminated labs at https://github.com/hadoop-illuminated/HI-labs. Eventually, all the distributions will be described there, in the admin labs.
Note that there are exactly 33 pictures in this chapter. This, of course, is no coincidence.
Recall the poem by Omar Khayyám, which tells you that there are no coincidences in this world:
"The Moving Finger writes; and, having writ,
Moves on: nor all thy Piety nor Wit
Shall lure it back to cancel half a Line,
Nor all thy Tears wash out a Word of it."
One can argue whether tears can or cannot erase our mistakes, but in this world of clusters, we can always try by repeating the steps again and again.
In the next chapter, we will discuss using Java code to read from and write to HBase. We will also see how we can control HBase with the help of the HBase shell. The most important thing we will learn is to operate through SQL statements, in a manner familiar to all SQL database users.