
You're reading from  HBase Administration Cookbook

Product type: Book
Published in: Aug 2012
Publisher: Packt
ISBN-13: 9781849517140
Edition: 1st Edition
Author (1)
Yifeng Jiang

Yifeng Jiang is a Hadoop and HBase Administrator and Developer at Rakuten, the largest e-commerce company in Japan. After graduating from the University of Science and Technology of China with a B.S. in Information Management Systems, he started his career as a professional software engineer, focusing on Java development. In 2008, he started looking into the Hadoop project. In 2009, he led the development of his previous company's display advertisement data infrastructure using Hadoop and Hive. In 2010, he joined his current employer, where he designed and implemented the Hadoop- and HBase-based, large-scale item ranking system. He is also a member of the company's Hadoop team, which operates several Hadoop/HBase clusters.

Introduction


This chapter explains how to set up an HBase cluster, from a basic standalone HBase instance to a fully distributed, highly available HBase cluster on Amazon EC2.

According to Apache HBase's home page:

HBase is the Hadoop database. Use HBase when you need random, real-time, read/write access to your Big Data. This project's goal is the hosting of very large tables—billions of rows X millions of columns—atop clusters of commodity hardware.

HBase can run against any filesystem. For example, you can run HBase on top of an EXT4 local filesystem, Amazon Simple Storage Service (Amazon S3), or the Hadoop Distributed File System (HDFS), which is the primary distributed filesystem for Hadoop. In most cases, a fully distributed HBase cluster runs on an instance of HDFS, so we will explain how to set up Hadoop before proceeding.
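HBase is told where to store its data through the `hbase.rootdir` property, whose value is a URI. As a minimal sketch, the following Python snippet shows how the URI scheme indicates the underlying filesystem. The scheme table, the function name, and the example host names are illustrative assumptions, not part of HBase:

```python
from urllib.parse import urlparse

# Illustrative mapping from URI scheme to storage type (an assumption for
# this sketch; the schemes a cluster accepts depend on its Hadoop setup).
SCHEME_TO_STORAGE = {
    "file": "local filesystem (for example EXT4)",
    "hdfs": "Hadoop Distributed File System (HDFS)",
    "s3": "Amazon S3",
}

def describe_rootdir(rootdir: str) -> str:
    """Return a human-readable description of the storage behind a rootdir URI."""
    scheme = urlparse(rootdir).scheme
    return SCHEME_TO_STORAGE.get(scheme, "unknown scheme: " + scheme)

# Hypothetical values for hbase.rootdir.
print(describe_rootdir("file:///tmp/hbase"))
print(describe_rootdir("hdfs://namenode.example.com:8020/hbase"))
```

A standalone instance would typically use a `file://` location, while a fully distributed cluster would point at HDFS.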

Apache ZooKeeper is an open source software providing a highly reliable, distributed coordination service. A distributed HBase depends on a running ZooKeeper cluster.

HBase, which is a database that runs on Hadoop, keeps a lot of files open at the same time. We need to change some Linux kernel settings to run HBase smoothly.
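Because HBase keeps many files open, the per-process open-file limit is one setting worth checking before starting a cluster. The sketch below reads that limit with Python's standard `resource` module (Unix only). The threshold `ASSUMED_MIN_OPEN_FILES` is a placeholder chosen for illustration; this introduction does not state a required value:

```python
import resource

# Placeholder threshold for illustration only; not a value taken from the text.
ASSUMED_MIN_OPEN_FILES = 32768

def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def is_limit_sufficient(soft, minimum=ASSUMED_MIN_OPEN_FILES):
    """Check a soft limit against the assumed minimum."""
    # RLIM_INFINITY means "no limit", which always passes the check.
    return soft == resource.RLIM_INFINITY or soft >= minimum

soft, hard = open_file_limits()
print(f"soft={soft} hard={hard} sufficient={is_limit_sufficient(soft)}")
```

Running this as the user that will start HBase shows the limit that its processes would inherit.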

A fully distributed HBase cluster has one or more master nodes (HMaster), which coordinate the entire cluster, and many slave nodes (RegionServer), which handle the actual data storage and serve requests. The following diagram shows a typical HBase cluster structure:

HBase can run multiple master nodes at the same time, and uses ZooKeeper to monitor the masters and fail over between them. However, because HBase uses HDFS as its underlying filesystem, if HDFS is down, HBase is down too. The master node of HDFS, which is called NameNode, is the Single Point Of Failure (SPOF) of HDFS, and therefore also the SPOF of an HBase cluster. That said, the NameNode software is very robust and stable. Moreover, the HDFS team is working hard on a real HA NameNode, which is expected to be included in Hadoop's next major release.
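The master failover idea can be illustrated with a toy model. The class below is a stand-in written for this sketch and is not the ZooKeeper or HBase API: it only tracks which masters are alive and treats the earliest registered live master as the active one.

```python
class ToyCoordinator:
    """Toy stand-in for a coordination service; not the ZooKeeper API."""

    def __init__(self):
        # Live masters, in the order they registered.
        self._alive = []

    def register(self, master):
        """Record a master as alive."""
        self._alive.append(master)

    def fail(self, master):
        """Simulate a master going down."""
        self._alive.remove(master)

    def active_master(self):
        """Return the earliest registered live master, or None if none remain."""
        return self._alive[0] if self._alive else None

coordinator = ToyCoordinator()
coordinator.register("master1")
coordinator.register("master2")
print(coordinator.active_master())  # master1
coordinator.fail("master1")
print(coordinator.active_master())  # master2
```

The model covers only the masters. It says nothing about the NameNode, which remains a single point of failure in the setup described above.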

The first seven recipes in this chapter explain how we can get HBase and all its dependencies working together, as a fully distributed HBase cluster. The last recipe explains an advanced topic on how to avoid the SPOF issue of the cluster.

We will start by setting up a standalone HBase instance, and then demonstrate setting up a distributed HBase cluster on Amazon EC2.
