Home

Data

Learning HBase

By Shashwat Shriparv

Book

eBook $28.99 $19.99

Print $48.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook $28.99 $19.99

Print $48.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

Publication date:: November 2014
Publisher: Packt
Pages: 326
ISBN: 9781783985944

Chapter 1. Understanding the HBase Ecosystem

HBase is a horizontally scalable, distributed, open source, and a sorted map database. It runs on top of Hadoop file system that is Hadoop Distributed File System (HDFS). HBase is a NoSQL nonrelational database that doesn't always require a predefined schema. It can be seen as a scaling flexible, multidimensional spreadsheet where any structure of data is fit with on-the-fly addition of new column fields, and fined column structure before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of Hadoop distributed file system and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and more flexible schema.

HBase is modeled on Google BigTable. It was inspired by Google BigTable, which is compressed, high-performance, proprietary data store built on the Google file system. HBase was a developed as a Hadoop subproject to support storage of structural data, which can take advantage of most distributed files systems (typically, the Hadoop Distributed File System known as HDFS).

The following table contains key information about HBase and its features:

Features	Description
Developed by	Apache
Written in	Java
Type	Column oriented
License	Apache License
Lacking features of relational databases	SQL support, relations, primary, foreign, and unique key constraints, normalization
Website	http://hbase.apache.org
Distributions	Apache, Cloudera
Download link	http://mirrors.advancedhosters.com/apache/hbase/
Mailing lists	The user list: `<user-subscribe@hbase.apache.org>` The developer list: `<dev-subscribe@hbase.apache.org>`
Blog	http://blogs.apache.org/hbase/

HBase layout on top of Hadoop

The following figure represents the layout information of HBase on top of Hadoop:

There is more than one ZooKeeper in the setup, which provides high availability of master status; a RegionServer may contain multiple rations. The RegionServers run on the machines where DataNodes run. There can be as many RegionServers as DataNodes. RegionServers can have multiple HRegions; one HRegion can have one HLog and multiple HFiles with its associate's MemStore.

HBase can be seen as a master-slave database where the master is called HMaster, which is responsible for coordination between client application and HRegionServer. It is also responsible for monitoring and recording metadata changes and management. Slaves are called HRegionServers, which serve the actual tables in form of regions. These regions are the basic building blocks of the HBase tables, which contain distribution of tables. So, HMaster and RegionServer work in coordination to serve the HBase tables and HBase cluster.

Usually, HMaster is co-hosted with Hadoop NameNode daemon process on a server and communicates to DataNode daemon for reading and writing data on HDFS. The RegionServer runs or is co-hosted on the Hadoop DataNodes.

Comparing architectural differences between RDBMs and HBase

Let's list the major differences between relational databases and HBase:

Relational databases	HBase
Uses tables as databases	Uses regions as databases
File systems supported are FAT, NTFS, and EXT	File system supported is HDFS
The technique used to store logs is commit logs	The technique used to store logs is Write-Ahead Logs (WAL)
The reference system used is coordinate system	The reference system used is ZooKeeper
Uses the primary key	Uses the row key
Partitioning is supported	Sharding is supported
Use of rows, columns, and cells	Use of rows, column families, columns, and cells

HBase features

Let's see the major features of HBase that make it one of the most useful databases for the current and future industry:

Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocation and replications. It works with multiple HMasters and region servers. This failover is also facilitated using HBase and RegionServer replication.
Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions to smaller subregions, once it reaches a threshold size to reduce I/O time and overhead.
Hadoop/HDFS integration: It's important to note that HBase can run on top of other file systems as well. While HDFS is the most common choice as it supports data distribution and high availability using distributed Hadoop, for which we just need to set some configuration parameters and enable HBase to communicate to Hadoop, an out-of-the-box underlying distribution is provided by HDFS.
Real-time, random big data access: HBase uses log-structured merge-tree (LSM-tree) as data storage architecture internally, which merges smaller files to larger files periodically to reduce disk seeks.
MapReduce: HBase has a built-in support of Hadoop MapReduce framework for fast and parallel processing of data stored in HBase.
Note
You can search for the package org.apache.hadoop.hbase.mapreduce for more details.
Java API for client access: HBase has a solid Java API support (client/server) for easy development and programming.
Thrift and a RESTtful web service: HBase not only provides a thrift and RESTful gateway but also web service gateways for integrating and accessing HBase besides Java code (HBase Java APIs) for accessing and working with HBase.
Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and exporting matrix for monitoring purposes with tools such as Ganglia and Nagios.
Distributed: HBase works when used with HDFS. It provides coordination with Hadoop so that distribution of tables, high availability, and consistency is supported by it.
Linear scalability (scale out): Scaling of HBase is not scale up but scale out, which means that we don't need to make servers more powerful but we add more machines to its cluster. We can add more nodes to the cluster on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing, start the RegionServer on the new node, and it is scaled up, it is as simple as that.
Column oriented: HBase stores each column separately in contrast with most of the relational databases, which uses stores or are row-based storage. So in HBase, columns are stored contiguously and not the rows. More about row- and column-oriented databases will follow.
HBase shell support: HBase provides a command-line tool to interact with HBase and perform simple operations such as creating tables, adding data, and scanning data. This also provides full-fledged command-line tool using which we can interact with HBase and perform operations such as creating table, adding data, removing data, and a few other administrative commands.
Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record.
Snapshot support: HBase supports taking snapshots of metadata for getting the previous or correct state form of data.

HBase in the Hadoop ecosystem

Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem:

HBase can work as a separate entity on the local file system (which is not really effective as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services, a distributed files system (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows and columns), which most of the programmers are already familiar with, the programmers were finding it difficult to process the data that was stored on HDFS as an unstructured flat file format. This led to the evolution of HBase, which provided a way to store data in a structural way.

Consider that we have got a CSV file stored on HDFS and we need to query from it. We would need to write a Java code for this, which wouldn't be a good option. It would be better if we could specify the data key and fetch the data from that file. So, what we can do here is create a schema or table with the same structure of CSV file to store the data of the CSV file in the HBase table and query using HBase APIs, or HBase shell using key.

Data representation in HBase

Let's look into the representation of rows and columns in HBase table:

An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data.

So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are in brief. It is assumed here that you are already familiar with Hadoop; if not, following a brief introduction about Hadoop will help you to understand it.

Hadoop

Hadoop is an underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework which supports large dataset storage. It provides distributed file system and MapReduce, which is a distributed programming framework. It provides a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on a system with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage, rapid data access. It has the following submodules:

Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master components facilitating communication and coordination between different Hadoop modules.
Hadoop distributed file system: This is the underlying distributed file system, which is abstracted on the top of the local file system that provides high throughput of read and write operations of data on Hadoop.
Hadoop YARN: This is the new framework that is shipped with newer releases of Hadoop. It provides job scheduling and job and resource management.
Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets.

Other Hadoop subprojects are HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject, it's a related project; they solve similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (ZooKeeper isn't a Hadoop subproject. It's a dependency shared by many distributed systems), and so on. All of these have different usability and the combination of all these subprojects forms the Hadoop ecosystem.

Core daemons of Hadoop

The following are the core daemons of Hadoop:

NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact to Hadoop. In the new release of Hadoop, we have an option of more than one NameNode for high availability.
JobTracker: This runs on the NameNode and performs the MapReduce of the jobs submitted to the cluster.
SecondaryNameNode: This maintains the backup of metadata present on the NameNode, and also records the file system changes.
DataNode: This will contain the actual data.
TaskTracker: This will perform tasks on the local data assigned by the JobTracker.

The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have ResourceManager instead of JobTracker, the node manager instead of TaskTrackers, and the YARN framework instead of a simple MapReduce framework. The following is the comparison between daemons in Hadoop 1 and Hadoop 2:

Hadoop 1	Hadoop 2
HDFS
NameNode Secondary NameNode DataNode	NameNode (more than one active/standby) Checkpoint node DataNode
Processing
MapReduce v1 JobTracker TaskTracker	YARN (MRv2) ResourceManager NodeManager Application Master

Comparing HBase with Hadoop

As we now know what HBase and what Hadoop are, let's have a comparison between HDFS and HBase for better understanding:

Hadoop/HDFS	HBase
This provide file system for distributed storage	This provides tabular column-oriented data storage
This is optimized for storage of huge-sized files with no random read/write of these files	This is optimized for tabular data with random read/write facility
This uses flat files	This uses key-value pairs of data
The data model is not flexible	Provides a flexible data model
This uses file system and processing framework	This uses tabular storage with built-in Hadoop MapReduce support
This is mostly optimized for write-once read-many	This is optimized for both read/write many

Comparing functional differences between RDBMs and HBase

Lately, we are hearing about NoSQL databases such as HBase, so let's just understand what actually HBase has and lacks in comparison to conventional relational databases that have existed for so long now. The following table differentiates it well:

Relational database	HBase
This supports scale up. In other words, when more disk and memory processing power is needed, we need to upgrade it to a more powerful server.	This supports scale out. In other words, when more disk and memory processing power is needed, we need not upgrade the server. However, we need to add new servers to the cluster.
This uses SQL queries for reading records from tables.	This uses APIs and MapReduce for accessing data from HBase tables.
This is row oriented, that is, each row is a contiguous unit of page.	This is column oriented, that is, each column is a contiguous unit of page.
The amount of data depends on configuration of server.	The amount of data does not depend on the particular machine but the number of machines.
It's Schema is more restrictive.	Its schema is flexible and less restrictive.
This has ACID support.	There is no built-in support for HBase.
This is suited for structured data.	This is suited to both structured and nonstructural data.
Conventional relational database is mostly centralized.	This is always distributed.
This mostly guarantees transaction integrity.	There is no transaction guaranty in HBase.
This supports JOINs.	This does not support JOINs.
This supports referential integrity.	There is no in-built support for referential integrity.

So with these differences, both have their own usability and use cases. When we have a small amount of data that can be accommodated in RDBMS without performance lagging, we can go with RDBMS.

When we need more Online Transaction Processing (OLTP) and the transaction type of processing, RDBMS is easy to go. When we have a huge amount of data (in terabytes and petabytes), we should look towards HBase, which is always better for aggregation on columns and faster processing.

We have gone through the word, column-oriented database, in the previous introduction; now let's discuss the difference between the column-oriented databases and the row-oriented databases, which are the traditional relational databases.

These column-oriented database systems have been shown to perform more than an order of magnitude, better than traditional row-oriented database systems on analytical workloads found in data warehouse systems, decision system, and business intelligence applications. These are more I/O-efficient for write-once read-many queries.

Logical view of row-oriented databases

The following figure shows how data is represented in relational databases:

Logical view of column-oriented databases

The following figure shows how logically we can represent NoSQL/column-oriented databases such as HBase:

Row-oriented data stores store rows in a contiguous unit on the page, and the number of rows are packed into a page. They are much faster for small numbers of rows and slow for aggregation. On the contrary, column-oriented data stores columns in a contiguous unit on the page, columns may extend up to millions of entries, so they run for many pages. These are much faster for aggregation and analytics. The root of column-oriented database systems can be traced to the 1970 when transposed file first appeared. Column-oriented data stores are better for compression than row-oriented data stores. The following is the comparison between these two:

Row-oriented data stores	Column-oriented data stores
These are efficient for addition/modification of records	These are efficient for reading data
They read pages containing entire rows	They read only needed columns
These are best for OLTP	These are not so optimized for OLTP yet
This serializes all the values in a row together, then the value in the next row, and so on	This serializes all the value of columns together and so on
Row data are stored in contiguous pages in memory or on disk	Columns are stored in pages in memory or on disk

Suppose the records of a table are stored in the pages of memory. When they need to be accessed, these pages are brought to the primary memory, if they are not already present in the memory.

If one row occupies a page and we need all specific column such as salary or rate of interest from all the rows for some kind of analytics, each page containing the columns has to be brought in the memory; so this page in page out will result in a lot of I/O, which may result in prolonged processing time.

In column-oriented databases, each column will be stored in pages. If we need to fetch a specific column, there will be less I/O as only the pages that contain the specified column needed to be brought in the main memory and read, and we need not bring and read all the pages containing rows/records henceforth into the memory. So the kind of queries where we need to just fetch specific columns and not whole record(s) or sets is served best in column-oriented database, which is useful for analytics wherein we can fetch some columns and do some mathematical operations such as sum and average.

Pros and cons of column-oriented databases

The following are pros of column-oriented database:

This has built-in support for efficient and data compression.
This supports fast data retrieval.
Administration and configuration is simplified. It can be scaled out and hence is very easy to expand.
This is good for high performance on aggregation queries (such as COUNT, SUM, AVG, MIN, and MAX).
This is efficient for partitioning as it provides features of automatic sharding mechanism to distribute bigger regions to smaller ones.

The following are cons of column-oriented database:

Queries with JOINs and data from many tables are not optimized.
It records and deletes lot of updates and has to make frequent compaction and splits too. This reduces its storage efficiency.
Partitioning or indexing schemes can be difficult to design as a relational concept is not implicit.

About the internal storage architecture of HBase

The following figure shows the principle algorithm and data structure HBase works on, that is, LSM-tree, and the way of merging, and precedes the explanation:

HBase stores file using LSM-tree, which maintains data in two separate parts that are optimized for underlying storage. This type of data structure depends on two structures, a current and smaller one in memory and a bigger one on the persistent disk, and once the part in memory becomes bigger than a certain limit, it is merged with the bigger structure that is stored on the disk using a merge sort algorithm and a new in-memory tree is created for newer insert requests. It transforms random data access into sequential data access, which improves read performance, and merging is a background process, which does not affect the foreground processing.

Getting started with HBase

We will discuss this section in a bit questionnaire manner, and will come to understand HBase with the help of scenarios and conditions.

When it started

The following figure shows the flow of HBase: birth, growth, and current status:

It all started in 2006 when a company called Powerset (later acquired by Microsoft) was looking forward to building a natural language search engine. The person responsible for the development was Jim Kellerman and there were also many other contributors. It was modeled around the Google BigTable white paper that came out in 2006, which was running on Google File System (GFS).

It started with a TAR file with a random bunch of Java files with initial HBase code. It was first added to the contrib directory of Hadoop as a small subcomponent of Hadoop and with the dedicated effort for filling up gaps; it has slowly and steadily grown into a full-fledged project. It was first added with Hadoop 0.1.0 and as it become more and more feature rich and stable, it was promoted to Hadoop subproject and then slowly with more and more development and contribution from the HBase user and developer group, it has became one of the top-level projects at Apache.

The following figure shows HBase versions from the beginning till now:

Let's have a look at the year-by-year evolution of HBase's important features:

2006: The idea of HBase started with the white paper of Google BigTable
2007: Data compression on the per column family was made available, addition and deletion of column families online was added, script to start/stop HBase cluster was added, MapReduce connecter was added, HBase shell support was added, support of row and column filter was added, algorithm to distribute region was evenly added, first rest interface was added, and hosted the first HBase meetup
2008: HBase 0.1.0 to 0.18.1, HBase moved to new SVN, HBase added as a contribution to Hadoop, HBase become Hadoop subproject, first separate release became available, and Ruby shell added
2009: HBase 0.19.0 to 0.20.*, improvement in writing and scanning, batching of writing and scanning, block compression, HTable interface to REST, addition of binary comparator, addition of regular expression filters, and many more
2010 till date: HBase 0.89.* - 0.94.*, support for HDFS durability, improvement in import flow, support for MurmurHash3, addition of daemon threads for NameNode and DataNode to indicate the VM or kernel-caused pause in application log, tags support for key value, running MapReduce over snapshot files, introduction of transparent encryption of HBase on disk data, addition of per key-value security, offline rebuilding .META. from file system data, snapshot support, and many more
Note
For more information, just visit https://issues.apache.org/jira/browse/HBASE and explore more detailed explanations and more lists of improvements and the addition of new features.
0.96 to 1.0 and Future: HBase Version 1 and higher, add utility for adorning HTTP context, fully high availability with Hadoop HA, rolling upgrades, improved failure detection and recovery, cell-level access security, inline cell tagging, quota and grouping, reverse scan, rolling upgrade, and it will be more useful for analytics purposes and helpful for data scientists
Note
While we wait for new features in v1.0; we can always visit http://hbase.apache.org for the latest releases and features.
Here is the link from where we can download HBase versions:
http://apache.mirrors.tds.net/hbase/stable

Let's now discuss HBase and Hadoop compatibility and the features they provide together.

Prior to Hadoop v1.0, when DataNode used to crash, HBase Write-Ahead Log—the logfiles that maintain the read/write operation before the final writing is done to the MemStore—would be lost and hence the data too. This version of Hadoop integrated append branch into the main core, which increased the durability for HBase. Hadoop v1.0 has also implemented the facility of disk failure, making RegionServer more robust.

Hadoop v2.0 has integrated high availability of NameNode, which also enables HBase to be more reliable and robust by enabling the multiple HMaster instances. Now with this version of HBase, upgrading has become easy because it is made independent of HDFS upgrades. Let's see in the following table how recent versions of Hadoop have enhanced HBase on the basis of performance, availability, and features:

Criteria	Hadoop v1.0	Hadoop v2.0	Hadoop v2.x
Features	Durability using `hflush()`	`hsync()` snapshot hard linking (https://issues.apache.org/jira/browse/HDFS-3370) HBase-aware block placement
Performance	Short-circuit read	Native CRC Datanodekeepalive	Direct write API (HBase provides the utility classes and the ImportTSV tool itself to write directly into HFile. Then, using the `IncrementalLoadHFile`, these files are loaded into the regions managed by RS. Once these two steps are over, client can read the data normally) Zero copy API (in this operation, the CPU does not copy data from one memory area to another. This is used to save on processing power and memory when sending files over a network) Direct codec API (used at server side for writing cells to WAL as well as for sending edits as part of the distributed-splitting process)

Note

Miscellaneous features in newer HBase are HBase isolation and allocation, online-automated repair of table integrity and region consistency problems, dynamic configuration changes, reverse scanning (stop row to start row), and many other features; users can visit https://issues.apache.org/jira/browse/HBASE for features and advancement of each HBase release.

HBase components and functionalities

Here let's discuss various components of HBase and their components recursively:

ZooKeeper
HMaster
RegionServer
Client
Catalog tables

ZooKeeper

ZooKeeper is a high-performance, centralized, multicoordination service system for distributed application, which provides a distributed synchronization and group service to HBase.

It enables the users and developer to focus on the application logic and not on the coordination with the cluster, for which it provides some API that can be used by the developers to use and implement coordination task such as master server, and managing application and cluster communication system.

In HBase, ZooKeeper is used to elect a cluster master in order to keep track of available and online servers, and to keep the metadata of the cluster. ZooKeeper APIs provide:

Consistency, ordering, and durability
Synchronization
Concurrency for a distributed clustered system

The following figure shows ZooKeeper:

It was developed at Yahoo Research. And the reason behind the name ZooKeeper is that in Hadoop system, projects are based on animal names, and in discussion regarding naming this technology, this name emerged as it manages the availability and coordination between different components of a distributed system.

ZooKeeper not only simplifies the development but also sits on the top distributed system as an abstraction layer to facilitate the better reachability to the components of the system. The following figure shows the request and response flow:

Let's consider a scenario wherein we have a few people who want to fill 10 rooms with some items. One instance would be where we will show how they find their way to the room to keep the items. Some of the rooms will be locked, which will lead the people to move on to other rooms. The other instance would be where we can allocate some representatives with information about the rooms, condition of rooms, and state of rooms (open, closed, fit for storing, not fit, and so on). We can then send them with items to those representatives for the information. The representative will guide the person towards the right room, which is available for storage of items, and the person can directly move to the specified room and store the item. This will not only ease the communication and the storage process but also reduce the overhead from the process. The same technique can be applied in the case of the ZooKeepers.

ZooKeeper maintains a tree with ZooKeeper data internally called a znode. This can be of two types:

Ephemeral, which is good for applications that need to understand whether a specific distributed resource is available or not.
The persistent one will be stored till a client does not delete it explicitly and it stores some data of the application too.

Why an odd number of ZooKeepers?

ZooKeepers are based on a majority principle; it requires that we have a quorum of servers to be up, where quorum is ceil(n/2), for a cluster of three nodes ensemble means two nodes must be up and running at any point of time, and for five node ensemble, a minimum three nodes must be up. It's also important for election purpose for the ZooKeeper master. We will discuss more options of configuration and coding of ZooKeeper in later chapters.

HMaster

HMaster is the component of the HBase cluster that can be thought of as NameNode in the case of Hadoop cluster; likewise, it acts as a master for RegionServers running on different machines. It is responsible for monitoring all RegionServers in an HBase cluster and also provides an interface to all HBase metadata for the client operations. It also handles RegionServer failover, and region splits.

There may be more than one instance of HMaster in an HBase cluster that provides High Availability (HA). So, if we have more than one master, only one master is active at a time; at the start up time, all the masters compete to become the active master in the cluster and whichever wins becomes the active master of the cluster. Meanwhile, all other master instances remain passive till the active master crashes, shuts down, or loses a lease from the ZooKeeper.

In short, it is a coordination component in an HBase cluster, which also manages and enables us to perform an administrative task on the cluster.

Let's now discuss the flow of starting up the HMaster process:

Block (do not serve requests) until it becomes active HMaster.
Finish initialization.
Enter loop until stopped.
Do cleansing when it is stopped.

HMaster exports some of the following interfaces that are metadata-based methods to enable us to interact with HBase:

Related to	Facilities
HBase tables	Creating table, deleting table, enabling/disabling table, and modifying table
HBase column families	Adding columns, modifying columns, and removing columns
HBase table regions	Moving regions, assigning regions, and unassign regions

In HBase, there is a table called .META. (table name on file system), which keeps all information about regions that is referred by HMaster for information about the data. By default, HMaster runs on port number 60000 and its HTTP Web UI is available on port 60010, which can always be changed according to our need.

HMaster functionalities can be summarized as follows:

Monitors RegionServers
Handles RegionServers failover
Handles metadata changes
Assignment/unassignment of regions
Interfaces all metadata changes
Performs reload balancing in idle time
It publishes its location to client using ZooKeeper
HMaster Web UI provides all information about HBase cluster (table, regions, RegionServers and so on)

If a master node goes down

If master goes down, in this scenario, the cluster may continue working normally as clients talk directly to RegionServers. So, cluster may still function steadily. The HBase catalog table (.META. and -ROOT-) exists as HBase tables and it's not stored in master resistant memory. However, as master performs critical functions such as RegionServers' failovers and region splits, these functions may be hampered and if not taken care will create a huge setback to the overall cluster functioning, so the master must be started as soon as possible.

So now, Hadoop is HA enabled and thus HBase can always be made HA using multiple HMasters for better availability and robustness, so we can now consider having multiple HMaster.

RegionServer

RegionServers are responsible for holding the actual raw HBase data. Recall that in a Hadoop cluster, a NameNode manages the metadata and a DataNode holds the raw data. Likewise, in HBase, an HBase master holds the metadata and RegionServer's store. These are the servers that hold the HBase data, as we may already know that in Hadoop cluster, NameNode manages the metadata and DataNode holds the actual data. Likewise, in HBase cluster, RegionServers store the raw actual data. As you might guess, a RegionServer is run or is hosted on top of a DataNode, which utilizes the underlying DataNodes at underlying file system, that is, HDFS.

The following figure shows the architecture of RegionServer:

RegionServer performs the following tasks:

Serving regions(tables) assigned to it
Handling client read/write requests
Flushing cache to HDFS
Maintaining HLogs
Performing compactions
Responsible for handling region splits

Components of a RegionServer

The following are the components of RegionServers

Write-Ahead logs: This is also called edit. When data is read/modified to HBase, it's not directly written in the disk rather it is kept in memory for some time (threshold, which we can configure based on size and time). Keeping this data in memory may result in a loss of data if the machine goes down abruptly. So to solve this, the data is first written in an intermediate file, which is called Write-Ahead logfile and then in memory. So in the case of system failure, data can be reconstructed using this logfile.
HFile: These are the actual files where the raw data is stored physically on the disk. This is the actual store file.
Store: Here the HFile is stored. It corresponds to a column family for a table in HBase.
MemStore: This component is in memory data store; this resides in the main memory and records the current data operation. So, when data is stored in WAL, RegionServers stores key-value in memory store.
Region: These are the splits of HBase table; the table is divided into regions based on the key and are hosted by RegionServers. There may be different regions in a RegionServer.

We will discuss more about these components in the next chapter.

Client

Client is responsible for finding the RegionServer, which is hosting the particular row (data). It is done by querying the catalog tables. Once region is found, the client directly contacts RegionServers and performs the data operation. Once this information is fetched, it is cached by the client for further fast retrieval. The client can be written in Java or any other language using external APIs.

Catalog tables

There are two tables that maintain the information about all RegionServers and regions. This is a kind of metadata for the HBase cluster. The following are the two catalog tables that exist in HBase:

-ROOT-: This includes information about the location of .META. table
.META.: This table holds all regions and their locations

At the beginning of the start up process, the .mMeta location is set to root from where the actual metadata of tables are read and read/write continues. So, whenever a client wants to connect to HBase and read or write into table, these two tables are referred and information is returned to client for direct read and write to the RegionServers and the regions of the specific table.

Who is using HBase and why?

The following is a list of just a few companies that use HBase in production. There are many companies who are using HBase, so we will list a few and not all.

Adobe: They have an HBase cluster of 30 nodes and are ready to expand it. They use HBase in several areas from social services to structured data and processing for internal use.
Facebook: They use it for messaging infrastructure.
Twitter: They use it for a number of applications including people search, which relies on HBase internally for data generation, and also their operations team uses HBase as a time series database for cluster-wide monitoring/performance data.
Infolinks: They use it for process advertisement selection and user events for our in-text advertising network.
StumbleUpon: They use it with MapReduce data source to overcome traditional query speed limits in MySQL.
Trend Micro: They use it as cloud-based storage.
Yahoo!: They are use HBase to store document fingerprint for detecting near-duplicates. They have a cluster of a few nodes that run HDFS, MapReduce, and HBase. The table contains millions of rows; we use this for querying duplicated documents with real-time traffic.
Ancestry.com: This company uses it for DNA analysis.
UIDAI: This is an Indian government project; they use HBase for storing resident details.
Apache: They use it for maintaining wiki.
Mozilla: They are moving Socorro project to HBase.
eBay: They use HBase for indexing site inventory.

Note

And we can keep listing, but we will stop it here and for further information, please visit http://wiki.apache.org/hadoop/Hbase/PoweredBy.

When should we think of using HBase?

Using HBase is not the solution to all problems; however, it can solve a lot of problems efficiently. The first thing is that we should think about the amount of data; if we have a few million rows and a few read and writes, then we can avoid using it. However, think of billions of columns and thousands of read/write data operations in a short interval, we can surely think of using HBase.

Let's consider an example, Facebook uses HBase for its real-time messaging infrastructure and we can think of how many messages or rows of data Facebook will be receiving per second. Considering that amount of data and I/O, we can currently think of using HBase. The following list details a few scenarios when we can consider using HBase:

If data needs to have a dynamic or variable schema
If a number of columns contain more null values (blank columns)
When we have a huge number of dynamic rows
If our data contains a variable number of columns
If we need to maintain versions of data
If high scalability is needed
If we need in-built compression on records
If a high volume of I/O is needed

There are many other cases where we can use HBase and it can be beneficial, which is discussed in later chapters.

When not to use HBase

Let's now discuss some points when we don't compulsorily have to use HBase just because everyone else is using it:

When data is not in large amounts (in TBs and more)
When JOINs and relational DB features are needed
Don't go with the belief "every one is using it"
If RDBMS fits your requirements, use RDBMS

Understanding some open source HBase tools

The following is the list of some HBase tools that are available in the development world:

hbaseexplorer: This tool provides UI for HBase; using this tool, we can perform the following operations:
- Data visualization
- Table creation, deletion, and cloning
- Table statistics
- Scans
Note
For reference, go to http://sourceforge.net/projects/hbaseexplorer/.
Toad for Cloud Databases: This is the tool to connect to HBase and perform various functions.
Note
For reference, go to http://www.toadworld.com/products/toad-for-cloud-databases/default.aspx.
HareDB HBase Client: This is an HBase client, which can be used more easily with its user-friendly interface, which makes it a GUI tool for HBase (including PIG and high speed Hive Query)
Note
For reference, go to http://sourceforge.net/projects/haredbhbaseclie/.
hrider: The hrider is a UI application that provides an easier way to view or manipulate the data saved in the HBase.
Note
For reference, go to https://github.com/NiceSystems/hrider.
Hannibal: This is a tool for Apache HBase region monitoring.
Note
For reference, go to https://github.com/sentric/hannibal.
Performance Monitoring & Alerting (SPM): SPM is a proactive performance monitoring, anomaly detection, and alerting solution available in the Cloud (SaaS) as well as own premise. SPM can monitor Solr, Elasticsearch, Hadoop, HBase, ZooKeeper, Kafka, Storm, Redis, JVM, system metrics, custom metrics, and more.
Note
For reference, go to http://sematext.com/spm/.
Phoenix: This tool is a SQL skin over HBase, delivered as a client-embedded JDBC driver, powering the HBase use cases at Salesforce.com. Phoenix targets low-latency queries (milliseconds), as opposed to batch operation via MapReduce.
Note
For reference, go to https://github.com/forcedotcom/phoenix and http://phoenix.apache.org/.
Impala: Cloudera Impala is a parallel processing SQL query engine, which runs in Apache Hadoop. The Apache-licensed, open source Impala project combines modern, scalable, parallel database technology with the power of Hadoop. Users can directly query data stored in HDFS and Apache HBase without requiring data movement or transformation.
Note
For reference, go to http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Cloudera-Impala-Version-and-Download-Information/Cloudera-Impala-Version-and-Download-Information.html.

The Hadoop-HBase version compatibility table

As there are compatibility issues in almost all systems; likewise, HBase also has compatibility issues with Hadoop versions, which means all versions of HBase can't be used use on top of all Hadoop versions. The following is the version compatibility of Hadoop-HBase that should be kept in mind while configuring HBase on Hadoop (credit: Apache):

Hadoop versions	HBase 0.92.x	HBase 0.94.x	HBase 0.96.0	HBase 0.98.0
Hadoop 0.20.205	Supported	Not supported	Not supported	Not supported
Hadoop 0.22.x	Supported	Not supported	Not supported	Not supported
Hadoop 1.0.0-1.0.2	Supported	Supported	Not supported	Not supported
Hadoop 1.0.3+	Supported	Supported	Supported	Not supported
Hadoop 1.1.x	Not tested enough	Supported	Supported	Not supported
Hadoop 0.23.x	Not supported	Supported	Not tested enough	Not supported
Hadoop 2.0.x-alpha	Not supported	Not tested enough	Not supported	Not supported
Hadoop 2.1.0-beta	Not supported	Not tested enough	Supported	Not supported
Hadoop 2.2.0	Not supported	Not tested enough	Supported	Supported
Hadoop 2.x	Not supported	Not tested enough	Supported	Supported

We can always visit https://hbase.apache.org for more updated version compatibility between HBase and Hadoop.

Applications of HBase

The applications of HBase are as follows:

Medical: HBase is used in the medical field for storing genome sequences and running MapReduce on it, storing the disease history of people or an area, and many others.
Sports: HBase is used in the sports field for storing match histories for better analytics and prediction.
Web: HBase is used to store user history and preferences for better customer targeting.
Oil and petroleum: HBase is used in the oil and petroleum industry to store exploration data for analysis and predict probable places where oil can be found.
e-commerce: HBase is used for recording and storing logs about customer search history, and to perform analytics and then target advertisement for better business.
Other fields: HBase can be used in many other fields where it's needed to store petabytes of data and run analysis on it, for which traditional systems may take months. We will discuss more about use cases and industry usability in further chapters.

HBase pros and cons

Let's now briefly discuss HBase pros and cons.

The following are some advantages of HBase:

Great for analytics in association with Hadoop MapReduce
It can handle very large volumes of data
Supports scaling out in coordination with Hadoop file system even on commodity hardware
Fault tolerance
License free
Very flexible on schema design/no fixed schema
Can be integrated with Hive for SQL-like queries, which is better for DBAs who are more familiar with SQL queries
Auto-sharding
Auto failover
Simple client interface
Row-level atomicity, that is, the PUT operation will either write or fail

The following are some missing aspects:

Single point of failure (when only one HMaster is used)
No transaction support
JOINs are handled in MapReduce layer rather than the database itself
Indexed and sorted only on key, but RDBMS can be indexed on some arbitrary field
No built-in authentication or permissions

So overall, we can say if we are in a position to neglect these cons, we can go with HBase which provides many other benefits that are not there in RDBMS. We can see that it's still an evolving technology with Hadoop and with time, it will become more mature and rich, which will make it one of the best tools for analytical database and distributed fault tolerant database. It is an open source Apache project where users and developers can contribute and add more and more features.

Hadoop HBase and a combination of some other Hadoop subproject can do wonders in the data analysis field; using these technologies, the data can be a hidden treasure, which were stored somewhere uselessly as a dump and now they can be very beneficial for understanding various prospects of a specific industry.

Summary

So in this chapter, we discussed the introductory aspects of HBase and related projects. We have also discussed HBase's components and their place in the HBase ecosystem. This chapter then provided a brief historical context for HBase and we have related it with some common uses of HBase in the industry.

In the next chapter, we will begin with HBase, understanding the different considerations and prerequisites for getting started with HBase. We will also discuss some of the concerns a new user might face during their first encounter with HBase.

About the Author

Shashwat Shriparv

Shashwat Shriparv was born in Muzaffarpur, Bihar. He did his schooling from Muzaffarpur and Shillong, Meghalaya. He received his BCA degree from IGNOU, Delhi and his MCA degree from Cochin University of Science and Technology, Kerala (C-DAC Trivandrum). He was introduced to Big Data technologies in early 2010 when he was asked to perform a proof of concept (POC) on Big Data technologies in storing and processing logs. He was also given another project, where he was required to store huge binary files with variable headers and process them. At this time, he started configuring, setting up, and testing Hadoop HBase clusters and writing sample code for them. After performing a successful POC, he initiated serious development using Java REST and SOAP web services, building a system to store and process logs to Hadoop using web services, and then storing these logs in HBase using homemade schema and reading data using HBase APIs and HBase-Hive mapped queries. Shashwat successfully implemented the project, and then moved on to work on huge binary files of size 1 to 3 TB, processing the header and storing metadata to HBase and files on HDFS. Shashwat started his career as a software developer at C-DAC Cyber Forensics, Trivandrum, building mobile-related software for forensics analysis. Then, he moved to Genilok Computer Solutions, where he worked on cluster computing, HPC technologies, and web technologies. After this, he moved to Bangalore from Trivandrum and joined PointCross, where he started working with Big Data technologies, developing software using Java, web services, and platform as Big Data. He worked on many projects revolving around Big Data technologies, such as Hadoop, HBase, Hive, Pig, Sqoop, Flume, and so on at PointCross. From here, he moved to HCL Infosystems Ltd. to work on the UIDAI project, which is one of the most prestigious projects in India, providing a unique identification number to every resident of India. Here, he worked on technologies such as HBase, Hive, Hadoop, Pig, and Linux, scripting, managing HBase Hadoop clusters, writing scripts, automating tasks and processes, and building dashboards for monitoring clusters. Currently, he is working with Cognilytics, Inc. on Big Data technologies, HANA, and other high-performance technologies. You can find out more about him at https://github.com/shriparv and http://helpmetocode.blogspot.com. You can connect with him on LinkedIn at http://www.linkedin.com/pub/shashwat-shriparv/19/214/2a9. You can also e-mail him at dwivedishashwat@gmail.com. Shashwat has worked as a reviewer on the book Pig Design Pattern, Pradeep Pasupuleti, Packt Publishing. He also contributed to his college magazine, InfinityTech, as an editor.
Browse publications by this author