Learning Hbase

Chapter 1. Understanding the HBase Ecosystem

HBase is a horizontally scalable, distributed, open source, and a sorted map database. It runs on top of Hadoop file system that is Hadoop Distributed File System (HDFS). HBase is a NoSQL nonrelational database that doesn't always require a predefined schema. It can be seen as a scaling flexible, multidimensional spreadsheet where any structure of data is fit with on-the-fly addition of new column fields, and fined column structure before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of Hadoop distributed file system and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and more flexible schema.

HBase is modeled on Google BigTable. It was inspired by Google BigTable, which is compressed, high-performance, proprietary data store built on the Google file system. HBase was a developed as a Hadoop subproject to support storage of structural data, which can take advantage of most distributed files systems (typically, the Hadoop Distributed File System known as HDFS).

The following table contains key information about HBase and its features:

Features	Description
Developed by	Apache
Written in	Java
Type	Column oriented
License	Apache License
Lacking features of relational databases	SQL support, relations, primary, foreign, and unique key constraints, normalization
Website	http://hbase.apache.org
Distributions	Apache, Cloudera
Download link	http://mirrors.advancedhosters.com/apache/hbase/
Mailing lists	The user list: `<user-subscribe@hbase.apache.org>` The developer list: `<dev-subscribe@hbase.apache.org>`
Blog	http://blogs.apache.org/hbase/

Comparing architectural differences between RDBMs and HBase

Let's list the major differences between relational databases and HBase:

Relational databases	HBase
Uses tables as databases	Uses regions as databases
File systems supported are FAT, NTFS, and EXT	File system supported is HDFS
The technique used to store logs is commit logs	The technique used to store logs is Write-Ahead Logs (WAL)
The reference system used is coordinate system	The reference system used is ZooKeeper
Uses the primary key	Uses the row key
Partitioning is supported	Sharding is supported
Use of rows, columns, and cells	Use of rows, column families, columns, and cells

HBase features

Let's see the major features of HBase that make it one of the most useful databases for the current and future industry:

Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocation and replications. It works with multiple HMasters and region servers. This failover is also facilitated using HBase and RegionServer replication.
Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions to smaller subregions, once it reaches a threshold size to reduce I/O time and overhead.
Hadoop/HDFS integration: It's important to note that HBase can run on top of other file systems as well. While HDFS is the most common choice as it supports data distribution and high availability using distributed Hadoop, for which we just need to set some configuration parameters and enable HBase to communicate to Hadoop, an out-of-the-box underlying distribution is provided by HDFS.
Real-time, random big data access: HBase uses log-structured merge-tree (LSM-tree) as data storage architecture internally, which merges smaller files to larger files periodically to reduce disk seeks.
MapReduce: HBase has a built-in support of Hadoop MapReduce framework for fast and parallel processing of data stored in HBase.
Note
You can search for the package org.apache.hadoop.hbase.mapreduce for more details.
Java API for client access: HBase has a solid Java API support (client/server) for easy development and programming.
Thrift and a RESTtful web service: HBase not only provides a thrift and RESTful gateway but also web service gateways for integrating and accessing HBase besides Java code (HBase Java APIs) for accessing and working with HBase.
Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and exporting matrix for monitoring purposes with tools such as Ganglia and Nagios.
Distributed: HBase works when used with HDFS. It provides coordination with Hadoop so that distribution of tables, high availability, and consistency is supported by it.
Linear scalability (scale out): Scaling of HBase is not scale up but scale out, which means that we don't need to make servers more powerful but we add more machines to its cluster. We can add more nodes to the cluster on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing, start the RegionServer on the new node, and it is scaled up, it is as simple as that.
Column oriented: HBase stores each column separately in contrast with most of the relational databases, which uses stores or are row-based storage. So in HBase, columns are stored contiguously and not the rows. More about row- and column-oriented databases will follow.
HBase shell support: HBase provides a command-line tool to interact with HBase and perform simple operations such as creating tables, adding data, and scanning data. This also provides full-fledged command-line tool using which we can interact with HBase and perform operations such as creating table, adding data, removing data, and a few other administrative commands.
Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record.
Snapshot support: HBase supports taking snapshots of metadata for getting the previous or correct state form of data.

HBase in the Hadoop ecosystem

Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem:

HBase can work as a separate entity on the local file system (which is not really effective as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services, a distributed files system (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows and columns), which most of the programmers are already familiar with, the programmers were finding it difficult to process the data that was stored on HDFS as an unstructured flat file format. This led to the evolution of HBase, which provided a way to store data in a structural way.

Consider that we have got a CSV file stored on HDFS and we need to query from it. We would need to write a Java code for this, which wouldn't be a good option. It would be better if we could specify the data key and fetch the data from that file. So, what we can do here is create a schema or table with the same structure of CSV file to store the data of the CSV file in the HBase table and query using HBase APIs, or HBase shell using key.

Data representation in HBase

Let's look into the representation of rows and columns in HBase table:

An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data.

So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are in brief. It is assumed here that you are already familiar with Hadoop; if not, following a brief introduction about Hadoop will help you to understand it.

Hadoop

Hadoop is an underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework which supports large dataset storage. It provides distributed file system and MapReduce, which is a distributed programming framework. It provides a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on a system with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage, rapid data access. It has the following submodules:

Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master components facilitating communication and coordination between different Hadoop modules.
Hadoop distributed file system: This is the underlying distributed file system, which is abstracted on the top of the local file system that provides high throughput of read and write operations of data on Hadoop.
Hadoop YARN: This is the new framework that is shipped with newer releases of Hadoop. It provides job scheduling and job and resource management.
Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets.

Other Hadoop subprojects are HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject, it's a related project; they solve similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (ZooKeeper isn't a Hadoop subproject. It's a dependency shared by many distributed systems), and so on. All of these have different usability and the combination of all these subprojects forms the Hadoop ecosystem.

Core daemons of Hadoop

The following are the core daemons of Hadoop:

NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact to Hadoop. In the new release of Hadoop, we have an option of more than one NameNode for high availability.
JobTracker: This runs on the NameNode and performs the MapReduce of the jobs submitted to the cluster.
SecondaryNameNode: This maintains the backup of metadata present on the NameNode, and also records the file system changes.
DataNode: This will contain the actual data.
TaskTracker: This will perform tasks on the local data assigned by the JobTracker.

The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have ResourceManager instead of JobTracker, the node manager instead of TaskTrackers, and the YARN framework instead of a simple MapReduce framework. The following is the comparison between daemons in Hadoop 1 and Hadoop 2:

Hadoop 1	Hadoop 2
HDFS
NameNode Secondary NameNode DataNode	NameNode (more than one active/standby) Checkpoint node DataNode
Processing
MapReduce v1 JobTracker TaskTracker	YARN (MRv2) ResourceManager NodeManager Application Master

Comparing HBase with Hadoop

As we now know what HBase and what Hadoop are, let's have a comparison between HDFS and HBase for better understanding:

Hadoop/HDFS	HBase
This provide file system for distributed storage	This provides tabular column-oriented data storage
This is optimized for storage of huge-sized files with no random read/write of these files	This is optimized for tabular data with random read/write facility
This uses flat files	This uses key-value pairs of data
The data model is not flexible	Provides a flexible data model
This uses file system and processing framework	This uses tabular storage with built-in Hadoop MapReduce support
This is mostly optimized for write-once read-many	This is optimized for both read/write many

Comparing functional differences between RDBMs and HBase

Lately, we are hearing about NoSQL databases such as HBase, so let's just understand what actually HBase has and lacks in comparison to conventional relational databases that have existed for so long now. The following table differentiates it well:

Relational database	HBase
This supports scale up. In other words, when more disk and memory processing power is needed, we need to upgrade it to a more powerful server.	This supports scale out. In other words, when more disk and memory processing power is needed, we need not upgrade the server. However, we need to add new servers to the cluster.
This uses SQL queries for reading records from tables.	This uses APIs and MapReduce for accessing data from HBase tables.
This is row oriented, that is, each row is a contiguous unit of page.	This is column oriented, that is, each column is a contiguous unit of page.
The amount of data depends on configuration of server.	The amount of data does not depend on the particular machine but the number of machines.
It's Schema is more restrictive.	Its schema is flexible and less restrictive.
This has ACID support.	There is no built-in support for HBase.
This is suited for structured data.	This is suited to both structured and nonstructural data.
Conventional relational database is mostly centralized.	This is always distributed.
This mostly guarantees transaction integrity.	There is no transaction guaranty in HBase.
This supports JOINs.	This does not support JOINs.
This supports referential integrity.	There is no in-built support for referential integrity.

So with these differences, both have their own usability and use cases. When we have a small amount of data that can be accommodated in RDBMS without performance lagging, we can go with RDBMS.

When we need more Online Transaction Processing (OLTP) and the transaction type of processing, RDBMS is easy to go. When we have a huge amount of data (in terabytes and petabytes), we should look towards HBase, which is always better for aggregation on columns and faster processing.

We have gone through the word, column-oriented database, in the previous introduction; now let's discuss the difference between the column-oriented databases and the row-oriented databases, which are the traditional relational databases.

These column-oriented database systems have been shown to perform more than an order of magnitude, better than traditional row-oriented database systems on analytical workloads found in data warehouse systems, decision system, and business intelligence applications. These are more I/O-efficient for write-once read-many queries.

Logical view of row-oriented databases

The following figure shows how data is represented in relational databases:

Logical view of column-oriented databases

The following figure shows how logically we can represent NoSQL/column-oriented databases such as HBase:

Row-oriented data stores store rows in a contiguous unit on the page, and the number of rows are packed into a page. They are much faster for small numbers of rows and slow for aggregation. On the contrary, column-oriented data stores columns in a contiguous unit on the page, columns may extend up to millions of entries, so they run for many pages. These are much faster for aggregation and analytics. The root of column-oriented database systems can be traced to the 1970 when transposed file first appeared. Column-oriented data stores are better for compression than row-oriented data stores. The following is the comparison between these two:

Row-oriented data stores	Column-oriented data stores
These are efficient for addition/modification of records	These are efficient for reading data
They read pages containing entire rows	They read only needed columns
These are best for OLTP	These are not so optimized for OLTP yet
This serializes all the values in a row together, then the value in the next row, and so on	This serializes all the value of columns together and so on
Row data are stored in contiguous pages in memory or on disk	Columns are stored in pages in memory or on disk

Suppose the records of a table are stored in the pages of memory. When they need to be accessed, these pages are brought to the primary memory, if they are not already present in the memory.

If one row occupies a page and we need all specific column such as salary or rate of interest from all the rows for some kind of analytics, each page containing the columns has to be brought in the memory; so this page in page out will result in a lot of I/O, which may result in prolonged processing time.

In column-oriented databases, each column will be stored in pages. If we need to fetch a specific column, there will be less I/O as only the pages that contain the specified column needed to be brought in the main memory and read, and we need not bring and read all the pages containing rows/records henceforth into the memory. So the kind of queries where we need to just fetch specific columns and not whole record(s) or sets is served best in column-oriented database, which is useful for analytics wherein we can fetch some columns and do some mathematical operations such as sum and average.

Pros and cons of column-oriented databases

The following are pros of column-oriented database:

This has built-in support for efficient and data compression.
This supports fast data retrieval.
Administration and configuration is simplified. It can be scaled out and hence is very easy to expand.
This is good for high performance on aggregation queries (such as COUNT, SUM, AVG, MIN, and MAX).
This is efficient for partitioning as it provides features of automatic sharding mechanism to distribute bigger regions to smaller ones.

The following are cons of column-oriented database:

Queries with JOINs and data from many tables are not optimized.
It records and deletes lot of updates and has to make frequent compaction and splits too. This reduces its storage efficiency.
Partitioning or indexing schemes can be difficult to design as a relational concept is not implicit.

Getting started with HBase

We will discuss this section in a bit questionnaire manner, and will come to understand HBase with the help of scenarios and conditions.

When it started

The following figure shows the flow of HBase: birth, growth, and current status:

It all started in 2006 when a company called Powerset (later acquired by Microsoft) was looking forward to building a natural language search engine. The person responsible for the development was Jim Kellerman and there were also many other contributors. It was modeled around the Google BigTable white paper that came out in 2006, which was running on Google File System (GFS).

It started with a TAR file with a random bunch of Java files with initial HBase code. It was first added to the contrib directory of Hadoop as a small subcomponent of Hadoop and with the dedicated effort for filling up gaps; it has slowly and steadily grown into a full-fledged project. It was first added with Hadoop 0.1.0 and as it become more and more feature rich and stable, it was promoted to Hadoop subproject and then slowly with more and more development and contribution from the HBase user and developer group, it has became one of the top-level projects at Apache.

The following figure shows HBase versions from the beginning till now:

Let's have a look at the year-by-year evolution of HBase's important features:

2006: The idea of HBase started with the white paper of Google BigTable
2007: Data compression on the per column family was made available, addition and deletion of column families online was added, script to start/stop HBase cluster was added, MapReduce connecter was added, HBase shell support was added, support of row and column filter was added, algorithm to distribute region was evenly added, first rest interface was added, and hosted the first HBase meetup
2008: HBase 0.1.0 to 0.18.1, HBase moved to new SVN, HBase added as a contribution to Hadoop, HBase become Hadoop subproject, first separate release became available, and Ruby shell added
2009: HBase 0.19.0 to 0.20.*, improvement in writing and scanning, batching of writing and scanning, block compression, HTable interface to REST, addition of binary comparator, addition of regular expression filters, and many more
2010 till date: HBase 0.89.* - 0.94.*, support for HDFS durability, improvement in import flow, support for MurmurHash3, addition of daemon threads for NameNode and DataNode to indicate the VM or kernel-caused pause in application log, tags support for key value, running MapReduce over snapshot files, introduction of transparent encryption of HBase on disk data, addition of per key-value security, offline rebuilding .META. from file system data, snapshot support, and many more
Note
For more information, just visit https://issues.apache.org/jira/browse/HBASE and explore more detailed explanations and more lists of improvements and the addition of new features.
0.96 to 1.0 and Future: HBase Version 1 and higher, add utility for adorning HTTP context, fully high availability with Hadoop HA, rolling upgrades, improved failure detection and recovery, cell-level access security, inline cell tagging, quota and grouping, reverse scan, rolling upgrade, and it will be more useful for analytics purposes and helpful for data scientists
Note
While we wait for new features in v1.0; we can always visit http://hbase.apache.org for the latest releases and features.
Here is the link from where we can download HBase versions:
http://apache.mirrors.tds.net/hbase/stable

Let's now discuss HBase and Hadoop compatibility and the features they provide together.

Prior to Hadoop v1.0, when DataNode used to crash, HBase Write-Ahead Log—the logfiles that maintain the read/write operation before the final writing is done to the MemStore—would be lost and hence the data too. This version of Hadoop integrated append branch into the main core, which increased the durability for HBase. Hadoop v1.0 has also implemented the facility of disk failure, making RegionServer more robust.

Hadoop v2.0 has integrated high availability of NameNode, which also enables HBase to be more reliable and robust by enabling the multiple HMaster instances. Now with this version of HBase, upgrading has become easy because it is made independent of HDFS upgrades. Let's see in the following table how recent versions of Hadoop have enhanced HBase on the basis of performance, availability, and features:

Criteria	Hadoop v1.0	Hadoop v2.0	Hadoop v2.x
Features	Durability using `hflush()`	`hsync()` snapshot hard linking (https://issues.apache.org/jira/browse/HDFS-3370) HBase-aware block placement
Performance	Short-circuit read	Native CRC Datanodekeepalive	Direct write API (HBase provides the utility classes and the ImportTSV tool itself to write directly into HFile. Then, using the `IncrementalLoadHFile`, these files are loaded into the regions managed by RS. Once these two steps are over, client can read the data normally) Zero copy API (in this operation, the CPU does not copy data from one memory area to another. This is used to save on processing power and memory when sending files over a network) Direct codec API (used at server side for writing cells to WAL as well as for sending edits as part of the distributed-splitting process)

Note

Miscellaneous features in newer HBase are HBase isolation and allocation, online-automated repair of table integrity and region consistency problems, dynamic configuration changes, reverse scanning (stop row to start row), and many other features; users can visit https://issues.apache.org/jira/browse/HBASE for features and advancement of each HBase release.

HBase components and functionalities

Here let's discuss various components of HBase and their components recursively:

ZooKeeper
HMaster
RegionServer
Client
Catalog tables

ZooKeeper

ZooKeeper is a high-performance, centralized, multicoordination service system for distributed application, which provides a distributed synchronization and group service to HBase.

It enables the users and developer to focus on the application logic and not on the coordination with the cluster, for which it provides some API that can be used by the developers to use and implement coordination task such as master server, and managing application and cluster communication system.

In HBase, ZooKeeper is used to elect a cluster master in order to keep track of available and online servers, and to keep the metadata of the cluster. ZooKeeper APIs provide:

Consistency, ordering, and durability
Synchronization
Concurrency for a distributed clustered system

The following figure shows ZooKeeper:

It was developed at Yahoo Research. And the reason behind the name ZooKeeper is that in Hadoop system, projects are based on animal names, and in discussion regarding naming this technology, this name emerged as it manages the availability and coordination between different components of a distributed system.

ZooKeeper not only simplifies the development but also sits on the top distributed system as an abstraction layer to facilitate the better reachability to the components of the system. The following figure shows the request and response flow:

Let's consider a scenario wherein we have a few people who want to fill 10 rooms with some items. One instance would be where we will show how they find their way to the room to keep the items. Some of the rooms will be locked, which will lead the people to move on to other rooms. The other instance would be where we can allocate some representatives with information about the rooms, condition of rooms, and state of rooms (open, closed, fit for storing, not fit, and so on). We can then send them with items to those representatives for the information. The representative will guide the person towards the right room, which is available for storage of items, and the person can directly move to the specified room and store the item. This will not only ease the communication and the storage process but also reduce the overhead from the process. The same technique can be applied in the case of the ZooKeepers.

ZooKeeper maintains a tree with ZooKeeper data internally called a znode. This can be of two types:

Ephemeral, which is good for applications that need to understand whether a specific distributed resource is available or not.
The persistent one will be stored till a client does not delete it explicitly and it stores some data of the application too.

Why an odd number of ZooKeepers?

ZooKeepers are based on a majority principle; it requires that we have a quorum of servers to be up, where quorum is ceil(n/2), for a cluster of three nodes ensemble means two nodes must be up and running at any point of time, and for five node ensemble, a minimum three nodes must be up. It's also important for election purpose for the ZooKeeper master. We will discuss more options of configuration and coding of ZooKeeper in later chapters.

HMaster

HMaster is the component of the HBase cluster that can be thought of as NameNode in the case of Hadoop cluster; likewise, it acts as a master for RegionServers running on different machines. It is responsible for monitoring all RegionServers in an HBase cluster and also provides an interface to all HBase metadata for the client operations. It also handles RegionServer failover, and region splits.

There may be more than one instance of HMaster in an HBase cluster that provides High Availability (HA). So, if we have more than one master, only one master is active at a time; at the start up time, all the masters compete to become the active master in the cluster and whichever wins becomes the active master of the cluster. Meanwhile, all other master instances remain passive till the active master crashes, shuts down, or loses a lease from the ZooKeeper.

In short, it is a coordination component in an HBase cluster, which also manages and enables us to perform an administrative task on the cluster.

Let's now discuss the flow of starting up the HMaster process:

Block (do not serve requests) until it becomes active HMaster.
Finish initialization.
Enter loop until stopped.
Do cleansing when it is stopped.

HMaster exports some of the following interfaces that are metadata-based methods to enable us to interact with HBase:

Related to	Facilities
HBase tables	Creating table, deleting table, enabling/disabling table, and modifying table
HBase column families	Adding columns, modifying columns, and removing columns
HBase table regions	Moving regions, assigning regions, and unassign regions

In HBase, there is a table called .META. (table name on file system), which keeps all information about regions that is referred by HMaster for information about the data. By default, HMaster runs on port number 60000 and its HTTP Web UI is available on port 60010, which can always be changed according to our need.

HMaster functionalities can be summarized as follows:

Monitors RegionServers
Handles RegionServers failover
Handles metadata changes
Assignment/unassignment of regions
Interfaces all metadata changes
Performs reload balancing in idle time
It publishes its location to client using ZooKeeper
HMaster Web UI provides all information about HBase cluster (table, regions, RegionServers and so on)

If a master node goes down

If master goes down, in this scenario, the cluster may continue working normally as clients talk directly to RegionServers. So, cluster may still function steadily. The HBase catalog table (.META. and -ROOT-) exists as HBase tables and it's not stored in master resistant memory. However, as master performs critical functions such as RegionServers' failovers and region splits, these functions may be hampered and if not taken care will create a huge setback to the overall cluster functioning, so the master must be started as soon as possible.

So now, Hadoop is HA enabled and thus HBase can always be made HA using multiple HMasters for better availability and robustness, so we can now consider having multiple HMaster.

RegionServer

RegionServers are responsible for holding the actual raw HBase data. Recall that in a Hadoop cluster, a NameNode manages the metadata and a DataNode holds the raw data. Likewise, in HBase, an HBase master holds the metadata and RegionServer's store. These are the servers that hold the HBase data, as we may already know that in Hadoop cluster, NameNode manages the metadata and DataNode holds the actual data. Likewise, in HBase cluster, RegionServers store the raw actual data. As you might guess, a RegionServer is run or is hosted on top of a DataNode, which utilizes the underlying DataNodes at underlying file system, that is, HDFS.

The following figure shows the architecture of RegionServer:

RegionServer performs the following tasks:

Serving regions(tables) assigned to it
Handling client read/write requests
Flushing cache to HDFS
Maintaining HLogs
Performing compactions
Responsible for handling region splits

Components of a RegionServer

The following are the components of RegionServers

Write-Ahead logs: This is also called edit. When data is read/modified to HBase, it's not directly written in the disk rather it is kept in memory for some time (threshold, which we can configure based on size and time). Keeping this data in memory may result in a loss of data if the machine goes down abruptly. So to solve this, the data is first written in an intermediate file, which is called Write-Ahead logfile and then in memory. So in the case of system failure, data can be reconstructed using this logfile.
HFile: These are the actual files where the raw data is stored physically on the disk. This is the actual store file.
Store: Here the HFile is stored. It corresponds to a column family for a table in HBase.
MemStore: This component is in memory data store; this resides in the main memory and records the current data operation. So, when data is stored in WAL, RegionServers stores key-value in memory store.
Region: These are the splits of HBase table; the table is divided into regions based on the key and are hosted by RegionServers. There may be different regions in a RegionServer.

We will discuss more about these components in the next chapter.

Client

Client is responsible for finding the RegionServer, which is hosting the particular row (data). It is done by querying the catalog tables. Once region is found, the client directly contacts RegionServers and performs the data operation. Once this information is fetched, it is cached by the client for further fast retrieval. The client can be written in Java or any other language using external APIs.

Catalog tables

There are two tables that maintain the information about all RegionServers and regions. This is a kind of metadata for the HBase cluster. The following are the two catalog tables that exist in HBase:

-ROOT-: This includes information about the location of .META. table
.META.: This table holds all regions and their locations

At the beginning of the start up process, the .mMeta location is set to root from where the actual metadata of tables are read and read/write continues. So, whenever a client wants to connect to HBase and read or write into table, these two tables are referred and information is returned to client for direct read and write to the RegionServers and the regions of the specific table.

Who is using HBase and why?

The following is a list of just a few companies that use HBase in production. There are many companies who are using HBase, so we will list a few and not all.

Adobe: They have an HBase cluster of 30 nodes and are ready to expand it. They use HBase in several areas from social services to structured data and processing for internal use.
Facebook: They use it for messaging infrastructure.
Twitter: They use it for a number of applications including people search, which relies on HBase internally for data generation, and also their operations team uses HBase as a time series database for cluster-wide monitoring/performance data.
Infolinks: They use it for process advertisement selection and user events for our in-text advertising network.
StumbleUpon: They use it with MapReduce data source to overcome traditional query speed limits in MySQL.
Trend Micro: They use it as cloud-based storage.
Yahoo!: They are use HBase to store document fingerprint for detecting near-duplicates. They have a cluster of a few nodes that run HDFS, MapReduce, and HBase. The table contains millions of rows; we use this for querying duplicated documents with real-time traffic.
Ancestry.com: This company uses it for DNA analysis.
UIDAI: This is an Indian government project; they use HBase for storing resident details.
Apache: They use it for maintaining wiki.
Mozilla: They are moving Socorro project to HBase.
eBay: They use HBase for indexing site inventory.

Note

And we can keep listing, but we will stop it here and for further information, please visit http://wiki.apache.org/hadoop/Hbase/PoweredBy.

When should we think of using HBase?

Using HBase is not the solution to all problems; however, it can solve a lot of problems efficiently. The first thing is that we should think about the amount of data; if we have a few million rows and a few read and writes, then we can avoid using it. However, think of billions of columns and thousands of read/write data operations in a short interval, we can surely think of using HBase.

Let's consider an example, Facebook uses HBase for its real-time messaging infrastructure and we can think of how many messages or rows of data Facebook will be receiving per second. Considering that amount of data and I/O, we can currently think of using HBase. The following list details a few scenarios when we can consider using HBase:

If data needs to have a dynamic or variable schema
If a number of columns contain more null values (blank columns)
When we have a huge number of dynamic rows
If our data contains a variable number of columns
If we need to maintain versions of data
If high scalability is needed
If we need in-built compression on records
If a high volume of I/O is needed

There are many other cases where we can use HBase and it can be beneficial, which is discussed in later chapters.

When not to use HBase

Let's now discuss some points when we don't compulsorily have to use HBase just because everyone else is using it:

When data is not in large amounts (in TBs and more)
When JOINs and relational DB features are needed
Don't go with the belief "every one is using it"
If RDBMS fits your requirements, use RDBMS

Understanding some open source HBase tools

The following is the list of some HBase tools that are available in the development world:

hbaseexplorer: This tool provides UI for HBase; using this tool, we can perform the following operations:
- Data visualization
- Table creation, deletion, and cloning
- Table statistics
- Scans
Note
For reference, go to http://sourceforge.net/projects/hbaseexplorer/.
Toad for Cloud Databases: This is the tool to connect to HBase and perform various functions.
Note
For reference, go to http://www.toadworld.com/products/toad-for-cloud-databases/default.aspx.
HareDB HBase Client: This is an HBase client, which can be used more easily with its user-friendly interface, which makes it a GUI tool for HBase (including PIG and high speed Hive Query)
Note
For reference, go to http://sourceforge.net/projects/haredbhbaseclie/.
hrider: The hrider is a UI application that provides an easier way to view or manipulate the data saved in the HBase.
Note
For reference, go to https://github.com/NiceSystems/hrider.
Hannibal: This is a tool for Apache HBase region monitoring.
Note
For reference, go to https://github.com/sentric/hannibal.
Performance Monitoring & Alerting (SPM): SPM is a proactive performance monitoring, anomaly detection, and alerting solution available in the Cloud (SaaS) as well as own premise. SPM can monitor Solr, Elasticsearch, Hadoop, HBase, ZooKeeper, Kafka, Storm, Redis, JVM, system metrics, custom metrics, and more.
Note
For reference, go to http://sematext.com/spm/.
Phoenix: This tool is a SQL skin over HBase, delivered as a client-embedded JDBC driver, powering the HBase use cases at Salesforce.com. Phoenix targets low-latency queries (milliseconds), as opposed to batch operation via MapReduce.
Note
For reference, go to https://github.com/forcedotcom/phoenix and http://phoenix.apache.org/.
Impala: Cloudera Impala is a parallel processing SQL query engine, which runs in Apache Hadoop. The Apache-licensed, open source Impala project combines modern, scalable, parallel database technology with the power of Hadoop. Users can directly query data stored in HDFS and Apache HBase without requiring data movement or transformation.
Note
For reference, go to http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Cloudera-Impala-Version-and-Download-Information/Cloudera-Impala-Version-and-Download-Information.html.

The Hadoop-HBase version compatibility table

As there are compatibility issues in almost all systems; likewise, HBase also has compatibility issues with Hadoop versions, which means all versions of HBase can't be used use on top of all Hadoop versions. The following is the version compatibility of Hadoop-HBase that should be kept in mind while configuring HBase on Hadoop (credit: Apache):

Hadoop versions	HBase 0.92.x	HBase 0.94.x	HBase 0.96.0	HBase 0.98.0
Hadoop 0.20.205	Supported	Not supported	Not supported	Not supported
Hadoop 0.22.x	Supported	Not supported	Not supported	Not supported
Hadoop 1.0.0-1.0.2	Supported	Supported	Not supported	Not supported
Hadoop 1.0.3+	Supported	Supported	Supported	Not supported
Hadoop 1.1.x	Not tested enough	Supported	Supported	Not supported
Hadoop 0.23.x	Not supported	Supported	Not tested enough	Not supported
Hadoop 2.0.x-alpha	Not supported	Not tested enough	Not supported	Not supported
Hadoop 2.1.0-beta	Not supported	Not tested enough	Supported	Not supported
Hadoop 2.2.0	Not supported	Not tested enough	Supported	Supported
Hadoop 2.x	Not supported	Not tested enough	Supported	Supported

We can always visit https://hbase.apache.org for more updated version compatibility between HBase and Hadoop.

Filter reviews by

All

Amazon verified reviews

Sören Mar 24, 2015

For a future project, I had to read up on both Hadoop and HBase. For HBase, I decided to purchase Learning HBase by Shashwat Shriparv. The author starts by introducing HBase and the ecosystem and later guides the reader through installation and configuration (of both the Apache and Cloudera flavours). Of course, the coding aspects such as scripting and Java development are not forgotten and there is a separate chapter on maintenance and troubleshooting, which is very handy to have. In my case, the decision for HBase had already been made, but the uses cases in the last chapter were nonetheless very interesting. The book is written in an accessible style with enough detail and coverage to get started with HBase. I am glad I bought it and can definitely recommend it as a starting point for the HBase beginner.

Amazon Verified review

D Mar 27, 2015

Learning HBASE book contains everything a beginner needs to get started with HBASE. The book provides the reader basic understanding of HBASE concepts as well as Hadoop and Zookeeper. The author does a nice job of walking through the reader with installing, running, using, and maintaining HBase.

Phani Mutyala Mar 24, 2015

I am in love with this book. The concept, the way the explanation has been give was solid. The best part I liked was the example that have been provided throughout the book and for every use case that was described. My Learning path on NoSQL got changed after my readings. I would certainly recommend this book to the audience that are beginners to NoSQL word.

Natalia Mar 20, 2015

This book has been a great source for me when I was studying for Cloudera HBase Specialist exam earlier this year (which I passed). I would say the audience for this book is people who have been exposed to various Big Data technologies and have good understanding of HDFS. Which was great for me because I did not have to skip sections that are often added to the books as an intro to Hadoop ecosystem. "HBase Use Cases" portion was very informative for me personally. Chapter on Coding HBase in Java covers material that would generally interest development team but I found it very useful for me (my duties can be mostly described as Hadoop administration) because information there is tied to HBase API topics that - as an admin - you should know.

varun Mar 16, 2015

I liked this book,this book is very useful for developer and administrator.He has covered all the issue and challenges faced in real time and best practices in implementation and developing of Hbase.

Learning Hbase: Learn the fundamentals of HBase administration and development with the help of real-time scenarios

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Note

Core daemons of Hadoop

Comparing HBase with Hadoop

Pros and cons of column-oriented databases

Note

Note

Why an odd number of ZooKeepers?

HMaster

If a master node goes down

RegionServer

Components of a RegionServer

Client

Catalog tables

Note

Note

Note

Note

Note

Note

Note

Note

Description

What you will learn

Product Details

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the author

FAQs

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access