You're reading from HBase Essentials

Product type: Book
Published in: Nov 2014
Reading level: Intermediate
ISBN-13: 9781783987245
Edition: 1st

Author: Nishant Garg

Nishant Garg has over 17 years of software architecture and development experience in various technologies, such as Java Enterprise Edition, SOA, Spring, Hadoop, Hive, Flume, Sqoop, Oozie, Spark, Shark, YARN, Impala, Kafka, Storm, Solr/Lucene, NoSQL databases (such as HBase, Cassandra, and MongoDB), and MPP databases (such as GreenPlum). He received his MS in software systems from the Birla Institute of Technology and Science, Pilani, India, and is currently working as a technical architect for the Big Data R&D Group at Impetus Infotech Pvt. Ltd. Previously, Nishant enjoyed working with some of the most recognizable names in the IT services and financial industries, employing full software life cycle methodologies such as Agile and SCRUM. Nishant has also undertaken many speaking engagements on big data technologies and is the author of Apache Kafka and HBase Essentials, both from Packt Publishing.

Chapter 4. The HBase Architecture

In the previous chapters, we learned the basic building blocks of HBase schema design and how to apply CRUD operations to the designed schema. In this chapter, we will look at HBase from an architectural viewpoint, covering the following topics:

  • Data storage

  • Data replication

  • Securing HBase

For most developers and users, the preceding topics are not of great interest, but for an administrator, it really pays to understand how the underlying data is stored and replicated within HBase. Administrators are the people who deal with HBase all the way from installation to cluster management (performance tuning, monitoring, failure recovery, data security, and so on).

By the end of this chapter, we will also get an insight into the integration of HBase and MapReduce. Let's start with data storage in HBase first.

Data storage


In HBase, tables are split into smaller chunks that are distributed across multiple servers. These smaller chunks are called regions, and the servers that host regions are called RegionServers. The master process handles the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. In the HBase implementation, the HRegionServer and HRegion classes represent the region server and the region, respectively. HRegionServer contains the set of HRegion instances available to the client and handles two types of files for data storage:

  • HLog (the write-ahead log file, also known as WAL)

  • HFile (the real data storage file)
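The division of labor between these two file types can be pictured with a small sketch. The following is illustrative Python, not HBase source code: an edit is appended to the write-ahead log first for durability, buffered in an in-memory store, and flushed to an immutable HFile-like file once enough edits accumulate (the class name, threshold, and data shapes here are invented for the example).

```python
# Sketch of a region server's write path: WAL first, then memstore,
# then a flush to an immutable on-disk file (the role an HFile plays).

class RegionWriteSketch:
    def __init__(self, flush_threshold=3):
        self.wal = []          # stands in for HLog (the write-ahead log)
        self.memstore = {}     # in-memory buffer of recent writes
        self.hfiles = []       # stands in for flushed HFiles on disk
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # durability first: log the edit
        self.memstore[row_key] = value      # then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # a sorted, immutable snapshot of the buffered edits
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()

region = RegionWriteSketch()
for i in range(4):
    region.put(f"row-{i}", f"v{i}")

print(len(region.hfiles))    # one flush happened after the third put
print(len(region.memstore))  # one edit is still buffered in memory
```

If the region server crashes before a flush, the edits still held only in the memstore can be replayed from the WAL, which is exactly why the log is written first.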

In HBase, there is a system-defined catalog table called hbase:meta that keeps the list of all the regions for user-defined tables.
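Conceptually, a catalog such as hbase:meta lets a client locate the region that hosts a given row key: regions cover sorted, contiguous row-key ranges, so the hosting region is the one with the greatest start key not exceeding the row key. The following Python sketch illustrates that lookup; the server names and start keys are hypothetical, and this is not the actual HBase client code.

```python
# Sketch of a catalog lookup: find the region (here, its hosting server)
# responsible for a row key, given regions sorted by start key.

import bisect

# Hypothetical meta entries: (region start key, hosting region server).
meta = [
    ("",  "rs1.example.com"),   # the first region starts at the empty key
    ("g", "rs2.example.com"),
    ("p", "rs3.example.com"),
]

start_keys = [start for start, _ in meta]

def locate(row_key):
    # The hosting region has the greatest start key <= row_key.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return meta[idx][1]

print(locate("apple"))   # rs1.example.com
print(locate("grape"))   # rs2.example.com
print(locate("zebra"))   # rs3.example.com
```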

Note

In older versions, prior to 0.96.0, HBase had two catalog tables called -ROOT- and .META. The -ROOT- table was used to keep track of the location of the .META. table. From version 0.96.0 onwards, the -ROOT- table...

Data replication


Data replication is the copying of data from one cluster to another by replaying the writes in the order in which the first cluster received them. Intercluster replication (even between geographically distant clusters) in HBase is achieved by asynchronous log shipping. Data replication serves as a disaster recovery solution and also provides higher availability at the HBase layer.

The master-push pattern used by HBase replication makes it easy to keep track of what is currently being replicated, as each region server has its own write-ahead log. One master cluster can replicate to any number of slave clusters. Each region server participates in replicating its own batch (the default size is 64 MB) of write-ahead edit records contained within the WAL.
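The per-region-server bookkeeping described above can be sketched as follows. This is illustrative Python, not the HBase replication implementation: each "region server" ships batches of its own WAL edits to a slave cluster asynchronously and remembers how far it has replicated (the 64 MB batch size is scaled down to two edits for the demonstration).

```python
# Sketch of master-push replication: a region server ships its own WAL
# edits to a slave cluster in batches, tracking its replication position.

class ReplicationSourceSketch:
    def __init__(self, batch_size=2):
        self.wal = []           # this region server's write-ahead edits
        self.shipped_upto = 0   # how far into the WAL we have replicated
        self.batch_size = batch_size

    def append(self, edit):
        # Writes return immediately; shipping happens asynchronously later.
        self.wal.append(edit)

    def ship_batch(self, slave):
        batch = self.wal[self.shipped_upto:self.shipped_upto + self.batch_size]
        slave.extend(batch)     # push the batch to the slave cluster
        self.shipped_upto += len(batch)

slave_cluster = []
source = ReplicationSourceSketch()
for edit in ["put:a", "put:b", "put:c"]:
    source.append(edit)

source.ship_batch(slave_cluster)   # ships the first two edits
source.ship_batch(slave_cluster)   # ships the remaining edit
print(slave_cluster)               # ['put:a', 'put:b', 'put:c']
```

Because the source only advances `shipped_upto` after a successful push, a slave that was temporarily unreachable simply receives the backlog on the next attempt, which is what makes the asynchronous scheme safe.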

The master-push pattern used for cluster replication can be designed in three different ways:

  • Master-slave replication: In this type of replication, all the writes go to the primary (master) cluster first and are then replicated to the secondary (slave) cluster. This type of enforcement...

Securing HBase


With the default configuration, HBase does not provide any kind of data security. Even with firewalls in place, HBase cannot differentiate between multiple users coming from the same client, and uniform data access is provided to all of them. From HBase version 0.92 onwards, HBase provides optional support for both user authentication and authorization. For user authentication, it provides integration points with Kerberos, and for authorization, it provides an access controller coprocessor.

Note

Kerberos is a network authentication protocol designed to provide strong authentication for client/server applications by using secret-key cryptography. Kerberos uses the Kerberos Key Distribution Center (KDC) as the authentication and ticket-granting server. The setup of a KDC is not in the scope of this book.

The access controller coprocessor is implemented only at the RPC level, and it is based on the Simple Authentication and Security Layer (SASL); the SASL...
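As a rough illustration of how these two features are switched on, the fragment below shows the kind of properties involved in hbase-site.xml. This is a sketch, not a complete secure-cluster setup: exact property sets vary by HBase version, and a working deployment also needs Kerberos principals, keytabs, and a secured underlying Hadoop cluster.

```xml
<!-- hbase-site.xml (fragment): enable Kerberos authentication and the
     AccessController coprocessor for authorization -->
<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
```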

HBase and MapReduce


HBase has close integration with Hadoop's MapReduce, as it is built on top of the Apache Hadoop framework. Hadoop's MapReduce provides distributed computation for high-throughput data access, while the Hadoop Distributed File System (HDFS) provides HBase with a storage layer offering high availability, reliability, and durability.

Before we go into more details of how HBase integrates with Hadoop's MapReduce framework, let's first understand how this framework actually works.

Hadoop MapReduce

We need a system that can process terabytes or petabytes of data and whose performance increases linearly with the number of physical machines added. Apache Hadoop's MapReduce framework is designed to provide exactly this kind of linearly scalable processing power for huge amounts of data.

Let's discuss how MapReduce processes the data described in the preceding diagram. In MapReduce, the first step is the split process, which is responsible for dividing the input data into reasonably sized chunks...
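The phases of a MapReduce job (the split step described above, followed by map, shuffle, and reduce) can be sketched in a few lines of plain Python. This is a conceptual word-count simulation with no Hadoop involved; the function names are invented for the example.

```python
# In-memory sketch of the MapReduce phases: each split is mapped to
# (key, value) pairs, the shuffle groups values by key, and the reduce
# phase combines each key's values into a final result.

from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    # Map: each split is processed independently, emitting (key, value) pairs.
    mapped = [pair for split in splits for pair in map_fn(split)]
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: combine each key's values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return sum(counts)

splits = ["hbase stores data", "hadoop processes data"]
result = run_mapreduce(splits, word_map, word_reduce)
print(result["data"])   # 2 -- "data" appears once in each split
```

In real Hadoop, the splits live in HDFS blocks and the map and reduce tasks run on different machines, but the data flow is the same as in this sketch.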

Summary


In this chapter, we learned the internals of how HBase stores its data. We also learned the basics of HBase cluster replication. In the last part, we got an overview of Hadoop MapReduce and covered MapReduce execution over HBase using examples.

In the next chapter, we will look into the HBase advanced API used for counters and coprocessors, along with advanced configurations.
