You're reading from HBase Essentials

Product type: Book
Published in: Nov 2014
Reading level: Intermediate
ISBN-13: 9781783987245
Edition: 1st

Author: Nishant Garg

Nishant Garg has over 17 years of software architecture and development experience in various technologies, such as Java Enterprise Edition, SOA, Spring, Hadoop, Hive, Flume, Sqoop, Oozie, Spark, Shark, YARN, Impala, Kafka, Storm, Solr/Lucene, NoSQL databases (such as HBase, Cassandra, and MongoDB), and MPP databases (such as GreenPlum). He received his MS in software systems from the Birla Institute of Technology and Science, Pilani, India, and is currently working as a technical architect for the Big Data R&D Group at Impetus Infotech Pvt. Ltd. Previously, Nishant enjoyed working with some of the most recognizable names in the IT services and financial industries, employing full software life cycle methodologies such as Agile and SCRUM. Nishant has also undertaken many speaking engagements on big data technologies and is the author of Apache Kafka and HBase Essentials, both from Packt Publishing.

Chapter 4. The HBase Architecture

In the previous chapters, we learned the basic building blocks of HBase schema design and how to apply CRUD operations to the designed schema. In this chapter, we will look at HBase from an architectural viewpoint, covering the following topics:

  • Data storage

  • Data replication

  • Securing HBase

For most developers and users, the preceding topics are not of great interest, but for an administrator, it really pays to understand how the underlying data is stored and replicated within HBase. Administrators are the people who deal with HBase all the way from installation to cluster management (performance tuning, monitoring, failure recovery, data security, and so on).

By the end of this chapter, we will also get an insight into the integration of HBase and MapReduce. Let's start with data storage in HBase first.

Data storage


In HBase, tables are split into smaller chunks that are distributed across multiple servers. These smaller chunks are called regions, and the servers that host regions are called RegionServers. The master process handles the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. In the HBase implementation, the HRegionServer and HRegion classes represent the region server and the region, respectively. HRegionServer contains the set of HRegion instances available to the client and handles two types of files for data storage:

  • HLog (the write-ahead log file, also known as WAL)

  • HFile (the real data storage file)
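The division of labor between these two file types can be pictured with a small sketch. The following is illustrative Python, not HBase source code: an edit is appended to the write-ahead log first for durability, buffered in an in-memory store, and flushed to an immutable HFile-like file once enough edits accumulate (the class name, threshold, and data shapes here are invented for the example).

```python
# Sketch of a region server's write path: WAL first, then memstore,
# then a flush to an immutable on-disk file (the role an HFile plays).

class RegionWriteSketch:
    def __init__(self, flush_threshold=3):
        self.wal = []          # stands in for HLog (the write-ahead log)
        self.memstore = {}     # in-memory buffer of recent writes
        self.hfiles = []       # stands in for flushed HFiles on disk
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))   # durability first: log the edit
        self.memstore[row_key] = value      # then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # a sorted, immutable snapshot of the buffered edits
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()

region = RegionWriteSketch()
for i in range(4):
    region.put(f"row-{i}", f"v{i}")

print(len(region.hfiles))    # one flush happened after the third put
print(len(region.memstore))  # one edit is still buffered in memory
```

If the region server crashes before a flush, the edits still held only in the memstore can be replayed from the WAL, which is exactly why the log is written first.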

In HBase, there is a system-defined catalog table called hbase:meta that keeps the list of all the regions for user-defined tables.
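Conceptually, a catalog such as hbase:meta lets a client locate the region that hosts a given row key: regions cover sorted, contiguous row-key ranges, so the hosting region is the one with the greatest start key not exceeding the row key. The following Python sketch illustrates that lookup; the server names and start keys are hypothetical, and this is not the actual HBase client code.

```python
# Sketch of a catalog lookup: find the region (here, its hosting server)
# responsible for a row key, given regions sorted by start key.

import bisect

# Hypothetical meta entries: (region start key, hosting region server).
meta = [
    ("",  "rs1.example.com"),   # the first region starts at the empty key
    ("g", "rs2.example.com"),
    ("p", "rs3.example.com"),
]

start_keys = [start for start, _ in meta]

def locate(row_key):
    # The hosting region has the greatest start key <= row_key.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return meta[idx][1]

print(locate("apple"))   # rs1.example.com
print(locate("grape"))   # rs2.example.com
print(locate("zebra"))   # rs3.example.com
```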

Note

In older versions, prior to 0.96.0, HBase had two catalog tables called -ROOT- and .META. The -ROOT- table was used to keep track of the location of the .META. table. From version 0.96.0 onwards, the -ROOT- table...

Data replication


Data replication is the copying of data from one cluster to another by replaying the writes in the order in which the first cluster received them. Intercluster replication (even between geographically distant clusters) in HBase is achieved by asynchronous log shipping. Data replication serves as a disaster recovery solution and also provides higher availability at the HBase layer.

The master-push pattern used by HBase replication makes it easy to keep track of what is currently being replicated, as each region server has its own write-ahead log. One master cluster can replicate to any number of slave clusters. Each region server participates in replicating its own batch (the default size is 64 MB) of write-ahead edit records contained within the WAL.
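The per-region-server bookkeeping described above can be sketched as follows. This is illustrative Python, not the HBase replication implementation: each "region server" ships batches of its own WAL edits to a slave cluster asynchronously and remembers how far it has replicated (the 64 MB batch size is scaled down to two edits for the demonstration).

```python
# Sketch of master-push replication: a region server ships its own WAL
# edits to a slave cluster in batches, tracking its replication position.

class ReplicationSourceSketch:
    def __init__(self, batch_size=2):
        self.wal = []           # this region server's write-ahead edits
        self.shipped_upto = 0   # how far into the WAL we have replicated
        self.batch_size = batch_size

    def append(self, edit):
        # Writes return immediately; shipping happens asynchronously later.
        self.wal.append(edit)

    def ship_batch(self, slave):
        batch = self.wal[self.shipped_upto:self.shipped_upto + self.batch_size]
        slave.extend(batch)     # push the batch to the slave cluster
        self.shipped_upto += len(batch)

slave_cluster = []
source = ReplicationSourceSketch()
for edit in ["put:a", "put:b", "put:c"]:
    source.append(edit)

source.ship_batch(slave_cluster)   # ships the first two edits
source.ship_batch(slave_cluster)   # ships the remaining edit
print(slave_cluster)               # ['put:a', 'put:b', 'put:c']
```

Because the source only advances `shipped_upto` after a successful push, a slave that was temporarily unreachable simply receives the backlog on the next attempt, which is what makes the asynchronous scheme safe.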

The master-push pattern used for cluster replication can be designed in three different ways:

  • Master-slave replication: In this type of replication, all the writes go to the primary (master) cluster first and are then replicated to the secondary (slave) cluster. This type of enforcement...

Securing HBase


With the default configuration, HBase does not provide any kind of data security. Even with firewalls in place, HBase cannot differentiate between multiple users coming from the same client, and uniform data access is provided to all of them. From HBase version 0.92 onwards, HBase provides optional support for both user authentication and authorization. For user authentication, it provides integration points with Kerberos, and for authorization, it provides an access controller coprocessor.

Note

Kerberos is a network authentication protocol designed to provide strong authentication for client/server applications by using secret-key cryptography. Kerberos uses the Kerberos Key Distribution Center (KDC) as the authentication and ticket-granting server. The setup of a KDC is not in the scope of this book.

The access controller coprocessor is implemented only at the RPC level, and it is based on the Simple Authentication and Security Layer (SASL); the SASL...
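As a rough illustration of how these two features are switched on, the fragment below shows the kind of properties involved in hbase-site.xml. This is a sketch, not a complete secure-cluster setup: exact property sets vary by HBase version, and a working deployment also needs Kerberos principals, keytabs, and a secured underlying Hadoop cluster.

```xml
<!-- hbase-site.xml (fragment): enable Kerberos authentication and the
     AccessController coprocessor for authorization -->
<property>
  <name>hbase.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
```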

HBase and MapReduce


HBase has close integration with Hadoop's MapReduce, as it is built on top of the Apache Hadoop framework. Hadoop's MapReduce provides distributed computation for high-throughput data access, while the Hadoop Distributed File System (HDFS) provides HBase with a storage layer offering high availability, reliability, and durability.

Before we go into more details of how HBase integrates with Hadoop's MapReduce framework, let's first understand how this framework actually works.

Hadoop MapReduce

We need a system that can process terabytes or petabytes of data and whose performance increases linearly with the number of physical machines added. Apache Hadoop's MapReduce framework is designed to provide exactly this kind of linearly scalable processing power for huge amounts of data.

Let's discuss how MapReduce processes the data described in the preceding diagram. In MapReduce, the first step is the split process, which is responsible for dividing the input data into reasonably sized chunks...
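The phases of a MapReduce job (the split step described above, followed by map, shuffle, and reduce) can be sketched in a few lines of plain Python. This is a conceptual word-count simulation with no Hadoop involved; the function names are invented for the example.

```python
# In-memory sketch of the MapReduce phases: each split is mapped to
# (key, value) pairs, the shuffle groups values by key, and the reduce
# phase combines each key's values into a final result.

from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    # Map: each split is processed independently, emitting (key, value) pairs.
    mapped = [pair for split in splits for pair in map_fn(split)]
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: combine each key's values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_map(line):
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    return sum(counts)

splits = ["hbase stores data", "hadoop processes data"]
result = run_mapreduce(splits, word_map, word_reduce)
print(result["data"])   # 2 -- "data" appears once in each split
```

In real Hadoop, the splits live in HDFS blocks and the map and reduce tasks run on different machines, but the data flow is the same as in this sketch.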

Summary


In this chapter, we learned the internals of how HBase stores its data. We also learned the basics of HBase cluster replication. In the last part, we got an overview of Hadoop MapReduce and covered MapReduce execution over HBase using examples.

In the next chapter, we will look into the HBase advanced API used for counters and coprocessors, along with advanced configurations.
