Chapter 7. Widely Used Hadoop Ecosystem Components

Since the invention of Hadoop, many tools have been developed around its ecosystem. These tools are used for data ingestion, data processing, and storage, and they solve some of the problems Hadoop initially had. In this chapter, we will focus on Apache Pig, a distributed processing tool built on top of MapReduce. We will also look into two widely used ingestion tools, namely Apache Kafka and Apache Flume, and discuss how they are used to bring in data from multiple sources. Apache HBase will also be described in this chapter; we will cover its architecture in detail and how it fits into the CAP theorem. In this chapter, we will cover the following topics:

  • Apache Pig architecture 
  • Writing custom user-defined functions (UDFs) in Pig
  • Apache HBase walkthrough 
  • CAP theorem
  • Apache Kafka internals 
  • Building producer and consumer applications
  • Apache Flume and its architecture
  • Building custom source, sink, and interceptor
  • Example of bringing data...

Technical requirements


You will be required to have basic knowledge of Linux and Apache Hadoop 3.0.

The code files of this chapter can be found on GitHub: https://github.com/PacktPublishing/Mastering-Hadoop-3/tree/master/Chapter07

Check out the following video to see the code in action: http://bit.ly/2NyN1DX

Pig


Hadoop had MapReduce as its processing engine when it first started, and Java was the primary language used for writing MapReduce jobs. Since Hadoop was mostly used as an analytics processing framework, a large chunk of its use cases involved data mining on legacy data warehouses, and these data warehouse applications were migrated to Hadoop. Most users of legacy data warehouses had SQL as their core expertise, and learning a new programming language was time-consuming. It was therefore better to have a framework that could help SQL-skilled people write MapReduce jobs in an SQL-like language. Apache Pig was invented for this purpose. It also solved the complexity of writing multi-stage MapReduce pipelines, where the output of one job becomes the input to another.

Note

Apache Pig is a distributed processing tool that is an abstraction over MapReduce and is used to process large datasets representing data flows. Apache Pig on Apache Spark is also an option that the open source...
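
One of the topics listed for this chapter is writing custom user-defined functions (UDFs) in Pig. As a flavour of what that looks like, the following is a minimal sketch of a Java UDF built on Pig's EvalFunc API; the class and package names here are hypothetical and not taken from the book's own listings.

// A minimal sketch of a Pig EvalFunc UDF (hypothetical class name UpperCase);
// it upper-cases the first field of each input tuple.
package com.example.pig;   // hypothetical package

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty or null tuples so a bad record does not fail the job
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once packaged as a JAR, a function like this would be registered in a Pig Latin script with the REGISTER statement and then invoked like any built-in function.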

HBase


Although Hadoop grew popular after its invention, it was still only suitable for batch processing use cases, where a huge set of data could be processed in a single batch. Hadoop came from Google research papers: the Hadoop Distributed File System (HDFS) was derived from the Google File System paper, and MapReduce from the Google MapReduce paper. Google has one more popular product, Bigtable, and HBase was created along similar lines to support random read/write access over large sets of data. HBase runs on top of Hadoop and uses the scalability of Hadoop by running over its HDFS daemons, providing real-time data access as a key/value store.

Note

Apache HBase is an open source, distributed, NoSQL database that provides real-time random read/write access to large datasets over HDFS.

HBase architecture and its concept

Apache HBase is a distributed, column-oriented storage database that also follows a master/slave architecture. The following is a pictorial representation of the HBase architecture and its components...
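
To illustrate the random read/write access described previously, here is a minimal sketch of an HBase client interaction using the Java API; the table name user_activity, the column family info, and the row key are assumptions for illustration, not taken from the book.

// A minimal sketch of random write and read against a hypothetical HBase table
// 'user_activity' with column family 'info'.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_activity"))) {
            // Random write: a Put keyed by row key goes to exactly one region
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read: a Get fetches the row directly by key, without scanning the dataset
            Result result = table.get(new Get(Bytes.toBytes("user#1001")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        }
    }
}

Because each row key maps to exactly one region, both the Put and the Get here touch only the region server holding that key, which is what keeps single-row access fast even over very large tables.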

Kafka


The LinkedIn portal is likely to be one of the most used portals in your professional career. The Kafka system was first introduced by the LinkedIn technical team. LinkedIn built a software metrics tool using custom in-house components, with minor support from existing open source tools. The system collected user activity data on the portal and used this activity data to show relevant information to each user on the web portal. The system was originally built as a traditional XML-based logging service, which was processed with different extract, transform, load (ETL) tools. However, this arrangement did not work well, and they started running into various problems. To solve these problems, they built a system called Kafka. LinkedIn built Kafka as a distributed, fault-tolerant, publish/subscribe system. It records messages organized into topics; applications can produce or consume messages from topics. All messages are stored as logs on persistent filesystems. Kafka is a write-ahead log (WAL) system that writes all published...
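
One of the chapter's listed topics is building producer and consumer applications. As a taste of the producer side, the following is a minimal sketch of a Java producer using the Kafka client API; the broker address localhost:9092, the topic name user-activity, and the key/value contents are assumptions for illustration.

// A minimal sketch of a Kafka producer publishing one message to a topic.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The message is appended to the topic's log and replicated by the brokers
            producer.send(new ProducerRecord<>("user-activity", "user-1001", "page_view:/jobs"));
        }
    }
}

A consumer application would follow the same pattern with KafkaConsumer, subscribing to the topic and polling for records.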

Flume


The first step in any data pipeline is data ingestion, which brings data from the source system for the necessary processing. There are different types of source systems available, and there are different tools specific to bringing data from each of them. The Big Data ecosystem has its own set of tools for this; for example, Sqoop can be used to bring data from relational databases, while Gobblin can bring data from relational databases, REST APIs, FTP servers, and so on.

Apache Flume is a Java-based, distributed, scalable, fault-tolerant system for consuming data from streaming sources, such as Twitter, log servers, and so on. At one time it was widely used across different use cases, and a large number of pipelines still use Flume (specifically as a producer to Kafka).

Apache Flume architecture

The producer and consumer problem existed even before Hadoop. The common problem that producers and consumers face is the producer producing data faster than the consumer...
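
Since building a custom source, sink, and interceptor is one of the topics listed for this chapter, the following is a minimal sketch of a custom Flume interceptor in Java; the class name and the header key ingest_ts are hypothetical, not taken from the book's own example.

// A minimal sketch of a custom Flume interceptor that stamps each event
// with an ingestion-time header before it reaches the channel.
import java.util.List;
import java.util.Map;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class TimestampHeaderInterceptor implements Interceptor {
    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        // Add an ingestion timestamp header; the event body is passed through untouched
        Map<String, String> headers = event.getHeaders();
        headers.put("ingest_ts", String.valueOf(System.currentTimeMillis()));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() { }

    // Flume instantiates interceptors through a Builder declared in the agent configuration
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TimestampHeaderInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}

In an agent configuration, this interceptor would be attached to a source via its Builder class, so every event is stamped before it is written to the channel.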

Summary


In this chapter, we learned about the detailed architecture and uses of a few widely used Hadoop ecosystem components, such as Apache Pig, Apache HBase, Apache Kafka, and Apache Flume. We covered a few examples of writing custom UDFs in Apache Pig, writing a custom source, sink, and interceptor in Apache Flume, writing a producer and consumer in Apache Kafka, and so on. Our focus was also on how to install and set up these components for practical use.

In the next chapter, our focus will be on some advanced topics in Big Data, covering some useful fundamental techniques and concepts, such as compression, file format and serialization techniques, and some important pillars of data governance.
