Chapter 7. Widely Used Hadoop Ecosystem Components

Since the invention of Hadoop, many tools have been developed around its ecosystem. These tools are used for data ingestion, data processing, and storage, and they solve some of the problems Hadoop initially had. In this chapter, we will focus on Apache Pig, a distributed processing tool built on top of MapReduce. We will also look into two widely used ingestion tools, namely Apache Kafka and Apache Flume, and discuss how they are used to bring in data from multiple sources. Apache HBase will also be described in this chapter; we will cover its architecture in detail and how it fits into the CAP theorem. In this chapter, we will cover the following topics:

  • Apache Pig architecture 
  • Writing custom user-defined functions (UDFs) in Pig
  • Apache HBase walkthrough 
  • CAP theorem
  • Apache Kafka internals 
  • Building producer and consumer applications
  • Apache Flume and its architecture
  • Building custom source, sink, and interceptor
  • Example of bringing data...

Technical requirements


You will be required to have basic knowledge of Linux and Apache Hadoop 3.0.

The code files of this chapter can be found on GitHub: https://github.com/PacktPublishing/Mastering-Hadoop-3/tree/master/Chapter07

Check out the following video to see the code in action: http://bit.ly/2NyN1DX

Pig


Hadoop had MapReduce as its processing engine when it first started, and Java was the primary language used for writing MapReduce jobs. Since Hadoop was mostly used as an analytics processing framework, a large chunk of its use cases involved data mining on legacy data warehouses, and these data warehouse applications were migrated to Hadoop. Most users of legacy data warehouses had SQL as their core expertise, and learning a new programming language was time-consuming. It was therefore better to have a framework that could help SQL-skilled people write MapReduce jobs in an SQL-like language. Apache Pig was invented for this purpose. It also solved the complexity of writing multi-stage MapReduce pipelines, where the output of one job becomes the input to another.

Note

Apache Pig is a distributed processing tool that is an abstraction over MapReduce and is used to process large datasets representing data flows. Apache Pig on Apache Spark is also an option that the open source...
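
One of the topics listed for this chapter is writing custom user-defined functions (UDFs) in Pig. As a flavour of what that looks like, the following is a minimal sketch of a Java UDF built on Pig's EvalFunc API; the class and package names here are hypothetical and not taken from the book's own listings.

// A minimal sketch of a Pig EvalFunc UDF (hypothetical class name UpperCase);
// it upper-cases the first field of each input tuple.
package com.example.pig;   // hypothetical package

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Guard against empty or null tuples so a bad record does not fail the job
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}

Once packaged as a JAR, a function like this would be registered in a Pig Latin script with the REGISTER statement and then invoked like any built-in function.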

HBase


Although Hadoop grew popular after its invention, it was still only suitable for batch processing use cases, where a huge set of data could be processed in a single batch. Hadoop came from Google research papers: the Hadoop Distributed File System (HDFS) was derived from the Google File System paper, and MapReduce from the Google MapReduce paper. Google has one more popular product, Bigtable, and HBase was created along similar lines to support random read/write access over large sets of data. HBase runs on top of Hadoop and uses the scalability of Hadoop by running over its HDFS daemons, providing real-time data access as a key/value store.

Note

Apache HBase is an open source, distributed, NoSQL database that provides real-time random read/write access to large datasets over HDFS.

HBase architecture and its concept

Apache HBase is a distributed, column-oriented storage database that also follows a master/slave architecture. The following is a pictorial representation of the HBase architecture and its components...
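
To illustrate the random read/write access described previously, here is a minimal sketch of an HBase client interaction using the Java API; the table name user_activity, the column family info, and the row key are assumptions for illustration, not taken from the book.

// A minimal sketch of random write and read against a hypothetical HBase table
// 'user_activity' with column family 'info'.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_activity"))) {
            // Random write: a Put keyed by row key goes to exactly one region
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read: a Get fetches the row directly by key, without scanning the dataset
            Result result = table.get(new Get(Bytes.toBytes("user#1001")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        }
    }
}

Because each row key maps to exactly one region, both the Put and the Get here touch only the region server holding that key, which is what keeps single-row access fast even over very large tables.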

Kafka


The LinkedIn portal is likely to be one of the most used portals in your professional career. The Kafka system was first introduced by the LinkedIn technical team. LinkedIn built a software metrics tool using custom in-house components, with minor support from existing open source tools. The system collected user activity data on the portal and used this activity data to show relevant information to each user on the web portal. The system was originally built as a traditional XML-based logging service, which was processed with different extract, transform, load (ETL) tools. However, this arrangement did not work well, and they started running into various problems. To solve these problems, they built a system called Kafka. LinkedIn built Kafka as a distributed, fault-tolerant, publish/subscribe system. It records messages organized into topics; applications can produce or consume messages from topics. All messages are stored as logs on persistent filesystems. Kafka is a write-ahead log (WAL) system that writes all published...
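
One of the chapter's listed topics is building producer and consumer applications. As a taste of the producer side, the following is a minimal sketch of a Java producer using the Kafka client API; the broker address localhost:9092, the topic name user-activity, and the key/value contents are assumptions for illustration.

// A minimal sketch of a Kafka producer publishing one message to a topic.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // The message is appended to the topic's log and replicated by the brokers
            producer.send(new ProducerRecord<>("user-activity", "user-1001", "page_view:/jobs"));
        }
    }
}

A consumer application would follow the same pattern with KafkaConsumer, subscribing to the topic and polling for records.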

Flume


The first step in any data pipeline is data ingestion, which brings data from the source system for the necessary processing. There are different types of source systems available, and there are different tools specific to bringing data from each of them. The Big Data ecosystem has its own set of tools for this; for example, Sqoop can be used to bring data from relational databases, while Gobblin can bring data from relational databases, REST APIs, FTP servers, and so on.

Apache Flume is a Java-based, distributed, scalable, fault-tolerant system for consuming data from streaming sources, such as Twitter, log servers, and so on. At one time it was widely used across different use cases, and a large number of pipelines still use Flume (specifically as a producer to Kafka).

Apache Flume architecture

The producer and consumer problem existed even before Hadoop. The common problem that producers and consumers face is the producer producing data faster than the consumer...
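
Since building a custom source, sink, and interceptor is one of the topics listed for this chapter, the following is a minimal sketch of a custom Flume interceptor in Java; the class name and the header key ingest_ts are hypothetical, not taken from the book's own example.

// A minimal sketch of a custom Flume interceptor that stamps each event
// with an ingestion-time header before it reaches the channel.
import java.util.List;
import java.util.Map;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class TimestampHeaderInterceptor implements Interceptor {
    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        // Add an ingestion timestamp header; the event body is passed through untouched
        Map<String, String> headers = event.getHeaders();
        headers.put("ingest_ts", String.valueOf(System.currentTimeMillis()));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() { }

    // Flume instantiates interceptors through a Builder declared in the agent configuration
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new TimestampHeaderInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}

In an agent configuration, this interceptor would be attached to a source via its Builder class, so every event is stamped before it is written to the channel.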

Summary


In this chapter, we learned about the detailed architecture and uses of a few widely used Hadoop ecosystem components, such as Apache Pig, Apache HBase, Apache Kafka, and Apache Flume. We covered a few examples of writing custom UDFs in Apache Pig, writing a custom source, sink, and interceptor in Apache Flume, writing a producer and consumer in Apache Kafka, and so on. Our focus was also on how to install and set up these components for practical use.

In the next chapter, our focus will be on some advanced topics in Big Data, covering some useful fundamental techniques and concepts, such as compression, file format and serialization techniques, and some important pillars of data governance.
