Demystifying Hadoop Ecosystem Components

We have gone through the Apache Hadoop subsystems in detail in the previous chapters. Although Hadoop is best known for its core components, such as HDFS, MapReduce, and YARN, it also offers a whole ecosystem of supporting components that ensure your business needs are addressed end to end. One key reason behind this evolution is that Hadoop's core components offer processing and storage in a raw form, which requires an extensive amount of investment when building software from the ground up.

The ecosystem components built on top of Hadoop therefore enable rapid application development, while providing better fault tolerance, security, and performance than custom development done directly on Hadoop.

In this chapter, we cover the following topics:

  • Understanding Hadoop's Ecosystem
  • Working with Apache Kafka
  • Writing...

Technical requirements

You will need the Eclipse development environment and Java 8 installed on a system where you can run and tweak these examples. If you prefer to use Maven, you will need Maven installed to compile the code. To run the examples, you also need an Apache Hadoop 3.1 setup on a Linux system. Finally, to use this book's Git repository, you need to install Git.

The code files of this chapter can be found on GitHub:
https://github.com/PacktPublishing/Apache-Hadoop-3-Quick-Start-Guide/tree/master/Chapter7

Check out the following video to see the code in action:

http://bit.ly/2SBdnr4

Understanding Hadoop's Ecosystem

Hadoop is often used for historical data analytics, although a new trend is emerging in which it is used for real-time data streaming as well. Considering the offerings of Hadoop's ecosystem, we have broadly grouped them into the following categories:

  • Data flow: This includes components that can transfer data between Hadoop and other subsystems, covering real-time, batch, micro-batch, and event-driven data movement.
  • Data engine and frameworks: These provide programming capabilities on top of Hadoop YARN or MapReduce.
  • Data storage: This category covers all types of data storage built on top of HDFS.
  • Machine learning and analytics: This category covers big data analytics and machine learning on top of Apache Hadoop.
  • Search engine: This category covers search engines over both structured and unstructured data in Hadoop.
  • Management...

Working with Apache Kafka

Apache Kafka provides a data streaming pipeline across the cluster through its messaging service. It ensures a high degree of fault tolerance and message reliability through its architecture, and it guarantees that the ordering of messages coming from a producer is maintained. A record in Kafka is a key-value pair along with a timestamp, and it is associated with a topic name. A topic is a category of records over which the communication takes place.

Kafka supports producer-consumer messaging, which means that producers publish messages that are then delivered to consumers. Kafka maintains a queue of messages, and each message has an offset that represents its position, or index, in the queue. Kafka can be deployed on a multi-node cluster, as shown in the following diagram, where two producers and three consumers have been used as an example:

Producers produce multiple topics through producer...
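
To make the producer side concrete, the following is a minimal sketch of a Java producer built with Kafka's standard client API. The broker address (localhost:9092), the topic name (events), and the record keys and values are assumptions made for illustration rather than values taken from this chapter's code repository:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleKafkaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of one or more Kafka brokers (assumed to be running locally here)
        props.put("bootstrap.servers", "localhost:9092");
        // Records are (key, value) pairs; both are plain strings in this sketch
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Each record is appended to the "events" topic; Kafka assigns the
                // offset and timestamp when the record is written to a partition
                producer.send(new ProducerRecord<>("events", "key-" + i, "message-" + i));
            }
        }
    }
}

A consumer would subscribe to the same topic and read records in offset order, which is how Kafka preserves the ordering of messages from a producer within a partition.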

Writing Apache Pig scripts

Apache Pig allows users to write custom scripts on top of the MapReduce framework. Pig was created to offer flexible data programming over large data sets, particularly for non-Java programmers. Pig can apply multiple transformations to input data in order to produce output, running either on a single Java virtual machine or on an Apache Hadoop multi-node cluster. Pig can be used as part of ETL (Extract, Transform, Load) implementations for any big data project.

Setting up Apache Pig in your Hadoop environment is relatively easy compared to other software; all you need to do is download the Pig source and build it into a pig.jar file, which can then be used by your programs. Pig-generated compiled artifacts can be deployed on a standalone JVM, Apache Spark, Apache Tez, or MapReduce, and Pig supports six different execution environments (both local and distributed). The respective...
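
To show how a Pig script can be driven from Java, here is a minimal sketch that uses Pig's embedded PigServer API in local mode. The input file, its field layout, and the output path are assumptions made for the example, not files from this book's repository:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Local mode runs on a single JVM; use ExecType.MAPREDUCE to run on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load a space-delimited log file, group it by IP, and count hits per IP
        pig.registerQuery("logs = LOAD 'input/access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY ip;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");

        // Materialize the result relation into the output directory
        pig.store("counts", "output/ip_counts");
    }
}

The same statements could be saved as a .pig script and submitted with the pig command-line tool instead of being embedded in Java.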

Transferring data with Sqoop

The beauty of Apache Hadoop lies in its ability to work with multiple data formats. HDFS can reliably store information flowing in from a variety of data sources, but Hadoop requires external interfaces to interact with storage repositories outside of HDFS. Sqoop addresses part of this problem by allowing users to extract structured data from a relational database into Apache Hadoop. In the other direction, raw data can be processed in Hadoop and the final results shared with traditional databases, thanks to Sqoop's bidirectional interfacing capabilities.
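
As a sketch of what a typical import looks like, the following command pulls a table from MySQL into HDFS using the classic sqoop command-line client; the JDBC URL, username, table name, and target directory are placeholder assumptions:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username reporting \
  -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4

The -P flag prompts for the database password, --target-dir controls where the imported files land in HDFS, and --num-mappers sets how many parallel map tasks perform the transfer. The reverse direction uses sqoop export, with --export-dir pointing at the HDFS data to be pushed back into the database.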

Sqoop can be downloaded directly from the Apache site, and it supports a client-server architecture. The server can be installed on one of the nodes, where it acts as a gateway for all Sqoop activity. A client can be installed on any machine, and it will eventually connect with the server...

Writing Flume jobs

Apache Flume offers a service for feeding logs containing unstructured information into Hadoop. Flume works with virtually any type of data source: it can receive both log data and continuous event data, consuming events and incremental logs from sources such as application servers, as well as social media events.

The following diagram illustrates how Flume works. When Flume receives an event, it is persisted in a channel (a data store such as the local file system) before it is removed and pushed to the target by a sink. In the case of Flume, a target can be HDFS storage, Amazon S3, or another custom application:

Flume also supports multiple Flume agents, as shown in the preceding data flow. Data can be collected, aggregated, and then processed through a complex multi-agent workflow that is completely customizable by the end user. Flume provides message reliability...
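
To give a feel for how a Flume job is wired together, the following is a minimal sketch of an agent configuration (a plain properties file) with one source, one channel, and an HDFS sink. The agent and component names, the tailed log file, and the HDFS path are assumptions made for illustration:

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: tail an application log and feed each line into the channel
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/server.log
agent1.sources.src1.channels = ch1

# Channel: buffer events on the local file system until the sink drains them
agent1.channels.ch1.type = file

# Sink: push events from the channel into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel = ch1

The agent is then started with the flume-ng agent command, passing the agent name (agent1) and this configuration file.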

Understanding Hive

Apache Hive was developed at Facebook, primarily to address the data warehousing requirements of the Hadoop platform. It was created so that analysts with strong SQL skills could run queries on the Hadoop cluster for data analytics. Although we often talk about going unstructured and using NoSQL, Apache Hive still has a firm place in today's big data landscape.

Apache Hive provides an SQL-like query language called HiveQL. Hive queries can be deployed as jobs on MapReduce, Apache Tez, or Apache Spark, which in turn can utilize the YARN engine to run programs. Just like an RDBMS, Apache Hive provides indexing support, with different index types such as bitmap indexes, over your HDFS data storage. Data can be stored in different formats, such as ORC, Parquet, TextFile, SequenceFile, and so on.
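
As a concrete sketch, the following Java snippet submits HiveQL through the HiveServer2 JDBC interface. The connection URL, credentials, table name, and ORC storage clause are illustrative assumptions, and the Hive JDBC driver is expected to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; host, port, and database are placeholders
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Create a table backed by the ORC file format on HDFS
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views "
                    + "(user_id STRING, url STRING, ts TIMESTAMP) STORED AS ORC");
            // Run an aggregate query; Hive compiles it into jobs on the configured engine
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}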

Hive querying also supports extended User Defined Functions...

Using HBase for NoSQL storage

Apache HBase provides distributed, column-oriented, key-value-based storage on Apache Hadoop. It is best suited to workloads that need random reads and writes over large and varying data stores. HBase is capable of distributing and sharding its data across multiple nodes of an Apache Hadoop cluster, and it also provides high availability through automatic failover from one region server to another. Apache HBase can be run in two modes: standalone and distributed. In standalone mode, HBase does not use HDFS and instead uses a local directory by default, whereas distributed mode works on HDFS.

Apache HBase stores its data across multiple rows and columns, where each row consists of a row key and one or more columns containing values. A value can hold one or more attributes. Column families are sets of columns that are collocated together for performance reasons...
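
The following is a minimal sketch of the HBase Java client API writing and then reading back a single cell. The table name (users), the column family (info), and the qualifier (email) are assumptions for illustration, and an hbase-site.xml on the classpath is expected to point at the cluster (or at a standalone instance):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // Picks up connection details (ZooKeeper quorum and so on) from hbase-site.xml
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Write one value: row key "user1001", column family "info", qualifier "email"
            Put put = new Put(Bytes.toBytes("user1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("user@example.com"));
            table.put(put);

            // Read the same cell back by row key
            Result result = table.get(new Get(Bytes.toBytes("user1001")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println("email = " + Bytes.toString(email));
        }
    }
}

The table and its info column family are assumed to have been created beforehand, for example from the HBase shell.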

Summary

In this chapter, we studied the different components of Hadoop's wider ecosystem and the tools they provide for solving many complex industrial problems. We went through a brief overview of the tools and software that run on Hadoop, specifically Apache Kafka, Apache Pig, Apache Sqoop, and Apache Flume. We also covered SQL- and NoSQL-based databases on Hadoop, namely Hive and HBase respectively.

In the next chapter, we will take a look at some analytics components along with more advanced topics in Hadoop.
