
Chapter 6. Data Acquisition of Stream Data using Apache Flume

To continue with our approach of exploring the various technologies and layers in the Data Lake, this chapter covers another technology used in the data acquisition layer. As in the previous chapter (and, in fact, every other chapter in this part of the book), we will first start with the overall context within the Data Lake and then delve deep into the selected technology.

Before delving deep into the chosen technology, we will give our reasons for choosing it and familiarize you with enough detail that you can go back to your enterprise and start putting it into action.

This chapter deals with Apache Flume, the second technology in the data acquisition layer. We will start off lightly with Apache Flume and then dive into the nitty-gritty. Finally, we will show you a working Flume example linked to our SCV use case. The final...

Context in Data Lake: data acquisition


One of the V's of Big Data, namely Velocity, makes this chapter significant for any modern enterprise. Traditionally, analytics was performed on data collected and stored beforehand (slow data), but nowadays analytics is performed on data flowing in real time and acted upon in real time to make a meaningful contribution to the business. The business outcome can take the form of acting on a customer's live Twitter stream to enhance their experience, or showing a personalized offer based on their recent actions on your website. In this chapter, we will mainly cover the data acquisition of real-time data in our Data Lake.

In Chapter 5, Data Acquisition of Batch Data using Apache Sqoop, we detailed what the Data Acquisition layer is, so we won't cover that again in this section. However, there is a significant difference between the data handled in Chapter 5, Data Acquisition of Batch Data with Apache...

Why Flume?


This section explains why we have chosen Flume as the technology with which to realize the Data Acquisition layer's capability of handling stream/real-time data.

In the following subsections, we will first dive into Flume's history and then into its advantages and disadvantages. The advantages detailed are the main reasons we chose this technology for transferring real-time data into Hadoop.

History of Flume

Apache Flume was developed by Cloudera for handling and moving large amounts of produced data into Hadoop. The company wanted the produced data to be moved into the Hadoop system with minimal or no delay (NRT: Near Real Time, or Real Time), so that various analyses could be carried out. That is how this beautiful piece of software came into existence.

As detailed in the previous section, it was initially conceived and developed to take care of a particular use case: collecting and aggregating log data from various sources (web servers) into Hadoop for...

Flume architecture principles


For any piece of technology to be successful, it should have clearly defined architecture principles on which its design is created and then evolved. Flume is one such piece of software, and the following are some of the architecture principles on which Flume was designed (some of them were introduced as part of Flume NG):

  • Reliability: The capability of continuously accepting stream data and events without losing any data, in a variety of failure scenarios (mostly partial failures). One of the core architecture principles taken very seriously by Flume is fault tolerance, which means that even if some components fail or misbehave, hardware issues pop up, or the network or bandwidth misbehaves, Flume accepts these as facts of life in most cases and carries on doing its main job without shutting down completely. Flume does guarantee that the data reaching the Flume Agent will eventually be handed over to other components as long as the agent is kept...

The Flume Architecture


In the previous section, we discussed the architecture principles on which Flume was conceived; now let's dive deep into the architecture itself. Let's start off with a very basic diagram detailing the architecture of Flume (Figure 06) and then keep diving deeper in the following sections.

Figure 06: Basic Flume Architecture

A simple Flume architecture has three important components, which work together to transfer data (stream or log data) from source to destination in real-time fashion. They are (a minimal wiring sketch follows the list):

  • Source: Responsible for listening for stream data or events and then putting them onto the channel
  • Channel: A pipe where events are stored until they are taken away by a sink
  • Sink: Responsible for taking events away from the channel for further processing (sending them on to another source) or persisting them to a data store. If a sink operation fails, it keeps retrying until it succeeds.
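
As a minimal sketch of how these three components are wired together (the agent and component names here are illustrative; a fuller configuration appears later in the Flume configuration section), the declaration looks like this:

# Illustrative wiring of one source, one channel, and one sink for an agent named agent1
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = snk1
# The source writes into the channel; the sink drains the same channel
agent1.sources.src1.channels = ch1
agent1.sinks.snk1.channel = ch1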

The following table summarizes some examples of each of the components in the Flume architecture...

Flume event - Stream Data


An event is the unit of data that is sent across the Flume pipeline. The structure of an event is quite simple and has two parts to it, namely:

  • Event header: Key/value pairs in the form Map<String, String>. Headers are meant to add metadata about the event; for example, they can hold the severity or priority of the event. They can also carry a UUID or event ID that distinguishes one event from another.
  • Event payload: An array of bytes in the form byte[]. 32 KB is the default maximum body size; the body is usually truncated beyond that figure, but this is a configurable value in Flume.

The following figure shows the internal structure of a Flume event, which hops from one agent to another in Flume:

Figure 13: Anatomy of a Flume event

Flume agent


A Flume agent is the smallest possible deployment unit, comprising a Source, a Channel, and a Sink as its main components. The following figure shows a typical Flume agent deployment:

Figure 14: Flume Agent components

A Flume agent is a Java daemon that receives events from a source and passes them on to a channel, where they are usually written to disk (according to the reliability level set), and then moves the events to the sink. When the sink has successfully handled an event, it sends an acknowledgement back to the channel, and the channel erases the event from its store. The agent has a very small default memory footprint (-Xmx20m) and can be controlled declaratively using configuration.

Flume agent configurations

Some of these aspects have already been touched upon in the Flume architecture section; however, we felt that a separate section on agent configuration is required. Since we don't want to repeat ourselves, we will refer some aspects back to that section.

The following are the main configurations...

Flume source


A Flume agent can have multiple sources, but it must have at least one source to function. A source is managed by a SourceRunner, which controls its threading and execution model, namely:

  • Event-driven
  • Polling

In the event-driven execution model, the source listens for and consumes events as they arrive. In the polling execution model, the source keeps polling for events and then deals with them.

An event (as detailed earlier) can carry a variety of content satisfying the event schema (header and payload). The source, in line with the architecture principle of extensibility, works on a plugin approach. Every source requires a name and a type; depending on the type, the source demands additional parameters, and the corresponding configuration has to be set for it to work properly (see the sketch after the following classification). A source can accept a single event or a batch of events (mostly, and ideally, a micro-batch as opposed to a regular batch). Built-in sources in Flume can be broadly classified as:

  1. Asynchronous sources: Client sending the...
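
As an illustration of the name/type/additional-parameters pattern described above, the following sketch configures a built-in spooling directory source; the agent name, directory path, and batch size are only examples:

# Spooling directory source: polls a local directory for completed files
a1.sources = r1
a1.sources.r1.type = spooldir
# Hypothetical directory to watch; it must exist before the agent starts
a1.sources.r1.spoolDir = /var/log/incoming
# Number of events handed to the channel per transaction (micro-batch)
a1.sources.r1.batchSize = 100
a1.sources.r1.channels = c1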

Flume Channel


A channel is the mechanism used by the Flume agent to transfer data from source to sink. Events are persisted in the channel and reside there until they are delivered/taken away by a sink. This persistence in the channel allows the sink to retry each event in case there is a failure while persisting data to the real store (for example, HDFS).

Channels can be broadly categorized into two types (a configuration sketch follows this list):

  1. In-memory: Events are available only for as long as the channel component is alive:
    • Queue: An in-memory queue in the channel. This has the lowest processing latency because events are held in memory.
  2. Durable: Even after the component dies, the persisted events remain available, and when the component comes back online, these events will be processed:
    • File (WAL, or Write-Ahead Log): The most commonly used channel type. It is durable; for reliability the disk should be RAID, SAN, or similar.
    • JDBC: An RDBMS-backed channel that provides ACID compliance.
    • Kafka: Events are stored in a Kafka cluster.
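
The following sketch (with illustrative names, sizes, and paths; the property names are standard Flume channel settings) contrasts how the two categories are declared:

# In-memory channel: fastest, but events are lost if the agent dies
a1.channels.c1.type = memory
# Maximum number of events the channel can hold
a1.channels.c1.capacity = 10000

# Durable file channel: events survive an agent restart
a1.channels.c2.type = file
# Hypothetical local paths for the checkpoint and write-ahead log data
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data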

There is another special channel called...

Flume sink


Similar to the source, the sink is managed by a SinkRunner, which manages its thread and execution model. Unlike a source, however, a sink is polling-based and polls the channel for events. The sink is the component that outputs events (according to the type of output required) from the agent to an external store or to another source. Sinks also participate in transaction management: when the output from a sink is successful, an acknowledgement is passed back to the channel, which then removes the event from its persistence mechanism. Transaction management will be covered in detail in a separate section.

There are a variety of existing sinks available, as follows:

  • HDFS: Writes to HDFS. This currently supports writing text and sequence files (in compressed formats as well). The following is the beginning of a sample HDFS sink configuration (taken from the Flume user guide) for an agent named a1; the full configuration can be found in the Flume user guide (https://flume.apache.org):
a1.channels = c1
a1.sinks = k1...
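
Since the preceding snippet is cut off, here is a slightly fuller, hedged sketch of what an HDFS sink configuration typically looks like; the path and roll settings are illustrative, not the book's actual values, and the Flume user guide remains the authoritative reference:

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Hypothetical target directory in HDFS
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
# Write plain text instead of the default SequenceFile format
a1.sinks.k1.hdfs.fileType = DataStream
# Roll the output file every 10 minutes or 128 MB, whichever comes first
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0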

Flume configuration


Flume can be fully configured using the Flume configuration file. A single image speaks more than a thousand words, so we would like to explain Flume configuration using the following figure. An exhaustive treatment of Flume configuration is out of the scope of this book, but we will explain some core aspects of how Flume can be configured, and this can serve as a base for understanding a full-fledged configuration.

Figure 17: Flume Configuration Tree (sample)

The next code block expresses the preceding configuration tree figure as an actual configuration:

# Active Flume Components
flumeAgent1.sources=source1
flumeAgent1.channels=channel1
flumeAgent1.sinks=sink1

# Define and Configure Source 1
flumeAgent1.sources.source1.type=netcat
flumeAgent1.sources.source1.channels=channel1
flumeAgent1.sources.source1.bind=127.0.0.1
flumeAgent1.sources.source1.port=10010

# Define and Configure Sink 1
flumeAgent1.sinks.sink1.type=logger
flumeAgent1.sinks.sink1.channel=channel1

# Define and Configure Channel 1
flumeAgent1.channels.channel1...
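
The channel definition is cut off above; in a minimal netcat-to-logger setup such as this one, it would typically be completed with an in-memory channel along the following lines (our assumption, not the book's original text):

# Define and Configure Channel 1 (assumed completion of the truncated block)
flumeAgent1.channels.channel1.type=memory
flumeAgent1.channels.channel1.capacity=1000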

Flume transaction management


Throughout the previous sections we have indeed touched on transaction aspects at various stages. The following figure summarizes these discussions in a more pictorial fashion:

Figure 18: Transaction management in Flume (Source Tx and Sink Tx)

The figure shows that incoming data from a client or a previous agent's sink starts a transaction in the present agent; this is termed the Source Tx in the figure. The Source Tx ends as soon as the event is persisted in the channel and the acknowledgement is received.

Within the agent, a second transaction, termed the Sink Tx, then kicks in. It starts with the data being polled by the sink, and when the data is successfully transferred, the channel uses the acknowledgement to remove the data from its store.

Flume thus has transaction management at every stage, and according to the use case, various reliability levels can be set on the channel, which decide how the transaction behaviour (Sink Tx) is realized.
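
The knobs that shape this behaviour live on the channel. As a hedged sketch (illustrative values on a memory channel), the per-transaction batch size and the time a put or take will wait are controlled as follows:

# Each Source Tx (put) and Sink Tx (take) is bounded by transactionCapacity
a1.channels.c1.transactionCapacity = 1000
# Seconds a put/take waits for free space/available data before the transaction fails
a1.channels.c1.keep-alive = 3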

Other Flume components


In addition to the main components in Flume, there are other very important components, which will be discussed in some detail in this section. The following figure shows all of these components working together:

Figure 19: Other Flume components working together

The following subsections go deeper into the working and responsibility of each of the components in the preceding figure (Figure 19). Let's get started and understand how these components will help you design the right Flume component arrangement to execute your use case successfully.

Channel processor

As shown in Figure 19, the source sends events to the channel processor. Every source has its own channel processor; to persist an event into the channel, the source delegates the work to the channel processor, which does the actual job of persisting according to the channel type.

Interceptor

As seen in the preceding figure, the channel processor then passes the events to the interceptor. Channel...
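
As a hedged sketch of how interceptors are attached to a source (the interceptor names and the static header key/value are illustrative), Flume's built-in timestamp and static interceptors can be chained like this:

# Attach two interceptors to source r1; they run in the listed order
a1.sources.r1.interceptors = i1 i2
# Adds a 'timestamp' header holding the event's processing time
a1.sources.r1.interceptors.i1.type = timestamp
# Adds a fixed header to every event (key and value are examples)
a1.sources.r1.interceptors.i2.type = static
a1.sources.r1.interceptors.i2.key = datacenter
a1.sources.r1.interceptors.i2.value = DC1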

Context Routing


As explained earlier, an event has two main parts, namely the header and the payload. Header (key/value pair) values can be used to define routing. The two components where the routing selection can be decided are:

  • Channel: A channel can be selected according to header values. A custom component, namely a channel selector, can be written containing the code to select the channel desired for your use case.
  • Sink: As before, header values can be used to decide on the right sink. Also, different operations can be performed within the sink by writing a custom sink that does whatever your use case requires. There are also some default header values that can be used to do more sophisticated things for your chosen use case.

Basically, you can introduce any number of headers, and your components can use them to take the right action; in this way, the Flume components behave dynamically in all respects.
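
Header-based channel routing is configured through a channel selector. A hedged sketch using Flume's built-in multiplexing selector (the header name, its values, and the channel names are illustrative) looks like this:

# Route events to different channels based on the 'priority' header
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = priority
# Events with priority=high go to c1, priority=low go to c2
a1.sources.r1.selector.mapping.high = c1
a1.sources.r1.selector.mapping.low = c2
# Anything else also goes to c2
a1.sources.r1.selector.default = c2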

Flume working example


In this section, as always throughout this part of the book, we will cover a full working example of the technology; towards the end, there will be a dedicated section that covers how our SCV use case is implemented, showing real code snippets.

Installation and Configuration

This step details most of the installation work that has to be done to get Flume working. It is a prerequisite to be dealt with before moving on.

Step 1: Installing and verifying Flume

In this section we will install Apache Flume and then verify its installation. Follow the given steps for complete installation:

  1. Download the Apache Flume binary distribution with the following command; we will be using the current version of Apache Flume, which is 1.7.0.
wget http://www-us.apache.org/dist/flume/1.7.0/apache-flume-1.7.0-bin.tar.gz
  2. Once downloaded, change to the directory where you want to extract the contents and extract the archive using the following command:
tar -zxvf ${DOWNLOAD_DIRECTORY}/apache-flume...

When to use Flume


Some of the considerations you can use when deciding whether Flume suits your use case are as follows - choose Flume when you want:

  • To acquire data from a variety of sources and store it in a Hadoop system
  • To move high-velocity and high-volume data into a Hadoop system
  • Reliable delivery of data to the destination
  • A scalable solution that can grow quite easily just by adding more machines when the velocity and volume of data increase
  • The capability to dynamically configure the various components of the architecture without incurring downtime
  • A single point of contact for all the various configurations on which the overall architecture functions

When not to use Flume


In some scenarios, Flume is not the ideal choice, and other options can be employed to solve the use case instead. Do not choose Flume when:

  • You need data processing more than data transfer; such use cases are better suited to dedicated stream processing technologies.
  • Your scenarios are mostly batch data transfer (regular batch as against micro-batch).
  • You need a more highly available setup with no data loss.
  • You need durable messaging with very high scalability requirements (there isn't a scientific quantitative figure for this, though).
  • You have a huge number of consumers, as this has a very high impact on Flume's scalability.

Even though Flume can be dynamically configured in many cases, certain configuration changes (topology changes) do incur downtime.

Other options


As always, this doesn't mean that Apache Flume is the only option that can be used to solve the problem at hand. We chose Flume on its merits and advantages, especially considering our SCV use case. There are other options that can be considered, and these are discussed briefly in this section.

Apache Flink

Apache Flume is used mainly for its data acquisition capability. We will use Flume to transfer data from source systems sending stream data to the messaging layer (for further processing) and all the way into HDFS.

For transferring stream data all the way to HDFS, Apache Flume is the best fit. However, ingesting stream data and then processing it is one of the main use cases for Apache Flink, which has additional features suited to this.

This doesn't mean that Apache Flink cannot be used for transferring data to HDFS; it does have the mechanism, but there won't be as many built-in capabilities. It has many features compared to Flume, but they are more on...

Summary


First of all, a pat on the back for coming this far. We have now covered the technologies that we are going to use in our Data Lake's first layer, namely the Data Acquisition layer. Even though we have covered just two technologies (we won't claim to have covered these topics in depth, but we have covered them in some breadth and in alignment with our use case implementation), we have covered a fair distance in our journey to implement a Data Lake for your enterprise.

In this chapter, similar to the other chapters in this part, we first set the context by seeing where exactly this technology is placed in the overall Data Lake architecture. We then gave enough detail on why we chose Apache Flume as the technology for handling stream data from source systems.

After that, we went deep into Apache Flume and learned its main concepts and workings. We then looked at a full-fledged working example of Flume, in line with our SCV use case. Before wrapping up, we put in bullet points when...
