Codecs (coder/decoders) are used to compress and decompress data using various compression algorithms. Flume supports gzip, bzip2, lzo, and snappy, although you might have to install lzo yourself, especially if you are using a distribution such as CDH, due to licensing issues.
If you want the HDFS sink to write compressed files, set the hdfs.codeC property. The property value is also used as the file suffix for the files written to HDFS. For example, if you specify the following, all files that are written will have a .gzip extension, so you don't need to specify the hdfs.fileSuffix property in this case:
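A minimal sketch of such a configuration (the agent name `agent` and sink name `k1` are illustrative assumptions, not from the original text):

```properties
# Tell the HDFS sink to gzip-compress its output; the codec name
# doubles as the file suffix, so files are written with a .gzip extension.
agent.sinks.k1.hdfs.codeC=gzip
```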
The codec you choose to use will require some research on your part. There are arguments for using gzip or bzip2 for their higher compression ratios at the cost of longer compression times, especially if your data is written once but will be read hundreds or thousands of times. On the other hand,...
An Event Serializer is the mechanism by which a FlumeEvent is converted into another format for output. It is similar in function to the Layout class in log4j. By default, the text serializer, which outputs just the Flume event body, is used. There is another serializer, header_and_text, which outputs both the headers and the body. Finally, there is an avro_event serializer that can be used to create an Avro representation of the event. If you write your own, you'd use the implementation's fully qualified class name as the serializer property value.
As mentioned previously, the default serializer is the text serializer. It outputs only the Flume event body, with the headers discarded. A newline character is appended to each event unless you override this default behavior by setting the serializer.appendNewLine property to false.
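As a sketch, the two properties discussed above might be set like this (the agent name `agent` and sink name `k1` are illustrative assumptions):

```properties
# Explicitly select the default text serializer (body only, headers dropped)
agent.sinks.k1.serializer=text
# Suppress the newline normally appended after each event body
agent.sinks.k1.serializer.appendNewLine=false
```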
In order to remove single points of failure in your data processing pipeline, Flume can send events to different sinks using either load balancing or failover. To do this, we need to introduce a new concept called a sink group. A sink group is used to create a logical grouping of sinks. The behavior of this grouping is dictated by something called the sink processor, which determines how events are routed.
There is a default sink processor that contains a single sink, and it is used whenever you have a sink that isn't part of any sink group. Our Hello, World! example in Chapter 2, A Quick Start Guide to Flume, used the default sink processor. No special configuration is required for single sinks.
In order for Flume to know about the sink groups, there is a new top-level agent property called sinkgroups. As with sources, channels, and sinks, you prefix the property with the agent name:
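A minimal sketch of such a definition (the agent name `agent` and the sink names `k1` and `k2` are assumptions for illustration):

```properties
# Declare a sink group named sg1 on this agent
agent.sinkgroups=sg1
# List the sinks that belong to the group (space-separated)
agent.sinkgroups.sg1.sinks=k1 k2
```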
Here, we have defined a sink group called sg1 for...
HDFS is not the only useful place to send your logs and data. Solr is a popular real-time search platform used to index large amounts of data, so full text searching can be performed almost instantaneously. Hadoop's horizontal scalability creates an interesting problem for Solr, as there is now more data than a single instance can handle. For this reason, a horizontally scalable version of Solr was created, called SolrCloud. Cloudera's Search product is also based on SolrCloud, so it should be no surprise that Flume developers created a new sink specifically to write streaming data into Solr.
Like most streaming data flows, you not only transport the data but also often reformat it into a form more easily consumed by the target of the flow. In a Flume-only workflow, this is typically done by applying one or more interceptors just prior to the sink writing the data to the target system. This sink uses the Morphline engine to transform the data instead of interceptors.
Internally...
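To make the preceding discussion concrete, here is a sketch of what a MorphlineSolrSink configuration might look like; the agent name, sink and channel names, file path, and morphline ID are all illustrative assumptions:

```properties
# Use the Morphline-based Solr sink rather than interceptor transforms
agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.k1.channel=c1
# Path to the morphline configuration file (hypothetical location)
agent.sinks.k1.morphlineFile=/etc/flume/conf/morphline.conf
# Which morphline within that file to run (hypothetical ID)
agent.sinks.k1.morphlineId=morphline1
```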
Another common target for streaming data that will be searched in near real time (NRT) is Elasticsearch. Elasticsearch is also a clustered searching platform based on Lucene, like Solr. It is often used along with the Logstash project (to create structured logs) and the Kibana project (a web UI for searches). This trio is often referred to by the acronym ELK (Elasticsearch/Logstash/Kibana).
Note
Here are the project home pages for the ELK stack that can give you a much better overview than I can in a few short pages:
In Elasticsearch, data is grouped into indices. You can think of these as being equivalent to databases in a single MySQL installation. The indices are composed of types (similar to tables in databases), which are made up of documents. A document is like a single row in a database, so each Flume event will become a single document in Elasticsearch. Documents have...
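A sketch of an ElasticSearchSink configuration mapping onto these concepts (the agent, sink, and channel names, hostname, and cluster name are illustrative assumptions):

```properties
agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.k1.channel=c1
# Comma-separated Elasticsearch nodes to connect to (hypothetical host)
agent.sinks.k1.hostNames=es1.example.com:9300
# Index and type that each Flume event (document) is written into
agent.sinks.k1.indexName=flume
agent.sinks.k1.indexType=logs
agent.sinks.k1.clusterName=elasticsearch
```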
In this chapter, we covered the HDFS sink in depth, which writes streaming data into HDFS. We covered how Flume can separate data into different HDFS paths based on time or contents of Flume headers. Several file-rolling techniques were also discussed, including time rotation, event count rotation, size rotation, and rotation on idle only.
Compression was discussed as a means to reduce storage requirements in HDFS and should be used when possible. Besides storage savings, it is often faster to read a compressed file and decompress it in memory than it is to read an uncompressed file. This will result in performance improvements in MapReduce jobs run on this data. The splittability of compressed data was also covered as a factor in deciding when and which compression algorithm to use.
Event Serializers were introduced as the mechanism by which Flume events are converted into an external storage format, including text (body only), text and headers (headers and body), and Avro serialization...