Codecs (coder/decoders) are used to compress and decompress data using various compression algorithms. Flume supports gzip, bzip2, lzo, and snappy, although you might have to install lzo yourself, especially if you are using a distribution such as CDH, due to licensing issues.
If you want the HDFS sink to write compressed files, set the hdfs.codeC property. The property value is also used as the file suffix for the files written to HDFS. For example, if you specify the following, all files that are written will have a .gzip extension, so you don't need to specify the hdfs.fileSuffix property in this case:
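A minimal sketch of such a configuration (the agent name `agent` and sink name `k1` are illustrative assumptions, not from the original text):

```properties
# Tell the HDFS sink to gzip-compress its output; the codec name
# doubles as the file suffix, so files are written with a .gzip extension.
agent.sinks.k1.hdfs.codeC=gzip
```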
The codec you choose to use will require some research on your part. There are arguments for using gzip or bzip2 for their higher compression ratios at the cost of longer compression times, especially if your data is written once but will be read hundreds or thousands of times. On the other hand,...
An Event Serializer is the mechanism by which a FlumeEvent is converted into another format for output. It is similar in function to the Layout class in log4j. By default, the text serializer, which outputs just the Flume event body, is used. There is another serializer, header_and_text, which outputs both the headers and the body. Finally, there is an avro_event serializer that can be used to create an Avro representation of the event. If you write your own, you'd use the implementation's fully qualified class name as the serializer property value.
As mentioned previously, the default serializer is the text serializer. It outputs only the Flume event body, with the headers discarded. A newline character is appended to each event unless you override this default behavior by setting the serializer.appendNewLine property to false.
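As a sketch, the two properties discussed above might be set like this (the agent name `agent` and sink name `k1` are illustrative assumptions):

```properties
# Explicitly select the default text serializer (body only, headers dropped)
agent.sinks.k1.serializer=text
# Suppress the newline normally appended after each event body
agent.sinks.k1.serializer.appendNewLine=false
```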
In order to remove single points of failure in your data processing pipeline, Flume can send events to different sinks using either load balancing or failover. To do this, we need to introduce a new concept called a sink group. A sink group is used to create a logical grouping of sinks. The behavior of this grouping is dictated by something called the sink processor, which determines how events are routed.
There is a default sink processor that contains a single sink, and it is used whenever you have a sink that isn't part of any sink group. Our Hello, World! example in Chapter 2, A Quick Start Guide to Flume, used the default sink processor. No special configuration is required for single sinks.
In order for Flume to know about the sink groups, there is a new top-level agent property called sinkgroups. As with sources, channels, and sinks, you prefix the property with the agent name:
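A minimal sketch of such a definition (the agent name `agent` and the sink names `k1` and `k2` are assumptions for illustration):

```properties
# Declare a sink group named sg1 on this agent
agent.sinkgroups=sg1
# List the sinks that belong to the group (space-separated)
agent.sinkgroups.sg1.sinks=k1 k2
```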
Here, we have defined a sink group called sg1 for...
HDFS is not the only useful place to send your logs and data. Solr is a popular real-time search platform used to index large amounts of data, so full text searching can be performed almost instantaneously. Hadoop's horizontal scalability creates an interesting problem for Solr, as there is now more data than a single instance can handle. For this reason, a horizontally scalable version of Solr was created, called SolrCloud. Cloudera's Search product is also based on SolrCloud, so it should be no surprise that Flume developers created a new sink specifically to write streaming data into Solr.
Like most streaming data flows, you not only transport the data but also often reformat it into a form more easily consumed by the target of the flow. In a Flume-only workflow, this is typically done by applying one or more interceptors just prior to the sink writing the data to the target system. This sink uses the Morphline engine to transform the data instead of interceptors.
Internally...
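To make the preceding discussion concrete, here is a sketch of what a MorphlineSolrSink configuration might look like; the agent name, sink and channel names, file path, and morphline ID are all illustrative assumptions:

```properties
# Use the Morphline-based Solr sink rather than interceptor transforms
agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.k1.channel=c1
# Path to the morphline configuration file (hypothetical location)
agent.sinks.k1.morphlineFile=/etc/flume/conf/morphline.conf
# Which morphline within that file to run (hypothetical ID)
agent.sinks.k1.morphlineId=morphline1
```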
Another common target for streaming data that will be searched in near real time (NRT) is Elasticsearch. Elasticsearch is also a clustered searching platform based on Lucene, like Solr. It is often used along with the Logstash project (to create structured logs) and the Kibana project (a web UI for searches). This trio is often referred to by the acronym ELK (Elasticsearch/Logstash/Kibana).
Note
Here are the project home pages for the ELK stack that can give you a much better overview than I can in a few short pages:
In Elasticsearch, data is grouped into indices. You can think of these as being equivalent to databases in a single MySQL installation. The indices are composed of types (similar to tables in databases), which are made up of documents. A document is like a single row in a database, so each Flume event will become a single document in Elasticsearch. Documents have...
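A sketch of an ElasticSearchSink configuration mapping onto these concepts (the agent, sink, and channel names, hostname, and cluster name are illustrative assumptions):

```properties
agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.k1.channel=c1
# Comma-separated Elasticsearch nodes to connect to (hypothetical host)
agent.sinks.k1.hostNames=es1.example.com:9300
# Index and type that each Flume event (document) is written into
agent.sinks.k1.indexName=flume
agent.sinks.k1.indexType=logs
agent.sinks.k1.clusterName=elasticsearch
```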
In this chapter, we covered the HDFS sink in depth, which writes streaming data into HDFS. We covered how Flume can separate data into different HDFS paths based on time or contents of Flume headers. Several file-rolling techniques were also discussed, including time rotation, event count rotation, size rotation, and rotation on idle only.
Compression was discussed as a means to reduce storage requirements in HDFS and should be used when possible. Besides storage savings, it is often faster to read a compressed file and decompress it in memory than it is to read an uncompressed file. This will result in performance improvements in MapReduce jobs run on this data. The splittability of compressed data was also covered as a factor in deciding when and which compression algorithm to use.
Event Serializers were introduced as the mechanism by which Flume events are converted into an external storage format, including text (body only), text and headers (headers and body), and Avro serialization...