If you are reading this book, chances are you are swimming in mountains of data. Creating mountains of data has become very easy, thanks to Facebook, Twitter, Amazon, digital cameras and camera phones, YouTube, Google, and just about anything else you can think of connected to the Internet. As a provider of a website, 10 years ago, your application logs were only used to help you troubleshoot your website. Today, that same data can provide valuable insight into your business and customers if you know how to pan gold out of your river of data.
Furthermore, since you are reading this book, you are also aware that Hadoop was created to solve (partially) the problem of sifting through mountains of data. Of course, this only works if you can reliably load your Hadoop cluster with data for your data scientists to pick apart.
% hadoop fs --put data.csv .
This works great when you have all your data neatly packaged and ready to upload.
But your website is creating data all the time. How often should you batch load data to HDFS? Daily? Hourly? Whatever processing period you choose, eventually somebody always asks, "can you get me the data sooner?" What you really need is a solution that can deal with streaming logs/data.
Turns out you aren't alone in this need. Cloudera, a provider of professional services for Hadoop as well as their own distribution of Hadoop, saw this need over and over while working with their customers. Flume was created to meet this need and create a standard, simple, robust, flexible, and extensible tool for data ingestion into Hadoop.
Flume was first introduced in Cloudera's CDH3 Distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system). From the master you could check agent status in a Web UI, as well as push out configuration centrally from the UI or via a command line shell (both really communicating via Zookeeper to the worker agents).
Data could be sent in one of the three modes, namely, best effort (BE), disk failover (DFO), and end-to-end (E2E). The masters were used for the end-to-end (E2E) mode acknowledgements and multi-master configuration never really matured so usually you had only one master making it a central point of failure for E2E data flows. Best effort is just what it sounds like—the agent would try and send the data, but if it couldn't, the data would be discarded. This mode is good for things like metrics where gaps can easily be tolerated, as new data is just a second away. Disk failover mode stores undeliverable data to the local disk (or sometimes a local database) and keeps retrying until the data can be delivered to the next recipient in your data flow. This is handy for those planned (or unplanned) outages as long as you have sufficient local disk space to buffer the load.
In June of 2011, Cloudera moved control of the Flume project to the Apache foundation. It came out of incubator status a year later in 2012. During that incubation year, work had already begun to refactor Flume under the Star Trek Themed tag, Flume-NG (Flume the Next Generation).
There were many reasons to why Flume was refactored. If you are interested in the details you can read about it at https://issues.apache.org/jira/browse/FLUME-728. What started as a refactoring branch eventually became the main line of development as Flume 1.X.
The most obvious change in Flume 1.X is that the centralized configuration master/masters and Zookeeper are gone. The configuration in Flume 0.9 was overly verbose and mistakes were easy to make. Furthermore, centralized configuration was really outside the scope of Flume's goals. Centralized configuration was replaced with a simple on-disk configuration file (although the configuration provider is pluggable so that it can be replaced). These configuration files are easily distributed using tools such as cf-engine, chef, and puppet. If you are using a Cloudera Distribution, take a look at Cloudera Manager to manage your configurations—their licensing was recently changed to lift the node limit so it may be an attractive option for you. Be sure you don't manage these configurations manually or you'll be editing those files manually forever.
Another major difference in Flume 1.X is that the reading of input data and the writing of output data are now handled by different worker threads (called Runners). In Flume 0.9, the input thread also did the writing to the output (except for failover retries). If the output writer was slow (rather than just failing outright), it would block Flume's ability to ingest data. This new asynchronous design leaves the input thread blissfully unaware of any downstream problem.
The version of Flume covered in this book is 1.3.1 (current at the time of this book's writing).
HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, for example being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.
In a regular Portable Operating System Interface (POSIX) style filesystem, if you open a file and write data, it still exists on disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to disk. Furthermore, if that writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).
In HDFS the file exists only as a directory entry, it shows as having zero length until the file is closed. This means if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so you can close them as soon as possible.
The problem is Hadoop doesn't like lots of tiny files. Since the HDFS metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce prospective, tiny files lead to poor efficiency. Usually, each mapper is assigned a single block of a file as input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionally high compared to the data it is processing. This kind of block fragmentation also results in more mapper tasks increasing the overall job run times.
These factors need to be weighed when determining the rotation period to use when writing to HDFS. If the plan is to keep the data around for a short time, then you can lean towards the smaller file size. However, if you plan on keeping the data for very long time, you can either target larger files or do some periodic cleanup to compact smaller files into fewer larger files to make them more MapReduce friendly. After all, you only ingest the data once, but you might run a MapReduce job on that data hundreds or thousands of times.
The Flume agent's architecture can be viewed in this simple diagram. An input is called a source and an output is called a sink. A channel provides the glue between a source and a sink. All of these run inside a daemon called an agent.
The headers are key/value pairs that can be used to make routing decisions or carry other structured information (such as the timestamp of the event or hostname of the server where the event originated). You can think of it as serving the same function as HTTP headers—a way to pass additional information that is distinct from the body.
The body is an array of bytes that contains the actual payload. If your input is comprised of tailed logfiles, the array is most likely a UTF-8 encoded
String containing a line of text.
Flume may add additional headers automatically (for example, when a source adds the hostname where the data is sourced or creating an event's timestamp), but the body is mostly untouched unless you edit it en-route using interceptors.
An interceptor is a point in your data flow where you can inspect and alter Flume events. You can chain zero or more interceptors after a source creates an event or before a sink sends the event wherever it is destined. If you are familiar with the AOP Spring Framework, it is similar to a
MethodInterceptor. In Java Servlets it is similar to a
ServletFilter. Here's an example of what using four chained interceptors on a source might look like:
Channel selectors are responsible for how data moves from a source to one or more channels. Flume comes packaged with two channel selectors, which cover most use cases you might have, although you can write your own if needed. A replicating channel selector (the default) simply puts a copy of the event into each channel assuming you have configured more than one. In contrast, a multiplexing channel selector can write to different channels depending on certain header information. Combined with Interceptor logic, this duo forms the foundation for routing input to different channels.
Finally, a sink processor is the mechanism by which you can create failover paths for your sinks or load balance events across multiple sinks from a channel.
You can chain your Flume agents depending on your particular use case. For example, you may want to insert an agent in a tiered fashion to limit the number of clients trying to connect directly to your Hadoop cluster. More likely your source machines don't have sufficient disk space to deal with a prolonged outage or maintenance window, so you create a tier with lots of disk space between your sources and your Hadoop cluster.
In the following diagram you can see there are two places data is created (on the left) and two final destinations for the data (the HDFS and ElasticSearch cloud bubbles on the right). To make things more interesting, let's say one of the machines generates two kinds of data (let's call them square and triangle data). You can see in the lower-left agent we use a multiplexing channel selector to split the two kinds of data into different channels. The rectangle channel is then routed to the agent in the upper-right corner (along with the data coming from the upper-left agent). The combined volume of events is written together in HDFS in datacenter 1. Meanwhile the triangle data is sent to the agent that writes to ElasticSearch in datacenter 2. Keep in mind that the data transformations can occur after any source or before any sink. How all of these components can be used to build complicated data workflows will become clear as the book proceeds.
In this chapter we discussed the problem that Flume is attempting to solve; getting data into your Hadoop cluster for data processing in an easily configured and reliable way. We also discussed the Flume agent and its logical components including: events, sources, channel selectors, channels, sink processors, and sinks.
The next chapter will cover these in more detail, specifically the most commonly used implementations of each. Like all good open source projects, almost all of these components are extensible if the bundled ones don't do what you need them to do.