You're reading from Apache Flume: Distributed Log Collection for Hadoop
Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available, along with the checksum and GPG signature files used to verify them. Instructions to verify the download are on the website, so I won't cover them in detail here. Checking the checksum file contents against the actual checksum of the archive verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea, and Apache recommends that you do so. If you choose not to, I won't tell.
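If you do want to verify, the check usually looks something like the following. This is a sketch only: the 1.9.0 filenames are an assumed example (substitute the version you actually downloaded), it assumes the .sha512 and .asc files were downloaded next to the archive, and it assumes the checksum file is in a format that sha512sum's check mode accepts. The KEYS file containing the release signers' public keys is published alongside the archives on the Apache distribution site.

```shell
# Verify the checksum (filenames are an example; use your downloaded version)
sha512sum -c apache-flume-1.9.0-bin.tar.gz.sha512

# Import the release signing keys, then verify the GPG signature
gpg --import KEYS
gpg --verify apache-flume-1.9.0-bin.tar.gz.asc apache-flume-1.9.0-bin.tar.gz
```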
The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source...
Now that we've downloaded Flume, let's spend some time going over how to configure an agent.
A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you also need to pass an agent identifier (called a name) so that the agent knows which configuration to use. In the examples where I'm only specifying one agent, I'm going to use the name agent.
Note
By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except perhaps in development situations).
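For example, launching an agent from the Flume install directory might look like this. This is a sketch: conf/hello.conf is a hypothetical path to your property file, and agent is the name passed with -n; the -Dflume.root.logger override sends the Log4j output to the console so you can see what the agent is doing.

```shell
# Start an agent named "agent" using conf/hello.conf (hypothetical path),
# with configuration reloading disabled and log output on the console
bin/flume-ng agent -n agent -c conf -f conf/hello.conf \
    --no-reload-conf -Dflume.root.logger=INFO,console
```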
No technical book would be complete without a Hello, World! example. Here is the configuration file we'll be using:
agent.sources=s1
agent.channels=c1
agent.sinks=k1
agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.channels.c1.type=memory
agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1
Here, I've defined one agent (called agent) that has a source named s1, a channel named c1, and a sink named k1.
The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for the bind address (the Java convention for listening on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to; in this case, c1. It is plural, because you can configure...
In this chapter, we covered how to download the Flume binary distribution. We created a simple configuration file with one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and send event data. Those events were written to an in-memory channel and then fed to a Log4j sink for output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out.
In the next chapter, we'll take a detailed look at the two major channel types you'll most likely use in your data processing workflows: the memory channel and the file channel.
We will also take a look at a new experimental channel, introduced in Version 1.5 of Flume, called the Spillable Memory Channel, which attempts to be a hybrid of the other two.
For each type, we'll discuss all the configuration knobs available to...