Apache Flume: Distributed Log Collection for Hadoop

If your role includes moving datasets into Hadoop, this book will help you do it more efficiently using Apache Flume. From installation to customization, it’s a complete step-by-step guide on making the service work for you.

Steve Hoffman

Book Details

ISBN 13: 9781782167914
Paperback: 108 pages

Book Description

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers the problems that arise when streaming data and logs into HDFS, and how Flume can resolve them. It explains the generalized architecture of Flume, which covers moving data to and from databases and NoSQL-style data stores as well as optimizing performance. It also includes real-world scenarios of Flume implementations.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and compilation of Flume.

It introduces channels and channel selectors. For each architectural component (sources, channels, sinks, channel processors, sink groups, and so on), the available implementations and their configuration options are covered in detail, so you can customize Flume to your specific needs. The book also gives pointers on writing your own custom implementations.

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.
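As a rough illustration of the source, channel, and sink roles described above, here is a minimal sketch in Python. It is a toy model only: Flume itself is configured through properties files and runs on the JVM, and every name below (`Event`, `MemoryChannel`, `run_source`, `run_sink`) is hypothetical rather than part of Flume's API. The one detail taken from Flume is that an event carries a byte payload plus string headers.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    # Toy stand-in for a Flume event: a byte payload plus string headers.
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Hypothetical bounded in-memory buffer sitting between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        # Refuse new events once the buffer is full, so the source sees the failure.
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


def run_source(lines, channel):
    """Toy source: wrap each log line in an Event and hand it to the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8")))


def run_sink(channel, destination):
    """Toy sink: drain the channel into a list that stands in for HDFS."""
    delivered = 0
    while True:
        event = channel.take()
        if event is None:
            break
        destination.append(event.body.decode("utf-8"))
        delivered += 1
    return delivered


channel = MemoryChannel(capacity=100)
store = []
run_source(["GET /index.html 200", "GET /missing 404"], channel)
delivered = run_sink(channel, store)
```

The channel decouples the two sides: the source only needs room in the buffer, and the sink drains it at its own pace. A memory-backed buffer like this one loses its contents if the process stops, which is the trade-off behind choosing between a memory and a file-backed channel.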

Table of Contents

Chapter 1: Overview and Architecture
  • Flume 0.9
  • Flume 1.X (Flume-NG)
  • The problem with HDFS and streaming data/logs
  • Sources, channels, and sinks
  • Flume events
  • Summary
Chapter 2: Flume Quick Start
  • Downloading Flume
  • Flume configuration file overview
  • Starting up with "Hello World"
  • Summary
Chapter 3: Channels
  • Memory channel
  • File channel
  • Summary
Chapter 4: Sinks and Sink Processors
  • HDFS sink
  • Compression codecs
  • Event serializers
  • Sink groups
  • Summary
Chapter 5: Sources and Channel Selectors
  • The problem with using tail
  • The exec source
  • The spooling directory source
  • Syslog sources
  • Channel selectors
  • Summary
Chapter 6: Interceptors, ETL, and Routing
  • Interceptors
  • Tiering data flows
  • Routing
  • Summary
Chapter 7: Monitoring Flume
  • Monitoring the agent process
  • Monitoring performance metrics
  • Summary
Chapter 8: There Is No Spoon – The Realities of Real-time Distributed Data Collection
  • Transport time versus log time
  • Time zones are evil
  • Capacity planning
  • Considerations for multiple data centers
  • Compliance and data expiry
  • Summary

What You Will Learn

  • Understand the Flume architecture
  • Download and install open source Flume from Apache
  • Discover when to use a memory or file-backed channel
  • Understand and configure the Hadoop File System (HDFS) sink
  • Learn how to use sink groups to create redundant data flows
  • Configure and use various sources for ingesting data
  • Inspect data records and route to different or multiple destinations based on payload content
  • Transform data en-route to Hadoop
  • Monitor your data flows
