You're reading from Apache Flume: Distributed Log Collection for Hadoop

Product type Book

Published in Feb 2015

Publisher

ISBN-13 9781784392178

Pages 178 pages

Edition 1st Edition

Languages

Java

Concepts

Data Processing

Author (1):

Steven Hoffman

Table of Contents (16) Chapters

Apache Flume: Distributed Log Collection for Hadoop Second Edition

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Overview and Architecture

A Quick Start Guide to Flume

Channels

Sinks and Sink Processors

Sources and Channel Selectors

Interceptors, ETL, and Routing

Putting It All Together

Monitoring Flume

There Is No Spoon – the Realities of Real-time Distributed Data Collection

Index

Flume 0.9

Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via ZooKeeper (a federated configuration and coordination system). From the master, you could check agent status in a web UI and push out configuration centrally, either from the UI or via a command-line shell (both of which actually communicated with the worker agents through ZooKeeper).

Data could be sent in one of three modes: Best Effort (BE), Disk Failover (DFO), and End-to-End (E2E). The masters were used for E2E mode acknowledgements, and multimaster configuration never really matured, so you usually had only one master, making it a central point of failure for E2E data flows. The BE mode was just what it sounds like: the agent would try to send the data, but if it couldn't, the data would be discarded. This mode was good for things such as metrics, where gaps can easily be tolerated because new data is just a second away. The DFO mode stored undeliverable data on the local disk (or sometimes in a local database) and kept retrying until the data could be delivered to the next recipient in your data flow. This was handy for planned (or unplanned) outages, as long as you had sufficient local disk space to buffer the load.
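To make the difference between the BE and DFO modes concrete, here is a minimal Java sketch of the two behaviors. It is a toy model built only from the description above, not Flume 0.9 code: the class names (`NextHop`, `BestEffortAgent`, `DiskFailoverAgent`), the in-memory queue standing in for the local disk, and the `capacity` limit are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of the BE and DFO delivery modes; none of this is Flume's API. */
public class DeliveryModesSketch {

    /** Hypothetical next recipient in the data flow; can be switched off. */
    static class NextHop {
        boolean up = true;
        final List<String> received = new ArrayList<>();

        boolean send(String event) {
            if (!up) {
                return false;            // simulated outage
            }
            received.add(event);
            return true;
        }
    }

    /** Best effort: try once and discard the event if the send fails. */
    static class BestEffortAgent {
        private final NextHop next;
        int dropped = 0;

        BestEffortAgent(NextHop next) { this.next = next; }

        void deliver(String event) {
            if (!next.send(event)) {
                dropped++;               // the data is simply lost
            }
        }
    }

    /** Disk failover: buffer undeliverable events locally and keep retrying. */
    static class DiskFailoverAgent {
        private final NextHop next;
        // Assumption: an in-memory queue stands in for the local disk buffer.
        private final Deque<String> buffer = new ArrayDeque<>();
        private final int capacity;      // models the finite local disk space

        DiskFailoverAgent(NextHop next, int capacity) {
            this.next = next;
            this.capacity = capacity;
        }

        void deliver(String event) {
            retryBuffered();             // drain older events first to keep order
            boolean sent = buffer.isEmpty() && next.send(event);
            if (!sent) {
                if (buffer.size() >= capacity) {
                    throw new IllegalStateException("local buffer full");
                }
                buffer.addLast(event);
            }
        }

        void retryBuffered() {
            while (!buffer.isEmpty() && next.send(buffer.peekFirst())) {
                buffer.removeFirst();
            }
        }

        int buffered() { return buffer.size(); }
    }

    public static void main(String[] args) {
        NextHop beHop = new NextHop();
        BestEffortAgent be = new BestEffortAgent(beHop);
        NextHop dfoHop = new NextHop();
        DiskFailoverAgent dfo = new DiskFailoverAgent(dfoHop, 10);

        // Same traffic pattern for both agents: e2 and e3 arrive during an outage.
        String[] events = {"e1", "e2", "e3", "e4"};
        for (int i = 0; i < events.length; i++) {
            boolean outage = (i == 1 || i == 2);
            beHop.up = !outage;
            dfoHop.up = !outage;
            be.deliver(events[i]);
            dfo.deliver(events[i]);
        }

        System.out.println("BE received:  " + beHop.received + ", dropped " + be.dropped);
        System.out.println("DFO received: " + dfoHop.received + ", buffered " + dfo.buffered());
    }
}
```

With the same simulated outage, the best-effort agent loses the two events sent while the next hop is down, whereas the disk-failover agent holds them and delivers them in order once the hop returns. The `capacity` check mirrors the caveat about needing enough local disk space to ride out the outage. The E2E mode is deliberately left out, since it depended on acknowledgements coordinated through the master.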

In June 2011, Cloudera moved control of the Flume project to the Apache Software Foundation. It graduated from incubator status a year later, in 2012. During the incubation year, work had already begun to refactor Flume under the Star Trek-themed tag Flume-NG (Flume the Next Generation).