Apache Flume: Distributed Log Collection for Hadoop


  • Integrate Flume with your data sources
  • Transcode your data en route in Flume
  • Route and separate your data using regular expression matching
  • Configure failover paths and load balancing to remove single points of failure
  • Utilize gzip compression for files written to HDFS
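As a taste of that last point, here is a hedged sketch of what a gzip-compressed HDFS sink configuration might look like, following the property names in the Flume 1.x user guide. The agent and component names (`a1`, `k1`, `c1`) and the HDFS path are illustrative, not taken from the book:

```properties
# Hypothetical agent/component names; the hdfs.* keys follow the Flume 1.x docs.
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Event-time escapes (%Y/%m/%d) require a timestamp header on each event
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y/%m/%d
# Write a compressed stream using the gzip codec (note the key is "codeC")
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = gzip
```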

Book Details

Language: English
Paperback: 108 pages [235mm x 191mm]
Release Date: July 2013
ISBN-10: 1782167919
ISBN-13: 9781782167914
Author(s): Steve Hoffman
Topics and Technologies: All Books, Big Data and Business Intelligence, Data, Open Source

Table of Contents

Preface
Chapter 1: Overview and Architecture
Chapter 2: Flume Quick Start
Chapter 3: Channels
Chapter 4: Sinks and Sink Processors
Chapter 5: Sources and Channel Selectors
Chapter 6: Interceptors, ETL, and Routing
Chapter 7: Monitoring Flume
Chapter 8: There Is No Spoon – The Realities of Real-time Distributed Data Collection
Index
  • Chapter 1: Overview and Architecture
    • Flume 0.9
    • Flume 1.X (Flume-NG)
    • The problem with HDFS and streaming data/logs
    • Sources, channels, and sinks
    • Flume events
      • Interceptors, channel selectors, and sink processors
      • Tiered data collection (multiple flows and/or agents)
  • Chapter 2: Flume Quick Start
    • Downloading Flume
      • Flume in Hadoop distributions
    • Flume configuration file overview
    • Starting up with "Hello World"
    • Summary
  • Chapter 4: Sinks and Sink Processors
    • HDFS sink
      • Path and filename
      • File rotation
    • Compression codecs
    • Event serializers
      • Text output
      • Text with headers
      • Apache Avro
      • File type
        • Sequence file
        • Data stream
        • Compressed stream
      • Timeouts and workers
    • Sink groups
      • Load balancing
      • Failover
    • Summary
  • Chapter 5: Sources and Channel Selectors
    • The problem with using tail
    • The exec source
    • The spooling directory source
    • Syslog sources
      • The syslog UDP source
      • The syslog TCP source
      • The multiport syslog TCP source
    • Channel selectors
      • Replicating
      • Multiplexing
    • Summary
  • Chapter 6: Interceptors, ETL, and Routing
    • Interceptors
      • Timestamp
      • Host
      • Static
      • Regular expression filtering
      • Regular expression extractor
      • Custom interceptors
    • Tiering data flows
      • Avro Source/Sink
      • Command-line Avro
      • Log4J Appender
      • The Load Balancing Log4J Appender
    • Routing
    • Summary
  • Chapter 7: Monitoring Flume
    • Monitoring the agent process
      • Monit
      • Nagios
    • Monitoring performance metrics
      • Ganglia
      • The internal HTTP server
      • Custom monitoring hooks
    • Summary

Steve Hoffman

Steve Hoffman has 30 years of software development experience and holds a B.S. in computer engineering from the University of Illinois Urbana-Champaign and an M.S. in computer science from DePaul University. He is currently a Principal Engineer at Orbitz Worldwide. More information on Steve can be found at http://bit.ly/bacoboy or on Twitter @bacoboy. This is Steve's first book.



Errata

1 submitted: last submission 12 Mar 2014

Type: Grammar | Page number: 51

In the "The spooling directory source" section:

"It also assumes that filenames never change; otherwise, the source would loose its place on restarts as to which files have been sent and which have not."

Should be:

"It also assumes that filenames never change; otherwise, the source would lose its place on restarts as to which files have been sent and which have not."



What you will learn from this book

• Understand the Flume architecture
• Download and install open source Flume from Apache
• Discover when to use a memory or file-backed channel
• Understand and configure the Hadoop Distributed File System (HDFS) sink
• Learn how to use sink groups to create redundant data flows
• Configure and use various sources for ingesting data
• Inspect data records and route them to different or multiple destinations based on payload content
• Transform data en route to Hadoop
• Monitor your data flows

In Detail

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with many failover and recovery mechanisms.
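That source-to-channel-to-sink architecture is easiest to see in a configuration file. The sketch below is the canonical "Hello World"-style example from the Flume user guide, not an excerpt from this book: a netcat source feeding a logger sink through a memory channel, with conventional (illustrative) names `a1`, `r1`, `c1`, and `k1`:

```properties
# One agent (a1) with one source, one channel, and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory

# Sink: log events at INFO level (useful for testing)
a1.sinks.k1.type = logger

# Wire them together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```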

Apache Flume: Distributed Log Collection for Hadoop covers the problems with HDFS and streaming data/logs, and how Flume can resolve them. The book explains the generalized architecture of Flume, including moving data to and from databases and NoSQL-style data stores, as well as optimizing performance. It includes real-world scenarios of Flume implementation.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and compilation of Flume.

It gives you a heads-up on how to use channels and channel selectors. For each architectural component (sources, channels, sinks, channel selectors, sink processors, sink groups, and so on), the various implementations are covered in detail along with their configuration options, so you can customize Flume to your specific needs. Pointers are also given on writing custom implementations to help you learn about and implement them.
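As one hedged illustration of the channel selector configuration mentioned above, a multiplexing selector routes events to different channels based on an event header. The agent/channel names and the `state` header here are invented for the example, following the pattern in the Flume user guide:

```properties
# Route on the value of a (hypothetical) "state" event header
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
# Events with state=ERROR go to c1; everything else defaults to c2
a1.sources.r1.selector.mapping.ERROR = c1
a1.sources.r1.selector.default = c2
```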

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.

Approach

A starter guide that covers Apache Flume in detail.

Who this book is for

Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.
