Chapter 3. Channels

In Flume, a channel is the construct used between sources and sinks. It provides a buffer for your in-flight events after they are read from sources until they can be written to sinks in your data processing pipelines.
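
To make the role of the channel concrete, here is a minimal sketch of a complete agent definition; the netcat source and logger sink are placeholder choices used only to show where the channel sits between them:

# One source, one channel, one sink for the agent named "agent"
agent.sources=s1
agent.channels=c1
agent.sinks=k1

# Source: reads newline-terminated text from a TCP port (placeholder choice)
agent.sources.s1.type=netcat
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.sources.s1.channels=c1

# Channel: buffers events between the source and the sink
agent.channels.c1.type=memory

# Sink: writes events to the agent's log (placeholder choice)
agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1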

The primary types we'll cover here are a memory-backed/nondurable channel and a local-filesystem-backed/durable channel. Starting with Flume 1.5, an experimental hybrid memory and file channel called the Spillable Memory Channel was introduced. The durable file channel flushes all changes to disk before acknowledging receipt of the event to the sender. This is considerably slower than using the nondurable memory channel, but it provides recoverability in the event of system or Flume agent restarts. Conversely, the memory channel is much faster, but a failure results in data loss, and it has much lower storage capacity than the multiterabyte disks backing the file channel. This is why the Spillable Memory Channel was created. In theory, you get...

The memory channel


A memory channel, as expected, is a channel where in-flight events are stored in memory. As memory is (usually) orders of magnitude faster than disk, events can be ingested much more quickly, resulting in reduced hardware needs. The downside of using this channel is that an agent failure (hardware problem, power outage, JVM crash, Flume restart, and so on) results in the loss of data. Depending on your use case, this might be perfectly fine. System metrics usually fall into this category, as a few lost data points aren't the end of the world. However, if your events represent purchases on your website, then a memory channel would be a poor choice.

To use the memory channel, set the type parameter on your named channel to memory.

agent.channels.c1.type=memory

This defines a memory channel named c1 for the agent named agent.

Here is a table of configuration parameters you can adjust from the default values:

Key         Required    Type      Default
type        Yes         String    memory
capacity    ...
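
As an illustrative sketch only (these are not the defaults), a memory channel tuned for a heavier load might look like the following; the capacity and transactionCapacity values are arbitrary and should be sized to your own workload and heap:

agent.channels.c1.type=memory
# Maximum number of events the channel can hold (illustrative value)
agent.channels.c1.capacity=10000
# Maximum number of events per transaction with a source or sink (illustrative value)
agent.channels.c1.transactionCapacity=1000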

The file channel


A file channel is a channel that stores events on the local filesystem of the agent. Though it's slower than the memory channel, it provides a durable storage path that can survive most issues, and it should be used when a gap in your data flow is undesirable.

This durability is provided by a combination of a Write Ahead Log (WAL) and one or more file storage directories. The WAL is used to track all input and output from the channel in an atomically safe way. This way, if the agent is restarted, the WAL can be replayed to make sure all the events that came into the channel (puts) have been written out (takes) before the stored data can be purged from the local filesystem.
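
As a sketch, a file channel definition points the checkpoint (the WAL bookkeeping) and the data directories at local disks; the paths shown here are illustrative assumptions, not defaults:

agent.channels.c1.type=file
# Directory where the channel's checkpoint is written (illustrative path)
agent.channels.c1.checkpointDir=/flume/c1/checkpoint
# One or more comma-separated directories holding the event data files (illustrative path)
agent.channels.c1.dataDirs=/flume/c1/data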

Additionally, the file channel supports the encryption of data written to the filesystem if your data handling policy requires that all data on the disk (even temporarily) be encrypted. I won't cover this here, but should you need it, there is an example in the Flume User Guide (http://flume.apache...

Spillable Memory Channel


Introduced in Flume 1.5, the Spillable Memory Channel is a channel that acts like a memory channel until it is full. At that point, it acts like a file channel that is configured with a much larger capacity than its memory counterpart but runs at the speed of your disks (which means orders of magnitude slower).

Note

The Spillable Memory Channel is still considered experimental. Use it at your own risk!

I have mixed feelings about this new channel type. On the surface, it seems like a good idea, but in practice, I can see problems. Specifically, having a channel whose speed varies depending on how the downstream entities in your data pipeline behave makes capacity planning difficult. Since a memory channel is used under normal conditions, the data it contains can still be lost, so why would I go through extra trouble to save only some of it to the disk? Either the data is important enough for me to spool it to disk with a file-backed channel, or it's less important...
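
If you do decide to experiment with it, a minimal configuration sketch might look like the following; the memoryCapacity and overflowCapacity values and the overflow paths are illustrative assumptions only:

agent.channels.c1.type=SPILLABLEMEMORY
# Number of events held in memory before spilling to disk (illustrative value)
agent.channels.c1.memoryCapacity=10000
# Capacity of the disk overflow, which is backed by a file channel (illustrative value)
agent.channels.c1.overflowCapacity=1000000
# File channel settings used for the overflow (illustrative paths)
agent.channels.c1.checkpointDir=/flume/spill/checkpoint
agent.channels.c1.dataDirs=/flume/spill/data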

Summary


In this chapter, we covered the two channel types you are most likely to use in your data processing pipelines.

The memory channel offers speed at the cost of data loss in the event of a failure. The file channel, in contrast, provides more reliable transport, in that it can tolerate agent failures and restarts, at a performance cost.

You will need to decide which channel is appropriate for your use cases. When trying to decide whether a memory channel is appropriate, ask yourself what the monetary cost is if you lose some data. Weigh that against the additional costs of more hardware to cover the difference in performance when deciding if you need a durable channel after all. Another consideration is whether or not the data can be resent. Not all data you might ingest into Hadoop will come from streaming application logs. If you receive "daily downloads" of data, you can get away with using a memory channel because if you encounter a problem, you can always rerun the import.

Finally...
