
Capacity planning


Regardless of how much data you think you have, things will change over time. New projects will pop up, and data creation rates for your existing projects will change (up or down). Data volume will usually ebb and flow with the traffic of the day. Finally, the number of servers feeding your Hadoop cluster will change as well.

There are many schools of thought on how much extra storage capacity you should keep in your Hadoop cluster. We use the totally unscientific value of 20 percent: we plan for 80 percent full when ordering additional hardware, but don't start to panic until utilization hits the 85 to 90 percent range. Generally, you want to keep enough free space that the failure and/or maintenance of a server or two won't cause HDFS block replication to consume all of the remaining space.
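To see why that headroom matters, here is a rough back-of-the-envelope sketch (the node counts, capacities, and the assumption that data is spread evenly across nodes are all hypothetical, not from the book). It checks whether a cluster at a given utilization can absorb the loss of one DataNode, given that HDFS must re-create the replicas the dead node held:

# Can a cluster at a given utilization absorb the loss of one DataNode
# without HDFS block re-replication exhausting the remaining free space?
# Assumes used and free space are spread evenly across nodes (hypothetical).
def headroom_after_node_loss(nodes, node_capacity_tb, used_fraction):
    total_tb = nodes * node_capacity_tb
    used_tb = total_tb * used_fraction
    free_tb = total_tb - used_tb
    # The dead node takes its share of the free space with it...
    free_lost_tb = node_capacity_tb * (1 - used_fraction)
    # ...and HDFS must re-create the replicas it held on the survivors.
    rereplicate_tb = used_tb / nodes
    return free_tb - free_lost_tb - rereplicate_tb

# A 10-node cluster of 40 TB nodes at 80 percent full survives comfortably:
print(headroom_after_node_loss(10, 40.0, 0.80))  # 40.0 TB still free
# The same cluster at 90 percent full has exactly nothing left:
print(headroom_after_node_loss(10, 40.0, 0.90))  # 0.0 TB

Losing a second node at the 80 percent mark would consume essentially all of that remaining 40 TB, which is why waiting until 85 to 90 percent utilization to order hardware is already cutting it close.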

You may also need to set up multiple flows inside a single agent. The source and sink processors are currently single-threaded, so there is some limit...
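As a concrete illustration of a multi-flow agent, here is a minimal sketch of one agent running two independent source/channel/sink pipelines (the names, paths, and component choices are illustrative, not from the book); each flow gets its own single-threaded source and sink rather than funneling everything through one pipeline:

# Two independent flows in a single agent named 'agent'
agent.sources = src1 src2
agent.channels = ch1 ch2
agent.sinks = sink1 sink2

# Flow 1: tail one application log into HDFS
agent.sources.src1.type = exec
agent.sources.src1.command = tail -F /var/log/app1.log
agent.sources.src1.channels = ch1
agent.channels.ch1.type = memory
agent.sinks.sink1.type = hdfs
agent.sinks.sink1.hdfs.path = /flume/app1
agent.sinks.sink1.channel = ch1

# Flow 2: a second, completely separate pipeline in the same agent
agent.sources.src2.type = exec
agent.sources.src2.command = tail -F /var/log/app2.log
agent.sources.src2.channels = ch2
agent.channels.ch2.type = memory
agent.sinks.sink2.type = hdfs
agent.sinks.sink2.hdfs.path = /flume/app2
agent.sinks.sink2.channel = ch2

Because the two flows share nothing but the JVM, a slow sink1 can back up ch1 without stalling the second flow.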
