Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Apache Flume: Distributed Log Collection for Hadoop

You're reading from  Apache Flume: Distributed Log Collection for Hadoop

Product type Book
Published in Jul 2013
Publisher Packt
ISBN-13 9781782167914
Pages 108 pages
Edition 1st Edition
Languages

The problem with HDFS and streaming data/logs


HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, for example being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.

In a regular Portable Operating System Interface (POSIX) style filesystem, if you open a file and write data, it still exists on disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to disk. Furthermore, if that writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).

In HDFS the file exists only as a directory entry, it shows as having zero length until the file is closed. This means if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so you can close them as soon as possible.

The problem is Hadoop doesn't like lots of tiny files. Since the HDFS metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce prospective, tiny files lead to poor efficiency. Usually, each mapper is assigned a single block of a file as input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionally high compared to the data it is processing. This kind of block fragmentation also results in more mapper tasks increasing the overall job run times.

These factors need to be weighed when determining the rotation period to use when writing to HDFS. If the plan is to keep the data around for a short time, then you can lean towards the smaller file size. However, if you plan on keeping the data for very long time, you can either target larger files or do some periodic cleanup to compact smaller files into fewer larger files to make them more MapReduce friendly. After all, you only ingest the data once, but you might run a MapReduce job on that data hundreds or thousands of times.

You have been reading a chapter from
Apache Flume: Distributed Log Collection for Hadoop
Published in: Jul 2013 Publisher: Packt ISBN-13: 9781782167914
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}