You're reading from Apache Hive Essentials

Product type: Book
Published in: Feb 2015
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783558575
Edition: 1st Edition
Author: Dayong Du

Dayong Du has dedicated his career of more than 10 years to enterprise data and analytics, especially enterprise use cases built on open source big data technologies such as Hadoop, Hive, HBase, and Spark. Dayong is a big data practitioner as well as an author and coach. He has published the first and second editions of Apache Hive Essentials and has coached many people who are interested in learning and using big data technology. In addition, he is a seasonal blogger, contributor, and advisor for big data start-ups, and a co-founder of the Toronto big data professional association.

Batch, real-time, and stream processing


Batch processing handles data in batches: it reads the input data, processes it, and writes the output. Apache Hadoop is the best-known open source implementation of batch processing, a distributed system built on the MapReduce paradigm. The data is stored in a shared, distributed filesystem called the Hadoop Distributed File System (HDFS) and is divided into splits, which are the logical units of data for MapReduce processing. To process these splits, a map task reads a split, passes each of its key/value pairs to a map function, and writes the results to intermediate files. After the map phase is complete, the reduce task reads the intermediate files and passes their contents to the reduce function. Finally, the reduce task writes the results to the final output files. The advantages of the MapReduce model include easier distributed programming, near-linear speedup, good scalability, and fault tolerance. The disadvantage of this batch processing model is that it cannot efficiently execute recursive or iterative jobs. In addition, batch behavior means that all map output must be ready before the reduce phase starts, which makes MapReduce unsuitable for online and stream processing use cases.
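The map, intermediate, and reduce steps described above can be sketched in a few lines of plain Python. This is a single-process illustration only, not the Hadoop API: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, the splits are in-memory lists rather than HDFS blocks, and word counting is chosen simply as a familiar example job.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map function: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce function: combine all values collected for one key."""
    return word, sum(counts)


def run_job(splits):
    """Run a toy MapReduce job over a list of splits (each a list of lines)."""
    # Map phase: one "map task" per split; its output stands in for the
    # intermediate files that a real map task would write.
    intermediate = []
    for split in splits:
        for offset, line in enumerate(split):
            intermediate.extend(map_fn(offset, line))

    # Grouping by key happens only after every map task has finished,
    # which is the batch behavior described in the text.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one reduce call per key, results go to the final output.
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


# Hypothetical input: two splits of a small text dataset.
splits = [
    ["hive runs on hadoop", "hadoop stores data in hdfs"],
    ["hive queries data"],
]
word_counts = run_job(splits)
print(word_counts)
```

Because the grouping step needs the complete map output, nothing reaches `reduce_fn` until the last split has been mapped, which is why this model does not fit continuously arriving data.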

Real-time processing means processing data and getting the result almost immediately. In the area of real-time ad hoc queries over big data, this concept was first implemented by Google in Dremel. Dremel uses a novel columnar storage format for nested structures with fast indexing, together with scalable aggregation algorithms that compute query results in parallel rather than in batch sequences. These two techniques are the major characteristics of real-time processing and are used by similar implementations, such as Cloudera Impala, Facebook Presto, Apache Drill, and Hive on Tez, which is driven by the Stinger initiative and its goal of a 100x performance improvement over Apache Hive. In-memory computing offers another route to real-time processing. Memory provides very high bandwidth, more than 10 gigabytes/second, compared with about 200 megabytes/second for hard disks. Latency is also much lower: nanoseconds for memory versus milliseconds for hard disks. As the price of RAM keeps falling, in-memory computing is becoming more affordable as a real-time solution. Apache Spark is a popular open source implementation of in-memory computing. Spark integrates easily with Hadoop, and its resilient distributed datasets (RDDs) can be generated from data sources such as HDFS and HBase for efficient caching.
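To put the bandwidth figures above in perspective, the following back-of-the-envelope calculation compares how long one sequential scan would take at each rate. The 1 TB dataset size is an assumption made for illustration, the units are decimal (1 GB = 10^9 bytes), and the sketch ignores latency, parallelism across machines, and caching.

```python
# Bandwidth figures quoted in the text, in bytes per second (decimal units).
MEMORY_BANDWIDTH = 10 * 10**9   # 10 gigabytes/second (the text says "more than")
DISK_BANDWIDTH = 200 * 10**6    # 200 megabytes/second

# Hypothetical dataset size, chosen only for illustration: 1 terabyte.
DATASET_BYTES = 10**12


def scan_seconds(dataset_bytes, bandwidth_bytes_per_s):
    """Time to read the whole dataset once at a constant bandwidth."""
    return dataset_bytes / bandwidth_bytes_per_s


memory_seconds = scan_seconds(DATASET_BYTES, MEMORY_BANDWIDTH)
disk_seconds = scan_seconds(DATASET_BYTES, DISK_BANDWIDTH)
speedup = disk_seconds / memory_seconds

print(f"memory scan: {memory_seconds:.0f} s")
print(f"disk scan:   {disk_seconds:.0f} s")
print(f"ratio:       {speedup:.0f}x")
```

Under these assumptions the scan takes 100 seconds from memory and 5,000 seconds from a single disk, a 50x difference from bandwidth alone.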

Stream processing means continuously processing and acting on live stream data to get a result. There are two popular stream processing frameworks: Storm (https://storm.apache.org/) from Twitter and S4 (http://incubator.apache.org/s4/) from Yahoo!. Both frameworks run on the Java Virtual Machine (JVM), and both process keyed streams. In terms of the programming model, an S4 program is defined as a graph of Processing Elements (PEs), which are small subprograms, and S4 instantiates one PE per key. In short, Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework.
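The idea of a keyed stream with one PE instance per key can be illustrated with a small plain-Python sketch. It does not use the S4 or Storm APIs (both are JVM frameworks); the classes `CountingPE` and `KeyedStream` and the sample events are hypothetical and only mimic the dispatch pattern.

```python
class CountingPE:
    """Hypothetical processing element: keeps a running event count for one key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        # Act on each event as it arrives instead of waiting for a full batch.
        self.count += 1
        return self.key, self.count


class KeyedStream:
    """Routes each event to the PE that owns its key, creating PEs on demand."""

    def __init__(self, pe_factory):
        self._pe_factory = pe_factory
        self._pes = {}

    def dispatch(self, key, event):
        pe = self._pes.get(key)
        if pe is None:
            # First event seen for this key: instantiate a dedicated PE.
            pe = self._pe_factory(key)
            self._pes[key] = pe
        return pe.process(event)

    def pe_count(self):
        return len(self._pes)


stream = KeyedStream(CountingPE)
# Hypothetical live events as (key, payload) pairs.
events = [("user_a", "click"), ("user_b", "view"), ("user_a", "view")]
results = [stream.dispatch(key, payload) for key, payload in events]
print(results)
print(stream.pe_count())
```

Each key gets its own small piece of state, so a result is available after every event rather than only at the end of a batch.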
