
You're reading from Apache Hive Essentials - Second Edition

Product type: Book

Published: Jun 2018

Publisher: Packt

ISBN-13: 9781788995092

Pages: 210

Edition: 2nd

Languages: Java

Concepts: Data Analysis

Author: Dayong Du

Batch processing handles data in bulk: a job reads a bounded set of input, processes it, and writes the result to the output. Apache Hadoop is the best-known open source implementation of a distributed batch processing system based on the MapReduce paradigm. The data is stored in a shared, distributed file system called the Hadoop Distributed File System (HDFS) and is divided into splits, which are the logical divisions of the data for MapReduce processing.
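As a rough illustration of how input is divided into splits, the following minimal sketch cuts a file into fixed-size logical ranges. The function name `compute_splits`, the 300 MB file size, and the 128 MB split size are assumptions chosen for illustration; the sketch does not reproduce Hadoop's actual split logic, which also takes block locations and record boundaries into account.

```python
MB = 1024 * 1024

def compute_splits(file_size, split_size):
    """Return (offset, length) pairs that cover the file in order.

    Hypothetical helper: every split is split_size bytes long except
    possibly the last one, which holds whatever remains.
    """
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# Assumed example: a 300 MB file cut into 128 MB splits.
file_splits = compute_splits(300 * MB, 128 * MB)
for offset, length in file_splits:
    print(f"split at offset {offset // MB} MB, length {length // MB} MB")
```

Each split can then be handed to a separate map task, which is what lets the work proceed in parallel.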

To process these splits using the MapReduce paradigm, each map task reads a split, passes each of its key/value pairs to a map function, and writes the results to intermediate files. After the map phase completes, each reduce task reads the intermediate data delivered to it by the shuffle process and passes it to the reduce function. Finally, the reduce task writes its results to the final output files. The advantages of the MapReduce model include simpler distributed programming, near-linear speed-up, good scalability, and fault tolerance. The disadvantage of this batch processing model is that it handles recursive or iterative jobs poorly. In addition, its batch nature means that all map output must be ready before the reduce phase can start, which makes MapReduce unsuitable for online and stream-processing use cases.
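The map, shuffle, and reduce steps described above can be sketched in a few lines of plain Python. This is a single-process word count under simplifying assumptions, not Hadoop's API: the names `map_fn`, `reduce_fn`, `run_map_task`, `shuffle`, and `run_job` are hypothetical, splits are in-memory lists of lines, and intermediate results stay in memory instead of being written to files.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map function: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce function: sum all counts collected for one word."""
    return word, sum(counts)

def run_map_task(split):
    """Map task: pass every key/value pair of one split to map_fn."""
    intermediate = []
    for line_number, line in enumerate(split):
        intermediate.extend(map_fn(line_number, line))
    return intermediate

def shuffle(map_outputs):
    """Shuffle: group intermediate values by key across all map outputs."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def run_job(splits):
    # The reduce step starts only after every map task has finished,
    # which mirrors the batch behavior described in the text.
    map_outputs = [run_map_task(split) for split in splits]
    grouped = shuffle(map_outputs)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))

# Assumed sample input: two splits holding three lines in total.
splits = [["hive on hadoop", "hadoop stores data"], ["hive queries data"]]
result = run_job(splits)
print(result)
```

In a real cluster the map tasks run on different nodes and the shuffle moves data over the network, but the flow of key/value pairs follows the same pattern.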

Real-time processing is used to process data and get the result almost immediately. For ad hoc queries over big data, this approach was first implemented by Google in Dremel. Dremel uses a novel columnar storage format for nested structures, together with fast indexing and scalable aggregation algorithms that compute query results in parallel rather than as a sequence of batch jobs. These two techniques are the defining characteristics of real-time processing and are used by similar implementations, such as Impala (https://impala.apache.org/), Presto (https://prestodb.io/), and Drill (https://drill.apache.org/), which rely on columnar data formats such as Parquet (https://parquet.apache.org/), ORC (https://orc.apache.org/), CarbonData (https://carbondata.apache.org/), and Arrow (https://arrow.apache.org/).

In-memory computing, on the other hand, offers an even faster route to real-time processing. Memory provides very high bandwidth, more than 10 gigabytes/second compared to about 200 megabytes/second for a hard disk. Its latency is also far lower: nanoseconds rather than the milliseconds of a hard disk. As the price of RAM keeps falling, in-memory computing becomes more affordable as a real-time solution. Apache Spark (https://spark.apache.org/) is a popular open source implementation of in-memory computing. Spark integrates easily with Hadoop, and its in-memory data structure, the Resilient Distributed Dataset (RDD), can be generated from data sources such as HDFS and HBase for efficient caching.
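To put the bandwidth figures quoted above in perspective, the following back-of-the-envelope sketch estimates how long one sequential pass over a dataset would take at each rate. The 1 TB dataset size and the decimal units are assumptions for illustration, and the calculation ignores latency, parallelism, and caching effects.

```python
# Bandwidth figures taken from the text, in decimal units.
DISK_BYTES_PER_SECOND = 200 * 10**6    # 200 megabytes/second
MEMORY_BYTES_PER_SECOND = 10 * 10**9   # 10 gigabytes/second

# Assumed dataset size for illustration: 1 terabyte.
DATASET_BYTES = 10**12

def scan_seconds(size_bytes, bytes_per_second):
    """Time for one sequential pass over the data at a fixed bandwidth."""
    return size_bytes / bytes_per_second

disk_seconds = scan_seconds(DATASET_BYTES, DISK_BYTES_PER_SECOND)
memory_seconds = scan_seconds(DATASET_BYTES, MEMORY_BYTES_PER_SECOND)
speedup = disk_seconds / memory_seconds

print(f"disk scan:   {disk_seconds:.0f} s")    # 5000 s
print(f"memory scan: {memory_seconds:.0f} s")  # 100 s
print(f"speed-up:    {speedup:.0f}x")          # 50x
```

Under these assumptions, a scan that takes well over an hour from a single disk finishes in under two minutes from memory, which is why caching a working set in RAM matters for interactive queries.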

Stream processing is used to continuously process live streaming data and act on it as it arrives. Two commonly used general-purpose stream processing frameworks are Storm (https://storm.apache.org/) and Flink (https://flink.apache.org/). Both run on the Java Virtual Machine (JVM), and both process keyed streams. In terms of the programming model, Storm gives you the basic tools to build a framework, while Flink gives you a well-defined and easy-to-use framework. In addition, Samza (http://samza.apache.org/) and Kafka Streams (https://kafka.apache.org/documentation/streams/) leverage Kafka for both message caching and transformation. More recently, Spark has also added a form of stream processing through its continuous-processing mode.
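The idea of processing a keyed stream can be sketched with a Python generator that updates per-key state as each event arrives. This is only a conceptual illustration and does not use the Storm or Flink APIs: the names `keyed_running_count` and `click_stream`, and the sample events, are hypothetical.

```python
from collections import defaultdict

def keyed_running_count(events):
    """Consume (key, value) events one at a time.

    After each event, emit the updated count for that event's key,
    so results are available before the stream ends.
    """
    counts = defaultdict(int)
    for key, _value in events:
        counts[key] += 1
        yield key, counts[key]

def click_stream():
    """Assumed sample source: page views keyed by user."""
    yield ("user_a", "/home")
    yield ("user_b", "/home")
    yield ("user_a", "/cart")

updates = list(keyed_running_count(click_stream()))
for key, count in updates:
    print(f"{key}: {count} events so far")
```

Unlike the batch model, nothing here waits for the input to be complete: each event produces an output immediately, and the only state kept is the running count per key.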