You're reading from Mastering Hadoop 3

Product type Book

Published in Feb 2019

Publisher Packt

ISBN-13 9781788620444

Pages 544 pages

Edition 1st Edition

Languages

Java

Concepts

Data Processing

Authors (2):

Chanchal Singh

Manish Kumar

Chapter 5. SQL on Hadoop

Hadoop is traditionally used as a file system with the capability to process high data volumes using distributed algorithms. However, with its growing popularity among non-programmers and business analysts, there is a need to read and manipulate high-volume records using simple, well-known interfaces. SQL has long been popular among non-programmers and data analysts because of its simple constructs and easy-to-understand logical syntax. Since Hadoop is used as storage for large volumes of data, and because data exploration on top of Hadoop is one of the key use cases, SQL is a natural fit. With those goals in mind, many SQL engines have been developed to process and explore data stored in the Hadoop file system. Most of them are open source. We will look at them one by one in the following sections.

In this chapter, we will cover the following topics:

Presto
Hive
Impala

Technical requirements

You will need Hadoop 3.0.

The code files of this chapter can be found on GitHub: https://github.com/PacktPublishing/Mastering-Hadoop-3/tree/master/Chapter05

Check out the following video to see the code in action: http://bit.ly/2GOVwKt

Presto – introduction

The growing popularity of big data use cases has brought many new technologies and frameworks, each designed with scalability, high throughput, and low latency in mind. Some companies have very large data warehouses storing hundreds of petabytes of data, and the data is used for various applications such as machine learning, batch analytics, and more. Technical engineering teams use the data to get insights into the business, which helps improve products or services and yields new opportunities to generate more revenue.

The performance of a data warehouse plays an important role, as fast results help in quicker decision making. Data warehouses should be able to run queries in parallel and return results quickly to help businesses increase their productivity and profitability. It is also important to monitor the cost of the warehouse, which has an impact on the profitability of the organization. Hadoop came to...

Hive

The Hadoop ecosystem has helped organizations save costs when working with large datasets. Most Hadoop implementations use commodity hardware for storage and processing. This helps companies build low-cost infrastructures that provide high availability and scalable processing power. However, jobs for Hadoop's MapReduce processing model were mostly written in Java, while the existing data storage infrastructure was mostly built on traditional relational databases that use SQL for data processing. Thus, a tool was needed that could provide similar SQL functionality in the Hadoop ecosystem.

Hive is a data warehouse tool that can process huge amounts of data stored in a distributed storage system, such as HDFS, using SQL-like queries. The user writes queries in the Hive query language, which is very similar to SQL. Hive was developed with the purpose of easing the job of data warehouse users who have strong knowledge of SQL queries and who find it difficult to adopt Java or other languages...

Impala

Impala is a modern, open source massively parallel processing (MPP) SQL engine designed to work with a Hadoop environment. It provides the ability to execute queries with low latency. Hive does not meet expectations for use cases that require interactive analytics in a multi-user environment. Impala is integrated into the Hadoop environment and uses a number of standard Hadoop components such as Metastore, HDFS, HBase, YARN, and Sentry. Unlike Hive, it does not run MapReduce jobs to get results. Hive uses the MapReduce engine for execution, and the intermediate output results are stored on disk, where they act as the input to another job.
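The difference between chaining jobs through disk and passing rows straight from one stage to the next can be illustrated with a toy sketch. This is not how Hive or Impala are implemented; the sample rows and the function names (`filter_rows`, `count_by_user`, `run_staged_on_disk`, `run_pipelined_in_memory`) are hypothetical and only show the two execution styles producing the same answer.

```python
import json
import os
import tempfile
from collections import Counter

# Hypothetical sample data standing in for a table.
ROWS = [
    {"user": "a", "bytes": 120},
    {"user": "b", "bytes": 30},
    {"user": "a", "bytes": 75},
    {"user": "c", "bytes": 10},
]


def filter_rows(rows, min_bytes):
    """Stage 1: keep rows at or above a size threshold."""
    for row in rows:
        if row["bytes"] >= min_bytes:
            yield row


def count_by_user(rows):
    """Stage 2: count the surviving rows per user."""
    return dict(Counter(row["user"] for row in rows))


def run_staged_on_disk(rows, min_bytes):
    """Job 1 writes its output to a file; job 2 reads that file as its input."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    try:
        with os.fdopen(fd, "w") as out:
            for row in filter_rows(rows, min_bytes):
                out.write(json.dumps(row) + "\n")
        with open(path) as src:
            return count_by_user(json.loads(line) for line in src)
    finally:
        os.remove(path)


def run_pipelined_in_memory(rows, min_bytes):
    """Rows flow from stage 1 to stage 2 without touching disk."""
    return count_by_user(filter_rows(rows, min_bytes))


staged = run_staged_on_disk(ROWS, 50)
pipelined = run_pipelined_in_memory(ROWS, 50)
print(staged == pipelined, pipelined)
```

Both paths return the same counts; the staged version simply pays for an extra write and read of the intermediate result, which is the overhead the paragraph above attributes to job chaining.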

Impala architecture

Impala is a massively parallel processing (MPP) distributed query execution engine. It utilizes the resources of an existing Hadoop cluster and does not use MapReduce, but it still takes advantage of Hadoop's data locality. Let's discuss the Impala architecture and its components in detail.
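The data locality idea mentioned above can be illustrated with a small sketch: each block scan is assigned to a node that already holds a replica of that block, so the data does not have to cross the network. This is a simplified illustration, not Impala's actual scheduler; the block IDs, node names, and the `assign_local_scans` function are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical block-to-replica map, as a file system might report it.
BLOCK_REPLICAS = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
    "blk_4": ["node2", "node3"],
}


def assign_local_scans(block_replicas):
    """Assign each block to the least-loaded node that stores a replica of it."""
    load = defaultdict(int)  # number of scans already given to each node
    assignment = {}
    for block in sorted(block_replicas):
        hosts = block_replicas[block]
        # Pick the replica holder with the fewest scans; break ties by name.
        chosen = min(hosts, key=lambda host: (load[host], host))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


assignment = assign_local_scans(BLOCK_REPLICAS)
for block, node in assignment.items():
    print(block, "->", node)
```

Every scan lands on a node that stores the block locally, and the work is spread across nodes so that they can scan in parallel.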

The following diagram shows the Impala...

Summary

In this chapter, we focused on common SQL components that are used in the Hadoop ecosystem. We also covered the architecture of Hive, Presto, and Impala. Then, we discussed the best practices when using these tools.

In the next chapter, we will focus on the processing engines we use to process huge amounts of data. We will focus more on internal architecture and the in-depth workings of each component. We will also cover a few examples that will help you design your own application.