Mastering Hadoop 3

Product type Book
Published in Feb 2019
Publisher Packt
ISBN-13 9781788620444
Pages 544 pages
Edition 1st Edition
Authors (2): Chanchal Singh, Manish Kumar


Chapter 5. SQL on Hadoop

Hadoop has traditionally been used as a file system with the ability to process high data volumes using distributed algorithms. However, with its growing popularity among non-programmers and business analysts, there is a need to read and manipulate high-volume records through simple, well-known interfaces. SQL has long been popular among non-programmers and data analysts because of its simple constructs and easy-to-understand logical syntax. Because Hadoop stores large volumes of data, and because data exploration on top of Hadoop is one of its key use cases, SQL is a natural fit. With those goals in mind, many SQL engines have been developed to process and explore data stored in the Hadoop file system, and most of them are open source. We will look at them one by one in the following sections.

In this chapter, we will cover the following topics:

  • Presto
  • Hive
  • Impala

Technical requirements


You will need Hadoop 3.0.

The code files of this chapter can be found on GitHub: https://github.com/PacktPublishing/Mastering-Hadoop-3/tree/master/Chapter05

Check out the following video to see the code in action: http://bit.ly/2GOVwKt

Presto – introduction


The growing popularity of big data use cases has brought many new technologies and frameworks, each designed with scalability, high throughput, and low latency in mind. Some companies have very large data warehouses storing hundreds of petabytes of data, and the data is used for various applications such as machine learning, batch analytics, and more. Technical engineering teams use this data to gain insights into the business, which helps improve products or services and opens new opportunities to generate more revenue.

The performance of a data warehouse plays an important role, because fast results lead to quicker decision making. Data warehouses should be able to run queries in parallel and return results in less time to help businesses increase their productivity and profitability. It is also important to monitor the cost of the warehouse, which affects the profitability of the organization. Hadoop came to...

Hive


The Hadoop ecosystem has helped organizations save costs when working with large datasets. Most Hadoop implementations use commodity hardware for storage and processing, which helps companies build low-cost infrastructures that provide high availability and scalable processing power. However, jobs for Hadoop's MapReduce processing model are mostly written in Java, whereas the existing data storage infrastructure was mostly built on traditional relational databases that use SQL for data processing. Thus, a tool was needed that provides similar SQL functionality in the Hadoop ecosystem.

Hive is a data warehouse tool that can process huge amounts of data stored on a distributed storage system, such as HDFS, using SQL-like queries. Users write queries in the Hive query language, which is very similar to other SQL dialects. Hive was developed with the purpose of easing the job of data warehouse users who have strong knowledge of SQL queries and who find it difficult to adopt Java or other languages...
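To make the motivation concrete, the following minimal sketch contrasts a hand-written map, shuffle, and reduce aggregation with the same aggregation expressed as one SQL-like query. It is only an illustration: it runs on Python's built-in sqlite3 module as a local stand-in and does not connect to Hive, and the `page_views` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample records: (country, bytes_served).
ROWS = [("IN", 120), ("US", 300), ("IN", 80), ("DE", 50), ("US", 100)]


def total_by_country_mapreduce(rows):
    """Aggregate with explicit map, shuffle, and reduce steps."""
    # Map: emit one (key, value) pair per input record.
    mapped = [(country, size) for country, size in rows]
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: collapse each group to a single total.
    return {key: sum(values) for key, values in groups.items()}


def total_by_country_sql(rows):
    """Aggregate with one declarative GROUP BY query.

    sqlite3 is used here only as a local stand-in for a SQL engine.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE page_views (country TEXT, bytes_served INTEGER)"
        )
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT country, SUM(bytes_served) "
            "FROM page_views GROUP BY country"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()
```

Both functions return the same totals per country; the SQL version states what result is wanted and leaves the grouping mechanics to the engine, which is the convenience SQL users expect.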

Impala


Impala is a modern, open source massively parallel processing (MPP) SQL engine designed to work with a Hadoop environment. It executes queries with low latency, which addresses use cases that Hive does not serve well: interactive analytics in a multi-user environment. Impala is integrated into the Hadoop environment and uses a number of standard Hadoop components such as Metastore, HDFS, HBase, YARN, and Sentry. Unlike Hive, it does not run MapReduce jobs to get results. Hive uses the MapReduce engine for execution, and the intermediate output results are stored on disk, where they act as input to another job.
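The cost of handing intermediate results between jobs through disk can be pictured with a small single-machine analogy. The sketch below is not how Hive or Impala are implemented; it only contrasts two ways of chaining the same two stages, one that writes the first stage's output to a file before the second stage reads it back, and one that streams records between stages in memory. All function names and the sample records are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample records.
RECORDS = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": -1},
    {"user": "a", "amount": 7},
    {"user": "b", "amount": 3},
]


def stage_filter(records):
    """Stage 1: keep only records with a positive amount."""
    for record in records:
        if record["amount"] > 0:
            yield record


def stage_sum(records):
    """Stage 2: total the amount per user."""
    totals = {}
    for record in records:
        totals[record["user"]] = totals.get(record["user"], 0) + record["amount"]
    return totals


def run_with_disk_handoff(records, workdir):
    """Chain the stages through a file, like one job feeding the next."""
    path = os.path.join(workdir, "stage1.jsonl")
    # Stage 1 writes its complete output to disk first.
    with open(path, "w", encoding="utf-8") as handle:
        for record in stage_filter(records):
            handle.write(json.dumps(record) + "\n")
    # Stage 2 starts only afterwards and reads the file back.
    with open(path, encoding="utf-8") as handle:
        return stage_sum(json.loads(line) for line in handle)


def run_pipelined(records):
    """Chain the stages in memory: records flow straight into stage 2."""
    return stage_sum(stage_filter(records))


def demo():
    """Run both variants on the sample records and return both results."""
    with tempfile.TemporaryDirectory() as workdir:
        return run_with_disk_handoff(RECORDS, workdir), run_pipelined(RECORDS)
```

Both variants produce identical totals. The difference is the extra write and read between stages, which is the kind of overhead that grows with data volume and with the number of chained jobs.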

Impala architecture

Impala is a massively parallel processing (MPP) distributed query execution engine. It utilizes the resources of an existing Hadoop cluster. Although it does not use MapReduce, it still takes advantage of the data locality that Hadoop processing relies on. Let's discuss the Impala architecture and its components in detail.

The following diagram shows the Impala...

Summary 


In this chapter, we focused on common SQL components that are used in the Hadoop ecosystem. We also covered the architecture of Hive, Presto, and Impala. Then, we discussed the best practices when using these tools. 

In the next chapter, we will focus on the processing engines we use to process huge amounts of data. We will focus more on internal architecture and the in-depth workings of each component. We will also cover a few examples that will help you design your own application. 
