You're reading from Apache Hive Essentials - Second Edition

Product type: Book
Published in: Jun 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781788995092
Edition: 2nd Edition
Author (1)
Dayong Du

Dayong Du has dedicated his career of more than 10 years to enterprise data and analytics, especially enterprise use cases built on open source big data technologies such as Hadoop, Hive, HBase, and Spark. Dayong is a big data practitioner as well as an author and coach. He has published the first and second editions of Apache Hive Essentials and has coached many people who are interested in learning and using big data technology. In addition, he is a seasonal blogger, contributor, and advisor for big data start-ups, and a co-founder of the Toronto big data professional association.

Performance Considerations

Although Hive is built to deal with big data processing, we still cannot ignore the importance of performance. Most of the time, a well-written query can rely on Hive's query optimizer, together with the default settings and established best practices, to find the best execution strategy. However, experienced users should learn more about the theory and practice of performance tuning, especially when working on a performance-sensitive project or environment.

In this chapter, we will start by using the utilities available in HQL to find potential issues that cause poor performance. Then, we will introduce best practices for performance in the areas of design, file format, compression, storage, queries, and jobs. We will cover the following topics:

  • Performance utilities
  • Design optimization
  • Data optimization
  • Job optimization
...

Performance utilities

HQL provides the EXPLAIN and ANALYZE statements, which can be used as utilities to inspect and diagnose query performance. In addition, Hive logs contain enough detailed information for performance investigation and troubleshooting.
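As a brief illustration of the ANALYZE statement, the following sketch gathers statistics for a table and then inspects them. The `employee` table name is hypothetical, and the column-level form assumes a Hive version that supports column statistics without an explicit column list.

```sql
-- Hypothetical table: employee
-- Gather table-level statistics such as row count and data size
ANALYZE TABLE employee COMPUTE STATISTICS;

-- Gather column-level statistics for all columns
ANALYZE TABLE employee COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the collected statistics in the table parameters
DESCRIBE FORMATTED employee;
```

Statistics gathered this way are available to the optimizer when it chooses an execution plan.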

EXPLAIN statement

Hive provides an EXPLAIN statement to return a query execution plan without running the query. We can use it to analyze queries if we have concerns about their performance. The EXPLAIN statement also helps us compare the plans of two or more queries written for the same purpose. The syntax for it is as follows:

EXPLAIN [FORMATTED|EXTENDED|DEPENDENCY|AUTHORIZATION] hql_query

The following keywords can be used:

  • FORMATTED: This provides a formatted JSON...
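To make the syntax above concrete, here is a minimal sketch of EXPLAIN applied to a simple aggregation. The `employee` table and its `dept` column are hypothetical names used only for illustration.

```sql
-- Hypothetical table: employee(name STRING, dept STRING)
-- Return the execution plan without running the query
EXPLAIN
SELECT dept, count(*) AS headcount
FROM employee
GROUP BY dept;

-- Same query, asking for additional detail in the plan
EXPLAIN EXTENDED
SELECT dept, count(*) AS headcount
FROM employee
GROUP BY dept;
```

Running EXPLAIN on two alternative formulations of the same query and comparing the resulting stages is one way to judge which is likely to be cheaper.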

Design optimization

Design optimization covers several table design, data format, and job optimization strategies that improve performance. These are covered in the following sections.

Partition table design

Hive partitioning is one of the most effective ways to improve query performance on larger tables. A query with partition filtering will only load data from the specified partitions (sub-directories), so it can execute much faster than a normal query that filters by a non-partitioning field. The selection of the partition key is always an important factor for performance. It should always be a low-cardinality attribute, to avoid the overhead of too many sub-directories. The following are some attributes commonly used as partition keys...
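The effect of partition filtering can be sketched as follows. The table, column, and partition names here are hypothetical and were chosen only to show a low-cardinality partition key.

```sql
-- Hypothetical table partitioned by low-cardinality date attributes
CREATE TABLE employee_partitioned (
  name STRING,
  dept STRING
)
PARTITIONED BY (hire_year INT, hire_month INT);

-- Filters on the partition columns: only the matching
-- hire_year=2018/hire_month=12 sub-directory needs to be read
SELECT name, dept
FROM employee_partitioned
WHERE hire_year = 2018 AND hire_month = 12;

-- Filters on a non-partition column: every partition has to be scanned
SELECT name, hire_year, hire_month
FROM employee_partitioned
WHERE dept = 'Sales';
```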

Data optimization

Data file optimization covers performance improvements on data files in terms of file format, compression, and storage.

File format

Hive supports TEXTFILE, SEQUENCEFILE, AVRO, RCFILE, ORC, and PARQUET file formats. There are two HQL statements used to specify the file format as follows:

  • CREATE TABLE ... STORED AS <file_format>: Specify the file format when creating a table
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT <file_format>: Modify the file format (definition only) in an existing table

Once a table stored in text format is created, we can load text data directly into it. To load text data into tables that have other file formats, we can first load the data into a table...
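One common way to get text data into a table stored in another format is sketched below, under stated assumptions: the table names, the `|` delimiter, and the `/tmp/employee.txt` path are all hypothetical, and ORC is used only as an example target format.

```sql
-- Hypothetical staging table in text format; text data loads into it directly
CREATE TABLE employee_text (
  name STRING,
  dept STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Placeholder path; replace with the real location of the text file
LOAD DATA INPATH '/tmp/employee.txt' INTO TABLE employee_text;

-- Hypothetical target table stored as ORC
CREATE TABLE employee_orc (
  name STRING,
  dept STRING
)
STORED AS ORC;

-- Rewrite the rows from the text table into the ORC table
INSERT OVERWRITE TABLE employee_orc
SELECT name, dept
FROM employee_text;
```

The INSERT ... SELECT step is what actually converts the data. ALTER TABLE ... SET FILEFORMAT, by contrast, changes only the table definition and leaves existing files as they are.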

Job optimization

Job optimization covers the experience and skills needed to improve performance in the areas of job-running mode, JVM reuse, parallel job execution, and query join optimization.

Local mode

Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure it to run in fully distributed mode. When the data to process is small, starting distributed data processing is mostly overhead, since launching a job in fully distributed mode can take longer than processing the data itself. Since v0.7.0, Hive has supported automatic conversion of a job to run in local mode with the following settings:

> SET hive.exec.mode.local.auto=true; -- default false
> SET hive.exec.mode.local...
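As a hedged sketch that does not attempt to reconstruct the truncated listing above, the following shows local-mode properties as they appear in the Hive configuration documentation. The threshold values are believed to match the documented defaults, but property names and defaults have changed between releases, so they should be verified against the Hive version in use.

```sql
-- Let Hive decide automatically whether a job can run in local mode
SET hive.exec.mode.local.auto=true;

-- Assumed threshold: total input size in bytes must be below this value
SET hive.exec.mode.local.auto.inputbytes.max=134217728;

-- Assumed threshold: number of input files must be below this value
SET hive.exec.mode.local.auto.input.files.max=4;
```

With automatic conversion enabled, only jobs whose input stays under these thresholds are candidates for local mode, and larger jobs continue to run on the cluster.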

Summary

In this chapter, we first covered how to identify performance bottlenecks using EXPLAIN and ANALYZE statements. Then, we discussed design optimization for performance when using tables, partitions, and indexes. We also covered data file optimization, including file format, compression, and storage. At the end of this chapter, we discussed job optimization, job engines, and optimizers. After going through this chapter, you should be able to do performance troubleshooting and tuning in Hive. In the next chapter, we'll talk about function extensions for Hive.
