Reader small image

You're reading from  Apache Hive Essentials. - Second Edition

Product typeBook
Published inJun 2018
Reading LevelIntermediate
PublisherPackt
ISBN-139781788995092
Edition2nd Edition
Languages
Tools
Right arrow
Author (1)
Dayong Du
Dayong Du
author image
Dayong Du

Dayong Du has all his career dedicated to enterprise data and analytics for more than 10 years, especially on enterprise use case with open source big data technology, such as Hadoop, Hive, HBase, Spark, etc. Dayong is a big data practitioner as well as author and coach. He has published the 1st and 2nd edition of Apache Hive Essential and coached lots of people who are interested to learn and use big data technology. In addition, he is a seasonal blogger, contributor, and advisor for big data start-ups, co-founder of Toronto big data professional association.
Read more about Dayong Du

Right arrow

Data Aggregation and Sampling

This chapter is about how to aggregate and sample data in HQL. It first covers the use of several aggregate functions, enhanced aggregate functions, and window functions working with a GROUP BY, PARTITION BY statement. Then, it introduces the different ways of sampling data. In this chapter, we will cover the following topics:
  • Basic aggregation
  • Enhanced aggregation
  • Aggregation condition
  • Window functions
  • Sampling

Basic aggregation

Data aggregation is the process of gathering and expressing data in a summary to get more information about particular groups based on specific conditions. HQL offers several built-in aggregate functions, such as max(...), min(...), and avg(...). It also supports advanced aggregation using keywords such as GROUPING SETS, ROLLUP, and CUBE, and different types of window function.

The basic built-in aggregate functions are usually used with the GROUP BY clause. If there is no GROUP BY clause specified, it aggregates over the whole row (all columns) by default. Besides aggregate functions, all columns selected must also be included in the GROUP BY clause. The following are a few examples involving the built-in aggregate functions:

  1. Aggregation without GROUP BY columns:
      > SELECT 
> count(*) as rowcnt1,
> count(1) as rowcnt2 -- same to...

Enhanced aggregation

Hive offers enhanced aggregation by using the GROUPING SETS, CUBE, and ROLLUP keywords.

Grouping sets

GROUPING SETS implements advanced multiple GROUP BY operations against the same set of data. Actually, GROUPING SETS are a shorthand way of connecting several GROUP BY result sets with UNION ALL. The GROUPING SETS keyword completes all processes in a single stage of the job, which is more efficient. A blank set () in the GROUPING SETS clause calculates the overall aggregation. The following are a few examples to show the equivalence of GROUPING SETS. For better understanding, we can say that the outer level (brace) of GROUPING SETS defines what data UNION ALL is to be implemented. The inner level (brace...

Aggregation condition

Since v0.7.0, HAVING has been added to support the conditional filtering of aggregation results directly. By using HAVING, we can avoid using a subquery after the GROUP BY statement. See the following example:

> SELECT 
> gender_age.age
> FROM employee
> GROUP BY gender_age.age
> HAVING count(*)=1;
+----------------+
| gender_age.age |
+----------------+
| 27 |
| 30 |
| 35 |
| 57 |
+----------------+
4 rows selected (25.829 seconds)

> SELECT
> gender_age.age,
> count(*) as cnt -- Support use column alias in HAVING, like ORDER BY
> FROM employee
> GROUP BY gender_age.age HAVING cnt=1;
+----------------+-----+
| gender_age.age | cnt |
+----------------+-----+
| 27 | 1 |
| 30 | 1 |
| 35 | 1 |
| 57 | 1 |
+----------------+-----+
4 rows selected (25...

Window functions

Window functions, available since Hive v0.11.0, are a special group of functions that scan multiple input rows to compute each output value. Window functions are usually used with OVER, PARTITION BY, ORDER BY, and the windowing specification. Different from the regular aggregate functions used with the GROUP BY clause, and limited to one result value per group, window functions operate on windows where the input rows are ordered and grouped using flexible conditions expressed through an OVER and PARTITION clause. Window functions give aggregate results, but they do not group the result set. They return the group value multiple times with each record. Window functions offer great flexibility and functionalities compared term the regular GROUP BY clause and make special aggregations by HQL easier and more powerful. The syntax for a window function is as follows...

Sampling

When the data volume is extra large, we may need to find a subset of data to speed up data analysis. This is sampling, a technique used to identify and analyze a subset of data in order to discover patterns and trends in the whole dataset. In HQL, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.

Random sampling

Random sampling uses the rand() function and LIMIT keyword to get the sampling of data, as shown in the following example. The DISTRIBUTE and SORT keywords are used here to make sure the data is also randomly distributed among mappers and reducers efficiently. The ORDER BY rand() statement can also achieve the same purpose, but the performance is not good:

...

Summary

In this chapter, we covered how to aggregate data using basic aggregation functions. Then, we introduced advanced aggregations with GROUPING SETS, ROLLUP, and CUBE, as well as aggregation conditions using HAVING. We also covered the various window functions. At the end of the chapter, we introduced three ways of sampling data. After going through this chapter, you should be able to do basic and advanced aggregations and data sampling in HQL. In the next chapter, we'll talk about performance considerations in Hive.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Apache Hive Essentials. - Second Edition
Published in: Jun 2018Publisher: PacktISBN-13: 9781788995092
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Dayong Du

Dayong Du has all his career dedicated to enterprise data and analytics for more than 10 years, especially on enterprise use case with open source big data technology, such as Hadoop, Hive, HBase, Spark, etc. Dayong is a big data practitioner as well as author and coach. He has published the 1st and 2nd edition of Apache Hive Essential and coached lots of people who are interested to learn and use big data technology. In addition, he is a seasonal blogger, contributor, and advisor for big data start-ups, co-founder of Toronto big data professional association.
Read more about Dayong Du