Reader small image

You're reading from  Apache Hive Essentials. - Second Edition

Product typeBook
Published inJun 2018
Reading LevelIntermediate
PublisherPackt
ISBN-139781788995092
Edition2nd Edition
Languages
Tools
Right arrow
Author (1)
Dayong Du
Dayong Du
author image
Dayong Du

Dayong Du has all his career dedicated to enterprise data and analytics for more than 10 years, especially on enterprise use case with open source big data technology, such as Hadoop, Hive, HBase, Spark, etc. Dayong is a big data practitioner as well as author and coach. He has published the 1st and 2nd edition of Apache Hive Essential and coached lots of people who are interested to learn and use big data technology. In addition, he is a seasonal blogger, contributor, and advisor for big data start-ups, co-founder of Toronto big data professional association.
Read more about Dayong Du

Right arrow

Data Correlation and Scope

This chapter is about how to discover data by projecting it, linking it, and limiting data ranges or scopes. The chapter mainly covers the syntax and usage of the SELECT, WHERE, LIMIT, JOIN, and UNION/UNION ALL statements to operate on datasets.

In this chapter, we will cover the following topics:

  • Projecting data with SELECT
  • Filtering data with conditions such as WHERE and LIMIT
  • Linking data with JOIN
  • Combining data with UNION

Project data with SELECT

The most common use case for Hive is to query data in Hadoop. To achieve this, we need to write and execute a SELECT statement. The typical work done by the SELECT statement is to project the whole row (with SELECT *) or specified columns (with SELECT column1, column2, ...) from a table, with or without conditions.Most simple SELECT statements will not trigger a Yarn job. Instead, a dump task is created just for dumping the data, such as the hdfs dfs -cat command. The SELECT statement is quite often used with the FROM and DISTINCT keywords. A FROM keyword followed by a table is where SELECT projects data. The DISTINCT keyword used after SELECT ensures only unique rows or combination of columns are returned from the table. In addition, SELECT also supports columns combined with user-defined functions, IF(), or a CASE WHEN THEN ELSE END statement, and regular...

Filtering data with conditions

It is quite common to narrow down the result set by using a condition clause, such as LIMIT, WHERE, IN/NOT IN, and EXISTS/NOT EXISTS. The LIMIT keyword limits the specified number of rows returned randomly. Compared with LIMIT, WHERE is a more powerful and generic condition clause to limit the returned result set by expressions, functions, and nested queries as in the following examples:

> SELECT name FROM employee LIMIT 2;
+----------+
| name |
+----------+
| Lucy |
| Michael |
+----------+
2 rows selected (71.125 seconds)


> SELECT name, work_place FROM employee WHERE name = 'Michael';
+----------+------------------------+
| name | work_place |
+----------+------------------------+
| Michael | ["Montreal","Toronto"] |
+----------+------------------------+
1 row selected (38.107 seconds)

-- All...

Linking data with JOIN

JOIN is used to link rows from two or more tables together. Hive supports most SQL JOIN operations, such as INNER JOIN and OUTER JOIN. In addition, HQL supports some special joins, such as MapJoin and Semi-Join too. In its earlier version, Hive only supported equal join. After v2.2.0, unequal join is also supported. However, you should be more careful when using unequal join unless you know what is expected, since unequal join is likely to return many rows by producing a Cartesian product of joined tables. When you want to restrict the output of a join, you should apply a WHERE clause after join as JOIN occurs before the WHERE clause. If possible, push filter conditions on the join conditions rather than where conditions to have data filtered earlier. What's more, all types of left/right joins are not commutative and always left/right associative, while...

Combining data with UNION

When we want to combine data with the same schema together, we often use set operations. Regular set operations in the relational database are INTERSECT, MINUS, and UNION/UNION ALL. HQL only supports UNION and UNION ALL. The difference between them is that UNION ALL does not remove duplicate rows while UNION does. In addition, all unioned data must have the same name and data type, or else an implicit conversion will be done and may cause a runtime exception. If ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT are used, they are applied to the whole result set after the union:

> SELECT a.name as nm FROM employee a
> UNION ALL -- Use column alias to make the same name for union
> SELECT b.name as nm FROM employee_hr b;
+-----------+
| nm |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
| Michael |
| Will |
| Steven...

Summary

In this chapter, you learned to use SELECT statements to project the data needed and filter data with WHERE, LIMIT, IN/EXISTS. Then, we introduced different joins to link datasets together, as well as the dataset operations UNION and UNION ALL. After going through this chapter, you should be able to use the SELECT statement with different WHERE conditions, LIMIT, DISTINCT, and complex subqueries. You should be able to understand and use different types of JOIN statements to link the different datasets horizontally or UNION them vertically.

In the next chapter, we will talk about the details of data exchanging, ordering, and transforming, as well as transactions in HQL.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Apache Hive Essentials. - Second Edition
Published in: Jun 2018Publisher: PacktISBN-13: 9781788995092
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Dayong Du

Dayong Du has all his career dedicated to enterprise data and analytics for more than 10 years, especially on enterprise use case with open source big data technology, such as Hadoop, Hive, HBase, Spark, etc. Dayong is a big data practitioner as well as author and coach. He has published the 1st and 2nd edition of Apache Hive Essential and coached lots of people who are interested to learn and use big data technology. In addition, he is a seasonal blogger, contributor, and advisor for big data start-ups, co-founder of Toronto big data professional association.
Read more about Dayong Du