Data Manipulation

The ability to manipulate data is critical in big data analysis. Manipulating data means exchanging, moving, sorting, transforming, and modifying data. These techniques are used in many situations, such as cleaning data, searching for patterns, identifying trends, and so on. HQL offers various statements, keywords, operators, and functions for carrying out data manipulation.

In this chapter, we will cover the following topics:

  • Data exchange using LOAD, INSERT, IMPORT, and EXPORT
  • Data sorting
  • Functions
  • Transactions and locks

Data exchange with LOAD

To move data, Hive uses the LOAD statement. Move here means that the original data is moved to the target table/partition and no longer exists in the original place. The LOCAL keyword in the LOAD statement specifies that the files are located on the client host. If the LOCAL keyword is not specified, the files are loaded from the full Uniform Resource Identifier (URI) specified after INPATH (an HDFS path, most of the time) or, by default, from the value of the fs.default.name property defined in core-site.xml. The path after INPATH can be either relative or absolute. It points either to a file or to a folder (referring to all files in the folder) to be loaded, but subfolders are not allowed in the specified path. If the data is loaded into a partitioned table, the partition column must be specified. The OVERWRITE keyword is used to decide whether...
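The following is a minimal sketch of both forms of LOAD, in the beeline style used throughout this chapter (the file paths, the employee table, and the employee_partitioned table are illustrative assumptions):

> -- Load from the client host, replacing existing data
> LOAD DATA LOCAL INPATH '/tmp/employee.txt'
> OVERWRITE INTO TABLE employee;

> -- Load a file already in HDFS into one partition;
> -- the source file is moved, not copied
> LOAD DATA INPATH '/tmp/employee_hr.txt'
> INTO TABLE employee_partitioned
> PARTITION (year=2018, month=12);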

Data exchange with INSERT

To extract data from tables/partitions and populate other tables, we can use the INSERT keyword. Like other relational databases, Hive supports inserting data into a table by selecting data from another table. This is a very common ETL (a data warehousing term for Extract, Transform, and Load) pattern used to populate an existing or new table from another table or dataset. The HQL INSERT statement has the same syntax as a relational database's INSERT. However, HQL has improved its INSERT statement by supporting overwriting existing data, multi-insert, dynamic partition insert, as well as inserting data into files. The following are a few examples of INSERT statements in HQL:

  1. The following is a regular INSERT from a SELECT query:
> -- Check the target table, which is empty.
> SELECT name, work_place FROM employee;
+-------------+-------------------+
...
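Beyond the regular INSERT ... SELECT, a single scan of the source table can feed several targets at once. The following multi-insert sketch assumes two illustrative target tables, employee_backup and employee_names, that already exist with matching schemas:

> FROM employee
> INSERT OVERWRITE TABLE employee_backup
> SELECT *
> INSERT INTO TABLE employee_names
> SELECT name;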

Data exchange with [EX|IM]PORT

When working on data migration or release deployment, we may need to move data between different environments or clusters. In HQL, the EXPORT and IMPORT statements are available to move data between HDFS instances in different environments or clusters. The EXPORT statement exports both data and metadata from a table or partition. Metadata is exported in a file called _metadata. Data is exported in a subdirectory called data, as follows:

> EXPORT TABLE employee TO '/tmp/output5';
No rows affected (0.19 seconds)

> dfs -ls -R /tmp/output5/;
+--------------------------------+
| DFS Output |
+--------------------------------+
| ... /tmp/output5/_metadata |
| ... /tmp/output5/data |
| ... /tmp/output5/data/000000_0 |
+--------------------------------+
3 rows selected (0.014 seconds)
For EXPORT, the database name can be used...
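To complete the round trip on the target cluster, the exported directory can be imported into a new table. The table name below is an illustrative assumption; the path matches the EXPORT output above:

> IMPORT TABLE employee_imported FROM '/tmp/output5';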

Data sorting

Another aspect of manipulating data is properly sorting it in order to clearly identify important facts, such as the top N values, the maximum, the minimum, and so on. HQL supports the following keywords for data sorting:

  1. ORDER BY [ASC|DESC]: This is similar to the SQL ORDER BY statement. When using ORDER BY, a sorted order is maintained across all of the output from every reducer. It performs a global sort using only one reducer, so it takes longer to return the result. The direction specifier after ORDER BY can be either ASC for ascending (low to high) or DESC for descending (high to low). If no direction specifier is provided, ascending is used by default. Since v2.1.0, the ORDER BY statement supports specifying the sorting direction for NULL values with NULLS FIRST or NULLS LAST. By default, NULL values sort first in the ASC direction and last...
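A minimal sketch of the behavior described above, including explicit NULL placement (the column is illustrative; NULLS FIRST/NULLS LAST requires Hive v2.1.0 or later):

> SELECT name FROM employee
> ORDER BY name DESC NULLS LAST;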

Functions

To further manipulate data, we can also use operators, expressions, and functions in HQL to transform data. The Hive wiki (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) offers specifications for all supported expressions and functions, so we will not repeat them all here; instead, this chapter covers a few important tips.

Hive has defined relational operators, arithmetic operators, logical operators, complex type operators, and complex type constructors. The relational, arithmetic, and logical operators are similar to the standard operators in SQL/Java. Operators on complex data types were already introduced in the Understanding Hive data types section in Chapter 3, Data Definition and Description, as well as in the example of inserting data into dynamic partitions earlier in this chapter. Functions in HQL are categorized as follows...
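For instance, the following sketch combines a scalar string function with a collection function, assuming (as elsewhere in this book) that employee has a string name column and an array-typed work_place column:

> SELECT lower(name) AS name_lower,
> size(work_place) AS num_places
> FROM employee;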

Transactions and locks

ACID (Atomicity, Consistency, Isolation, and Durability) properties are the foundation of relational databases and have long been expected in Hive; they have been available since Hive v0.14.0. Full ACID support in Hive is implemented through row-level transactions and locks. This makes it possible for Hive to deal with use cases such as concurrent read/write, data cleaning, data modification, complex ETL/SCD (Slowly Changing Dimensions), streaming data ingest, bulk data merge, and so on. In this section, we'll introduce them in more detail.

Transactions

For now, all transactions in HQL are auto-committed, without support for BEGIN, COMMIT, and ROLLBACK as in relational databases. Also, the table that has...
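As a hedged configuration sketch, enabling transactions typically involves the following session settings and a bucketed ORC table with the transactional property (the table definition is illustrative; the property names come from the Hive transactions documentation):

> SET hive.support.concurrency = true;
> SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

> -- In Hive 2.x, transactional tables must be bucketed,
> -- stored as ORC, and flagged as transactional
> CREATE TABLE employee_txn (name string)
> CLUSTERED BY (name) INTO 2 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ('transactional' = 'true');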

Summary

In this chapter, we covered how to exchange data between tables and files using the LOAD, INSERT, IMPORT, and EXPORT keywords. Then, we introduced the different data ordering and sorting options. We also covered some commonly used tips for using functions. Finally, we provided an overview of row-level transactions, DELETE, UPDATE, MERGE, and locks. After going through this chapter, we should be able to import and export data with HQL, and be comfortable using the different data-sorting keywords, functions, and transaction statements.

In the next chapter, we'll look at the different ways of carrying out data aggregations and sampling in HQL.
