Chapter 5. Data Manipulation

The ability to manipulate data is a critical capability in big data analysis. Data manipulation is the process of exchanging, moving, sorting, and transforming data. It is used in many situations, such as cleaning data, searching for patterns, identifying trends, and so on. Hive offers various query statements, keywords, operators, and functions for data manipulation.

In this chapter, we will cover the following topics:

  • Data exchange using LOAD, INSERT, IMPORT, and EXPORT

  • Order and sort

  • Operators and functions

  • Transactions

Data exchange – LOAD


Hive uses the LOAD keyword to move data. Move here means that the original data is moved to the target table/partition and no longer exists in the original location. The following shows how to move data into a Hive table or partition from local or HDFS files. The LOCAL keyword specifies that the files are located on the local host. If LOCAL is not specified, the files are loaded from the full Uniform Resource Identifier (URI) specified after INPATH or, by default, from the location given by the fs.default.name property. The path after INPATH can be relative or absolute, and it can point to either a single file or a folder (all files in the folder), but subfolders are not allowed in the specified path. If the data is loaded into a partitioned table, the partition column values must be specified. The OVERWRITE keyword decides whether to replace the existing data in the target table/partition; without it, the data is appended.

The following are the examples...
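
The examples themselves are truncated in this excerpt, so here is a minimal sketch of typical LOAD statements; the file paths, the employee_partitioned table, and its partition columns are illustrative assumptions rather than the book's exact examples:

-- Load a local file into a table, replacing any existing data
LOAD DATA LOCAL INPATH '/tmp/employee.txt'
OVERWRITE INTO TABLE employee;

-- Load a file already in HDFS into a partitioned table;
-- the partition column values must be specified
LOAD DATA INPATH '/user/hive/input/employee.txt'
INTO TABLE employee_partitioned
PARTITION (year=2018, month=12);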

Data exchange – INSERT


To extract data from Hive tables/partitions, we can use the INSERT keyword. As in an RDBMS, Hive supports inserting data by selecting it from other tables. This is a very common way to populate a table from existing data. The basic INSERT statement has the same syntax as a relational database's INSERT. However, Hive extends its INSERT statement by supporting OVERWRITE, multi-insert, dynamic partition insert, as well as INSERT to files. The following are a few examples:

  • The following is a regular INSERT from a SELECT statement:

    --Check the target table, which is empty.
    jdbc:hive2://> SELECT name, work_place, sex_age 
    . . . . . . .> FROM employee;
    +-------------+-------------------+----------------+
    |employee.name|employee.work_place|employee.sex_age|
    +-------------+-------------------+----------------+
    +-------------+-------------------+----------------+
    No rows selected (0.115 seconds)
    
    --Populate data from SELECT
    jdbc:hive2://> INSERT INTO...
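
The statement above is truncated; as a minimal sketch, a complete INSERT ... SELECT takes the following shape, where the source table ca_employee is an illustrative assumption:

-- Populate the empty target table from another table
INSERT INTO TABLE employee
SELECT * FROM ca_employee;

-- Use OVERWRITE to replace existing data instead of appending
INSERT OVERWRITE TABLE employee
SELECT * FROM ca_employee;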

Data exchange – EXPORT and IMPORT


When working with Hive, we sometimes need to migrate data among different environments, or we may need to back up some data. Since Hive 0.8.0, the EXPORT and IMPORT statements have been available to support moving data in and out of HDFS for data migration or backup/restore purposes.

The EXPORT statement will export both data and metadata from a table or partition. Metadata is exported in a file called _metadata. Data is exported in a subdirectory called data:

jdbc:hive2://> EXPORT TABLE employee TO '/user/dayongd/output3';
No rows affected (0.19 seconds)

After EXPORT, we can manually copy the exported files to other Hive instances or use the Hadoop distcp command to copy them to other HDFS clusters. Then, we can import the data in the following manner:

  • Import data into a table with the same name. An error is thrown if the table already exists:

    jdbc:hive2://> IMPORT FROM '/user/dayongd/output3';
    Error: Error while compiling statement: FAILED: SemanticException [Error 10119...
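
Since the error output is truncated above, here is a hedged sketch of the usual workarounds: importing under a new table name, or importing as an external table at a chosen location (the table names and paths below are illustrative):

-- Import the exported data into a table with a new name
IMPORT TABLE employee_imported FROM '/user/dayongd/output3';

-- Import as an external table pointing at a specific location
IMPORT EXTERNAL TABLE employee_imported_external
FROM '/user/dayongd/output3'
LOCATION '/user/dayongd/employee';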

ORDER and SORT


Another aspect of manipulating data in Hive is properly ordering or sorting the data or result sets to clearly identify important facts, such as the top N values, the maximum, the minimum, and so on.

Hive uses the following keywords to order and sort data:

  • ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. It performs a global sort using only one reducer, so a sorted order is maintained across all of the output, but it takes longer to return the result. For this reason, using LIMIT with ORDER BY is strongly recommended. When hive.mapred.mode = strict is set (the default is hive.mapred.mode = nonstrict) and LIMIT is not specified, an exception is thrown. It can be used as follows:

    jdbc:hive2://> SELECT name FROM employee ORDER BY NAME DESC;
    +----------+
    |   name   |
    +----------+
    | Will     |
    | Shelley  |
    | Michael  |
    | Lucy     |
    +----------+
    4 rows selected (57.057 seconds)
    
  • SORT BY (ASC|DESC): This indicates which columns to sort when ordering...
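
To illustrate the difference from ORDER BY, the following sketch sets more than one reducer so that SORT BY only guarantees order within each reducer's output rather than a total order; the reducer count and query are illustrative:

-- With two reducers, rows are sorted within each reducer only
SET mapred.reduce.tasks = 2;

SELECT name FROM employee SORT BY name DESC;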

Operators and functions


To further manipulate data, we can also use expressions, operators, and functions in Hive to transform data. The Hive wiki (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) offers specifications for each expression and function, so we will not repeat all of them here; this chapter covers only a few important usages and tips.

Hive defines relational operators, arithmetic operators, logical operators, complex type constructors, and complex type operators. The relational, arithmetic, and logical operators are similar to standard operators in SQL/Java, so we do not repeat them in this chapter. Operators on complex data types were introduced in the Understanding Hive data types section of Chapter 3, Data Definition and Description, as well as in the dynamic partition insert example in this chapter.
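
As a brief reminder, the following sketch shows the complex type operators, assuming (as in the book's sample employee table) that work_place is an ARRAY, sex_age is a STRUCT, and skills_score is a MAP:

-- [] indexes an array, . accesses a struct field,
-- and [key] looks up a map value
SELECT
  work_place[0]      AS first_city,
  sex_age.sex        AS sex,
  skills_score['DB'] AS db_score
FROM employee;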

The functions in Hive are categorized as follows:

  • Mathematical functions: These functions are mainly used to perform...
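
As a quick illustration of the kinds of functions listed here, the following sketch applies a few standard built-in functions to literal values; the literals are arbitrary and the query simply runs against the sample employee table:

-- Mathematical, string, collection, and conditional functions
SELECT
  round(3.1415, 2)           AS rounded,         -- 3.14
  abs(-7)                    AS absolute,        -- 7
  concat('Hive', ' ', 'SQL') AS joined,          -- 'Hive SQL'
  upper('hive')              AS uppercase,       -- 'HIVE'
  size(array(1, 2, 3))       AS arr_size,        -- 3
  coalesce(NULL, 'default')  AS first_non_null   -- 'default'
FROM employee LIMIT 1;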

Transactions


Before version 0.13.0, Hive did not support row-level transactions. As a result, there was no way to update, insert, or delete individual rows of data; data could only be overwritten at the table or partition level. This made Hive ill-suited to concurrent read/write and data-cleaning use cases.

Since version 0.13.0, Hive supports row-level transactions by offering full Atomicity, Consistency, Isolation, and Durability (ACID) semantics. For now, all transactions are autocommitted, and they only support data in the Optimized Row Columnar (ORC) file format (available since Hive 0.11.0) in bucketed tables.

The following configuration parameters must be set appropriately to turn on transaction support in Hive:

SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker...
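
With those parameters set, a minimal sketch of a transactional table and row-level operations looks like the following; the table definition is an illustrative assumption (transactional tables must be bucketed and stored as ORC), and UPDATE/DELETE additionally require Hive 0.14.0 or later:

-- A bucketed ORC table flagged as transactional (illustrative)
CREATE TABLE employee_txn (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level operations then become available
INSERT INTO TABLE employee_txn VALUES (1, 'Will'), (2, 'Lucy');
UPDATE employee_txn SET name = 'William' WHERE id = 1;
DELETE FROM employee_txn WHERE id = 2;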

Summary


In this chapter, we covered how to exchange data between Hive and files using the LOAD, INSERT, IMPORT, and EXPORT keywords. Then, we introduced the different Hive ordering and sorting options, along with some commonly used tips for Hive functions. Finally, we gave an overview of the row-level transactions newly supported since Hive 0.13.0. After going through this chapter, we should be able to import and export data to and from Hive, and be comfortable using the different ordering and sorting keywords, Hive functions, and transactions.

In the next chapter, we'll look at the different ways of carrying out data aggregations and sampling in Hive.
