Chapter 5. Data Manipulation

The ability to manipulate data is a critical capability in big data analysis. Data manipulation is the process of exchanging, moving, sorting, and transforming data. It is used in many situations, such as cleaning data, searching for patterns, identifying trends, and so on. Hive offers various query statements, keywords, operators, and functions for data manipulation.

In this chapter, we will cover the following topics:

  • Data exchange using LOAD, INSERT, IMPORT, and EXPORT

  • Order and sort

  • Operators and functions

  • Transactions

Data exchange – LOAD


Hive uses the LOAD keyword to move data. Move here means that the original data is moved to the target table/partition and no longer exists in the original location. The following shows how to move data into a Hive table or partition from local or HDFS files. The LOCAL keyword specifies that the files are located on the local host. If LOCAL is not specified, the files are loaded from the full Uniform Resource Identifier (URI) specified after INPATH or, by default, from the location given by the fs.default.name property. The path after INPATH can be relative or absolute, and it can point to either a single file or a folder (all files in the folder), but subfolders are not allowed in the specified path. If the data is loaded into a partitioned table, the partition column values must be specified. The OVERWRITE keyword decides whether to replace the existing data in the target table/partition; without it, the data is appended.

The following are the examples...
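
The examples themselves are truncated in this excerpt, so here is a minimal sketch of typical LOAD statements; the file paths, the employee_partitioned table, and its partition columns are illustrative assumptions rather than the book's exact examples:

-- Load a local file into a table, replacing any existing data
LOAD DATA LOCAL INPATH '/tmp/employee.txt'
OVERWRITE INTO TABLE employee;

-- Load a file already in HDFS into a partitioned table;
-- the partition column values must be specified
LOAD DATA INPATH '/user/hive/input/employee.txt'
INTO TABLE employee_partitioned
PARTITION (year=2018, month=12);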

Data exchange – INSERT


To extract data from Hive tables/partitions, we can use the INSERT keyword. As in an RDBMS, Hive supports inserting data by selecting it from other tables. This is a very common way to populate a table from existing data. The basic INSERT statement has the same syntax as a relational database's INSERT. However, Hive extends its INSERT statement by supporting OVERWRITE, multi-insert, dynamic partition insert, as well as INSERT to files. The following are a few examples:

  • The following is a regular INSERT from a SELECT statement:

    --Check the target table, which is empty.
    jdbc:hive2://> SELECT name, work_place, sex_age 
    . . . . . . .> FROM employee;
    +-------------+-------------------+----------------+
    |employee.name|employee.work_place|employee.sex_age|
    +-------------+-------------------+----------------+
    +-------------+-------------------+----------------+
    No rows selected (0.115 seconds)
    
    --Populate data from SELECT
    jdbc:hive2://> INSERT INTO...
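
The statement above is truncated; as a minimal sketch, a complete INSERT ... SELECT takes the following shape, where the source table ca_employee is an illustrative assumption:

-- Populate the empty target table from another table
INSERT INTO TABLE employee
SELECT * FROM ca_employee;

-- Use OVERWRITE to replace existing data instead of appending
INSERT OVERWRITE TABLE employee
SELECT * FROM ca_employee;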

Data exchange – EXPORT and IMPORT


When working with Hive, we sometimes need to migrate data among different environments, or we may need to back up some data. Since Hive 0.8.0, the EXPORT and IMPORT statements have been available to support moving data in and out of HDFS for data migration or backup/restore purposes.

The EXPORT statement will export both data and metadata from a table or partition. Metadata is exported in a file called _metadata. Data is exported in a subdirectory called data:

jdbc:hive2://> EXPORT TABLE employee TO '/user/dayongd/output3';
No rows affected (0.19 seconds)

After EXPORT, we can manually copy the exported files to other Hive instances or use the Hadoop distcp command to copy them to other HDFS clusters. Then, we can import the data in the following manner:

  • Import data into a table with the same name. An error is thrown if the table already exists:

    jdbc:hive2://> IMPORT FROM '/user/dayongd/output3';
    Error: Error while compiling statement: FAILED: SemanticException [Error 10119...
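
Since the error output is truncated above, here is a hedged sketch of the usual workarounds: importing under a new table name, or importing as an external table at a chosen location (the table names and paths below are illustrative):

-- Import the exported data into a table with a new name
IMPORT TABLE employee_imported FROM '/user/dayongd/output3';

-- Import as an external table pointing at a specific location
IMPORT EXTERNAL TABLE employee_imported_external
FROM '/user/dayongd/output3'
LOCATION '/user/dayongd/employee';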

ORDER and SORT


Another aspect of manipulating data in Hive is properly ordering or sorting the data or result sets to clearly identify important facts, such as the top N values, the maximum, the minimum, and so on.

Hive uses the following keywords to order and sort data:

  • ORDER BY (ASC|DESC): This is similar to the RDBMS ORDER BY statement. It performs a global sort using only one reducer, so a sorted order is maintained across all of the output, but it takes longer to return the result. For this reason, using LIMIT with ORDER BY is strongly recommended. When hive.mapred.mode = strict is set (the default is hive.mapred.mode = nonstrict) and LIMIT is not specified, an exception is thrown. It can be used as follows:

    jdbc:hive2://> SELECT name FROM employee ORDER BY NAME DESC;
    +----------+
    |   name   |
    +----------+
    | Will     |
    | Shelley  |
    | Michael  |
    | Lucy     |
    +----------+
    4 rows selected (57.057 seconds)
    
  • SORT BY (ASC|DESC): This indicates which columns to sort when ordering...
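
To illustrate the difference from ORDER BY, the following sketch sets more than one reducer so that SORT BY only guarantees order within each reducer's output rather than a total order; the reducer count and query are illustrative:

-- With two reducers, rows are sorted within each reducer only
SET mapred.reduce.tasks = 2;

SELECT name FROM employee SORT BY name DESC;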

Operators and functions


To further manipulate data, we can also use expressions, operators, and functions in Hive to transform data. The Hive wiki (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) offers specifications for each expression and function, so we will not repeat all of them here; this chapter covers only a few important usages and tips.

Hive defines relational operators, arithmetic operators, logical operators, complex type constructors, and complex type operators. The relational, arithmetic, and logical operators are similar to standard operators in SQL/Java, so we do not repeat them in this chapter. Operators on complex data types were introduced in the Understanding Hive data types section of Chapter 3, Data Definition and Description, as well as in the dynamic partition insert example in this chapter.
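
As a brief reminder, the following sketch shows the complex type operators, assuming (as in the book's sample employee table) that work_place is an ARRAY, sex_age is a STRUCT, and skills_score is a MAP:

-- [] indexes an array, . accesses a struct field,
-- and [key] looks up a map value
SELECT
  work_place[0]      AS first_city,
  sex_age.sex        AS sex,
  skills_score['DB'] AS db_score
FROM employee;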

The functions in Hive are categorized as follows:

  • Mathematical functions: These functions are mainly used to perform...
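
As a quick illustration of the kinds of functions listed here, the following sketch applies a few standard built-in functions to literal values; the literals are arbitrary and the query simply runs against the sample employee table:

-- Mathematical, string, collection, and conditional functions
SELECT
  round(3.1415, 2)           AS rounded,         -- 3.14
  abs(-7)                    AS absolute,        -- 7
  concat('Hive', ' ', 'SQL') AS joined,          -- 'Hive SQL'
  upper('hive')              AS uppercase,       -- 'HIVE'
  size(array(1, 2, 3))       AS arr_size,        -- 3
  coalesce(NULL, 'default')  AS first_non_null   -- 'default'
FROM employee LIMIT 1;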

Transactions


Before version 0.13.0, Hive did not support row-level transactions. As a result, there was no way to update, insert, or delete individual rows of data; data could only be overwritten at the table or partition level. This made Hive ill-suited to concurrent read/write and data-cleaning use cases.

Since version 0.13.0, Hive supports row-level transactions by offering full Atomicity, Consistency, Isolation, and Durability (ACID) semantics. For now, all transactions are autocommitted, and they only support data in the Optimized Row Columnar (ORC) file format (available since Hive 0.11.0) in bucketed tables.

The following configuration parameters must be set appropriately to turn on transaction support in Hive:

SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker...
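
With those parameters set, a minimal sketch of a transactional table and row-level operations looks like the following; the table definition is an illustrative assumption (transactional tables must be bucketed and stored as ORC), and UPDATE/DELETE additionally require Hive 0.14.0 or later:

-- A bucketed ORC table flagged as transactional (illustrative)
CREATE TABLE employee_txn (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level operations then become available
INSERT INTO TABLE employee_txn VALUES (1, 'Will'), (2, 'Lucy');
UPDATE employee_txn SET name = 'William' WHERE id = 1;
DELETE FROM employee_txn WHERE id = 2;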

Summary


In this chapter, we covered how to exchange data between Hive and files using the LOAD, INSERT, IMPORT, and EXPORT keywords. Then, we introduced the different Hive ordering and sorting options, along with some commonly used tips for Hive functions. Finally, we gave an overview of the row-level transactions newly supported since Hive 0.13.0. After going through this chapter, we should be able to import and export data to and from Hive, and be comfortable using the different ordering and sorting keywords, Hive functions, and transactions.

In the next chapter, we'll look at the different ways of carrying out data aggregations and sampling in Hive.
