Data Definition and Description

This chapter introduces the basic data types, data definition language, and schema in Hive to describe data. It also covers best practices to describe data correctly and effectively by using internal or external tables, partitions, buckets, and views. In this chapter, we will cover the following topics:

  • Understanding data types
  • Data type conversions
  • Data definition language
  • Databases
  • Tables
  • Partitions
  • Buckets
  • Views

Understanding data types

Hive data types are categorized into two groups: primitive and complex. STRING and INT are the most commonly used primitive types and are supported by most HQL functions. The details of the primitive types are as follows:

Primitive type | Description | Example
TINYINT | 1-byte signed integer, from -128 to 127. The postfix is Y. It is used for small-range numbers. | 10Y
SMALLINT | 2-byte signed integer, from -32,768 to 32,767. The postfix is S. It is used for regular descriptive numbers. | 10S
INT | 4-byte signed integer, from -2,147,483,648 to 2,147,483,647. | 10
BIGINT | 8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The postfix is L. | 100L
FLOAT | 4-byte single-precision floating-point number, from 1.40129846432481707e-45 to 3.40282346638528860e+38. | ...

Complex types, such as ARRAY, MAP, STRUCT, and UNIONTYPE, may contain a set of fields of any type and allow the nesting of types.
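
Complex types are easier to grasp with a short example. The following is a minimal sketch, assuming a hypothetical table whose columns mirror the employee examples used later in this chapter; it shows how ARRAY, STRUCT, and MAP columns can be declared and then accessed with [index], .field, and ['key'] respectively:

> -- Illustrative table and column names only
> CREATE TABLE employee_types_demo (
>   name STRING,
>   work_place ARRAY<STRING>,                   -- ordered collection, accessed by index
>   gender_age STRUCT<gender:STRING,age:INT>,   -- named fields, accessed with a dot
>   skills_score MAP<STRING,INT>                -- key/value pairs, accessed by key
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':';

> SELECT
>   work_place[0]      AS first_office,   -- ARRAY element by index
>   gender_age.gender  AS gender,         -- STRUCT field by name
>   skills_score['DB'] AS db_score        -- MAP value by key
> FROM employee_types_demo;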

Data type conversions

Similar to SQL, HQL supports both implicit and explicit type conversion. Primitive-type conversion from a narrower to a wider type is known as implicit conversion; the reverse conversion is not allowed. All the integral numeric types, FLOAT, and STRING can be implicitly converted to DOUBLE, and TINYINT, SMALLINT, and INT can all be converted to FLOAT. BOOLEAN types cannot be converted to any other type. A data type cross-table describing the allowed implicit conversions between every pair of types can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types. Explicit type conversion uses the CAST function with the CAST(value as TYPE) syntax. For example, CAST('100' as INT) converts the string '100' to the integer value 100. If the cast fails, such as CAST('INT' as INT), the function returns NULL.
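
As a brief illustration of both conversion paths, the following sketch uses literal values only:

> SELECT 3 + 2.0;                 -- implicit conversion: INT 3 is widened to DOUBLE, result 5.0
> SELECT CAST('100' AS INT) + 1;  -- explicit conversion, result 101
> SELECT CAST('INT' AS INT);      -- the cast fails, so NULL is returned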

Data Definition Language

Hive's Data Definition Language (DDL) is a subset of HQL statements that describe the Hive data structure by creating, deleting, or altering schema objects such as databases, tables, views, partitions, and buckets. Most DDL statements start with the CREATE, DROP, or ALTER keywords. The syntax of HQL DDL is very similar to SQL DDL. In the next section, we'll focus on the details of HQL DDL.

HQL uses -- to start a single-line comment. Multiline comments were not supported until v2.3.0; since v2.3.0, bracketed single-line or multiline comments enclosed between /* and */ can be used.
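
As a small illustration of both styles, assuming Hive v2.3.0 or later for the bracketed form:

> -- This is a single-line comment
> /* This is a bracketed comment,
>    which can span multiple lines */
> SELECT 1;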

Database

The database in Hive describes a collection of tables that are used for a similar purpose or belong to the same group. If no database is specified, the default database is used, and it uses /user/hive/warehouse in HDFS as its root directory. This path is configurable through the hive.metastore.warehouse.dir property in hive-site.xml. Whenever a new database is created, Hive creates a directory for it under /user/hive/warehouse. For example, the myhivebook database is located at /user/hive/warehouse/myhivebook.db. In addition, DATABASE has a name alias, SCHEMA, meaning they are the same thing in HQL. The following is the major DDL for database operations:

  1. Create the database/schema if it doesn't exist:
      > CREATE DATABASE myhivebook;
      > CREATE SCHEMA IF NOT EXISTS myhivebook;
  2. Create the database with the location, comments, and metadata... (a hedged sketch follows below)
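
The item above is truncated in this excerpt. A minimal sketch of such a statement follows; the comment text, location path, and property values are illustrative assumptions rather than values from the book:

> CREATE DATABASE IF NOT EXISTS myhivebook
> COMMENT 'hive database demonstration'          -- illustrative comment
> LOCATION '/hdfs/directory'                     -- illustrative HDFS path
> WITH DBPROPERTIES ('creator'='hive_user','date'='2018-06-01');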

Tables

The concept of a table in Hive is very similar to that of a table in a relational database. Each table maps to a directory, which by default is under /user/hive/warehouse in HDFS. For example, /user/hive/warehouse/employee is created for the employee table. All the data in the table is stored in this directory, which is fully managed by the hive user. This kind of table is called an internal, or managed, table. When data is already stored in HDFS, an external table can be created to describe it. It is called external because the data in the external table is specified by the LOCATION property rather than stored in the default warehouse directory. For an internal table, Hive fully manages the table's data; when an internal table is dropped, its data is deleted along with it. However, when an external table is dropped, the data is not deleted. It is quite common to...
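
A minimal sketch of the two flavors follows; the table names, columns, and HDFS path are illustrative assumptions:

> -- Internal (managed) table: data lives under /user/hive/warehouse/employee_internal
> CREATE TABLE employee_internal (
>   name STRING,
>   dept STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

> -- External table: data stays at the LOCATION path; DROP TABLE removes only the metadata
> CREATE EXTERNAL TABLE employee_external (
>   name STRING,
>   dept STRING
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
> LOCATION '/tmp/employee_data';                 -- illustrative HDFS path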

Partitions

By default, a simple HQL query scans the whole table. This slows down performance when querying a big table. The issue can be resolved by creating partitions, which are very similar to those in an RDBMS. In Hive, each partition corresponds to a predefined partition column (or columns) and maps to a subdirectory in the table's directory in HDFS. When the table is queried, only the required partitions (directories) of data are read, so the I/O and the time taken by the query are greatly reduced. Using partitions is an easy and effective way to improve performance in Hive.

The following is an example of partition creation in HQL:

> CREATE TABLE employee_partitioned (
> name STRING,
> work_place ARRAY<STRING>,
> gender_age STRUCT<gender:STRING,age:INT>,
> skills_score MAP<STRING,INT>,
> depart_title MAP<STRING...
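
The statement above is cut off in this excerpt. A hedged, self-contained sketch of what a complete version might look like follows; the exact column list and the year/month partition columns are assumptions based on the employee examples in this chapter and the reference to year and month in the Buckets section below:

> CREATE TABLE employee_partitioned (
>   name STRING,
>   work_place ARRAY<STRING>,
>   gender_age STRUCT<gender:STRING,age:INT>,
>   skills_score MAP<STRING,INT>,
>   depart_title MAP<STRING,ARRAY<STRING>>
> )
> PARTITIONED BY (year INT, month INT)    -- partition columns map to subdirectories
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> COLLECTION ITEMS TERMINATED BY ','
> MAP KEYS TERMINATED BY ':';

> -- Partitions can then be added explicitly (static partitioning)
> ALTER TABLE employee_partitioned
> ADD PARTITION (year=2018, month=6);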

Buckets

Besides partitions, buckets are another technique for clustering datasets into more manageable parts in order to optimize query performance. Different from a partition, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partitions. If there were a further request to use employee_id as a third level of partitioning, it would create many partition directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column is hashed into a user-defined number of buckets, and records with the same employee_id are always stored in the same bucket (segment of files). The bucket columns are defined with the CLUSTERED BY keywords. They are quite different from partition columns, since partition columns refer to directories, while bucket...
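
A hedged sketch of the bucketing syntax follows; the table name, columns, and bucket count are illustrative assumptions:

> -- Sketch: hash employee_id into 4 buckets (segments of files)
> CREATE TABLE employee_id_buckets (
>   name STRING,
>   employee_id INT,
>   work_place ARRAY<STRING>
> )
> CLUSTERED BY (employee_id) INTO 4 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

Note that on Hive versions before 2.x, SET hive.enforce.bucketing = true may be needed so that inserts actually populate the declared number of buckets; Hive 2.x enforces bucketing by default.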

Views

Views are logical data structures that can be used to simplify queries by hiding complexities such as joins, subqueries, and filters. They are called logical because views are only defined in the metastore, without any footprint in HDFS. Unlike in a relational database, views in HQL do not store data and do not get materialized. Once a view is created, its schema is frozen immediately; subsequent changes to the underlying tables (for example, adding a column) are not reflected in the view's schema. If an underlying table is dropped or changed, subsequent attempts to query the invalid view will fail. In addition, views are read-only and may not be used as the target of LOAD/INSERT/ALTER statements.

The following is an example of a view creation statement:

> CREATE VIEW IF NOT EXISTS employee_skills
> AS
> SELECT 
> name, skills_score['DB...
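
The statement above is truncated in this excerpt. A hedged sketch of a complete version, together with how such a view might be dropped, could look like the following; the map keys and the underlying employee table are illustrative assumptions:

> CREATE VIEW IF NOT EXISTS employee_skills
> AS
> SELECT
>   name,
>   skills_score['DB']     AS db_skill,
>   skills_score['Python'] AS python_skill
> FROM employee;

> -- Views are read-only; remove one with DROP VIEW
> DROP VIEW IF EXISTS employee_skills;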

Summary

In this chapter, we learned how to define and use the various data types in Hive, and how to create, alter, and drop databases, internal and external tables, partitions, buckets, and views.

In the next chapter, we will dive into the details of querying data in Hive.
