Extensibility Considerations

Although Hive provides many built-in functions, in special use cases users may need power beyond what is available out of the box. In such cases, we can extend Hive's functionality in four main areas:

  • User-defined function (UDF): This provides a way to extend functionality with an external function (mainly written in Java) that can be evaluated in HQL
  • HPL/SQL: This provides procedural-language programming support for HQL
  • Streaming: This plugs a user's own customized programs into the data stream
  • SerDe: This stands for serialization and deserialization and provides a way to serialize or deserialize data with customized file formats

In this chapter, we'll talk about each of them in more detail.

User-defined functions

User-defined functions provide a way to apply the user's own application/business logic to column values during an HQL query. For example, a user-defined function could perform feature cleaning with an external machine-learning library, authenticate user access against other services, merge several values into one or many, or perform special data encoding or encryption, among other operations outside the scope of the regular HQL operators and functions. Hive defines the following three types of user-defined functions, all of which are extensible (a registration sketch follows the list):

  • UDF: It stands for User-Defined Function, which operates row-wise and outputs one result for one row, such as most built-in mathematics and string functions.
  • UDAF: It stands for User-Defined Aggregating Function, which operates row-wise or group-wise and outputs one row for the whole table or one row for each group, such as the MAX and COUNT built-in aggregating functions.
  • UDTF: It stands for User-Defined Table-Generating Function, which also operates row-wise but produces multiple rows (a table) as a result, such as the EXPLODE built-in function.
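
Once such a function is implemented in Java and packaged as a JAR, it can be registered and called from HQL. The following is a minimal sketch; the JAR path /tmp/my-udf.jar and the class com.example.udf.ToUpper are hypothetical placeholders, not part of the book's examples:

-- add the JAR to the session classpath (hypothetical path)
ADD JAR /tmp/my-udf.jar;

-- register the Java class as a session-scoped function (hypothetical class)
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.udf.ToUpper';

-- call it like any built-in function
SELECT to_upper(name) FROM employee;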

HPL/SQL

Since Hive v2.0.0, HPL/SQL (Procedural SQL on Hadoop, http://www.hplsql.org/) has been available to provide stored-procedure programming in Hive. HPL/SQL supports Hive, Spark SQL, and Impala, and is compatible with the Oracle, DB2, MySQL, and T-SQL dialects. One of its benefits is making the migration of existing database stored procedures to Hive easy and efficient. Using HPL/SQL does not require the Java skills needed to implement the UDFs mentioned above. Compared with UDFs, however, HPL/SQL performs a little more slowly, and it is still new for production usage.

The following is an example of creating a stored procedure. HPL/SQL supports the creation of both Function and Procedure:

$ cat getEmpCnt.pl
CREATE PROCEDURE getCount()
BEGIN
  DECLARE cnt INT = 0;
  SELECT COUNT(*) INTO cnt FROM employee;
  PRINT 'Users cnt: ' || cnt;
END;

call getCount(); -- Call a procedure

In order...
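
As a minimal sketch of execution, assuming the hplsql command-line tool bundled with Hive v2.0.0+ is on the path, the file above can be run as follows:

$ hplsql -f getEmpCnt.pl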

Streaming

Hive can also leverage the streaming feature in Hadoop to transform data in an alternative way. The streaming API opens an I/O pipe to an external process, such as a script. The process then reads data from standard input and writes results to standard output. In HQL, we can use the TRANSFORM clause directly to embed mapper and reducer scripts written in commands, shell scripts, Java, or other programming languages. Although streaming incurs overhead from serialization/deserialization between processes, it offers a simple coding model for non-Java developers. The syntax of the TRANSFORM clause is as follows:

FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'map_user_script'
    (AS colName (',' colName)*)?
    (outRowFormat)? (outRecordReader...
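
As a concrete sketch, the identity command /bin/cat can stand in for a user script, passing each row through unchanged; any executable that reads stdin and writes stdout could be substituted after shipping it to the cluster with ADD FILE (the column name below assumes the employee table used throughout the book):

-- ADD FILE /tmp/my_script.py;   -- hypothetical custom script, shipped first
SELECT TRANSFORM(name)
USING '/bin/cat'                 -- identity transform; replace with the script
AS (name_out)
FROM employee;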

SerDe

SerDe stands for Serialization and Deserialization. It is the technology used to process records and map them to column data types in Hive tables. To explain the scenario of using SerDe, we need to understand how Hive reads and writes data first.

The process to read data is as follows.

  1. Data is read from HDFS.
  2. Data is processed by the INPUTFORMAT implementation, which defines the input data split and key/value records. In Hive, we can use CREATE TABLE ... STORED AS <FILE_FORMAT> (see Chapter 9, Performance Considerations) to specify which INPUTFORMAT it reads from.
  3. The Java Deserializer class defined in SerDe is called to format the data into a record that maps to the columns and data types of a table.
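
As a minimal sketch of step 2, the file format (and therefore the INPUTFORMAT) is chosen at table-creation time; the table and column names here are illustrative:

CREATE TABLE employee_orc (
  name STRING,
  age  INT
)
STORED AS ORC;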

As an example of reading data, we can use the JSON SerDe to read TEXTFILE-format data from HDFS and translate each JSON attribute and value in a row to the columns of a row in a Hive table...
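
A minimal sketch using the JsonSerDe class shipped with hive-hcatalog follows; the table and columns are illustrative, and the JAR path varies by installation:

-- make the SerDe class available to the session (path varies by installation)
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

CREATE TABLE employee_json (
  name   STRING,
  skills ARRAY<STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

-- each line of the underlying file holds one JSON object, for example:
-- {"name": "Michael", "skills": ["DB", "SQL"]}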

Summary

In this chapter, we introduced the four main areas in which Hive's functionality can be extended. We also covered the three kinds of user-defined functions, as well as their coding templates and deployment steps, to guide the coding and deployment process. Then, we introduced HPL/SQL, which adds procedural-language programming to HQL. In addition, we talked about streaming as a way to plug in your own code, which does not have to be Java code. At the end of the chapter, we discussed the available SerDes for parsing different formats of data files when reading or writing data. After going through this chapter, you should be able to write basic UDFs and HPL/SQL, plug code into streams, and use the available SerDes in Hive.

In the next chapter, we'll talk about security considerations.
