You're reading from Apache Hive Essentials

Product type: Book
Published in: Feb 2015
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783558575
Edition: 1st
Author: Dayong Du

Dayong Du has dedicated his career of more than 10 years to enterprise data and analytics, especially enterprise use cases of open source big data technologies such as Hadoop, Hive, HBase, and Spark. Dayong is a big data practitioner as well as an author and coach. He has published the 1st and 2nd editions of Apache Hive Essentials and has coached many people interested in learning and using big data technology. In addition, he is a seasoned blogger, a contributor and advisor for big data start-ups, and a co-founder of the Toronto big data professional association.

Chapter 8. Extensibility Considerations

Although Hive has many built-in functions, users sometimes need power beyond what the built-in functions provide. For these cases, Hive offers three main areas where its functionality can be extended:

  • User-defined function (UDF): This provides a way to extend functionality with an external function (mainly written in Java) that can be evaluated in HQL

  • Streaming: This plugs users' own customized mapper and reducer programs into the data stream

  • SerDe: This stands for serializers and deserializers and provides a way to serialize or deserialize data in custom file formats stored on HDFS

In this chapter, we'll talk about each of them in more detail.

User-defined functions


Hive defines the following three types of UDF:

  • UDFs: These are regular user-defined functions that operate row-wise and output one result per row, such as most built-in mathematical and string functions.

  • UDAFs: These are user-defined aggregating functions that operate on groups of rows and output one result for the whole set or one row per group, such as the built-in MAX and COUNT functions.

  • UDTFs: These are user-defined table-generating functions that also operate row-wise, but produce multiple rows as a result, such as the EXPLODE function. A UDTF can be used either after SELECT or in a LATERAL VIEW statement.

    Note

    Because Hive is implemented in Java, UDFs are normally written in Java as well. However, since Java supports running code in other languages through the javax.script API (see http://docs.oracle.com/javase/6/docs/api/javax/script/package-summary.html), UDFs can also be written in languages other than Java. In this book, we focus only on Java UDFs.
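To make the three behaviors concrete, here is a minimal sketch in plain Python (not Hive's Java UDF API; the function names are illustrative only) of how each type maps input rows to output rows:

```python
# Illustrative sketch of the three UDF behaviors (not Hive's Java API).

# UDF: row-wise, one output value per input row (like the built-in upper()).
def my_upper(value):
    return value.upper()

# UDAF: aggregates many rows into one result (like the built-in count()).
def my_count(rows):
    total = 0
    for _ in rows:
        total += 1
    return total

# UDTF: one input value can produce multiple output rows (like explode()).
def my_explode(array_value):
    for item in array_value:
        yield item

rows = ["a", "b", "c"]
print([my_upper(r) for r in rows])   # one output per row
print(my_count(rows))                # one output for the whole group
print(list(my_explode(["x", "y"])))  # multiple rows from one value
```

The key distinction is the cardinality of input to output: one-to-one for UDFs, many-to-one for UDAFs, and one-to-many for UDTFs.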

We'll start...

Streaming


Hive can also leverage the streaming feature in Hadoop to transform data in an alternative way. The streaming API opens an I/O pipe to an external process (script). The process then reads data from the standard input and writes results through the standard output. In Hive, we can use the TRANSFORM clause in HQL directly to embed mapper and reducer scripts written as commands, shell scripts, Java, or other programming languages. Although streaming introduces overhead from serialization and deserialization between processes, it offers a simpler coding model for developers, especially non-Java developers. The syntax of the TRANSFORM clause is as follows:

FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'map_user_script'
    (AS colName (',' colName)*)?
    (outRowFormat)? (outRecordReader)?
    (CLUSTER BY?|DISTRIBUTE BY? SORT BY?) src_alias
 )
 SELECT TRANSFORM '(' expression (',' expression)* ')'
 (inRowFormat)?
 USING...
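As a sketch of what 'map_user_script' might look like, the following Python script follows the streaming convention described above: it reads tab-separated columns from standard input and writes transformed tab-separated rows to standard output. The (name, score) column layout is an assumption for illustration:

```python
#!/usr/bin/env python3
# A hypothetical streaming mapper for use with TRANSFORM: it reads
# tab-separated (name, score) rows from stdin and emits the name in
# upper case with the score incremented by one.
import sys


def transform_line(line):
    # Hive sends each row as tab-separated columns ending in a newline.
    name, score = line.rstrip("\n").split("\t")
    return "%s\t%d" % (name.upper(), int(score) + 1)


if __name__ == "__main__":
    for line in sys.stdin:
        if line.strip():
            print(transform_line(line))
```

Such a script would be shipped to the cluster with ADD FILE and then referenced in the USING clause, for example USING 'python mapper.py'.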

SerDe


SerDe stands for Serializer and Deserializer. It is the technology that Hive uses to process records and map them to column data types in Hive tables. To explain the scenarios where a SerDe is used, we need to understand how Hive reads and writes data.

The process to read data is as follows:

  1. Data is read from HDFS.

  2. Data is processed by the INPUTFORMAT implementation, which defines the input data split and key/value records. In Hive, we can use CREATE TABLE ... STORED AS <FILE_FORMAT> (see Chapter 7, Performance Considerations, for available file formats) to specify which INPUTFORMAT it reads from.

  3. The Java Deserializer class defined in the SerDe is called to format the data into a record that maps to the columns and data types of a table.

As an example of reading data, we can use the JSON SerDe to read TEXTFILE format data from HDFS and translate each row of JSON attributes and values into a row in a Hive table with the correct schema.
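Conceptually, the deserialization step can be sketched in Python as follows (this stands in for the Java Deserializer class; the two-column schema is an assumption for illustration):

```python
# Sketch of what a JSON deserializer does conceptually: turn one raw
# line of JSON text into a record matching a hypothetical table schema
# (name STRING, age INT). Hive's real JSON SerDe does this in Java.
import json

SCHEMA = ["name", "age"]  # assumed column order of the target table


def deserialize(raw_line):
    obj = json.loads(raw_line)
    # Map JSON attributes to columns; missing attributes become NULL (None).
    return tuple(obj.get(col) for col in SCHEMA)


print(deserialize('{"name": "alice", "age": 30}'))  # ('alice', 30)
print(deserialize('{"name": "bob"}'))               # ('bob', None)
```

The serialization direction is the mirror image: a record's column values are rendered back into the file format's byte representation before being written to HDFS.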

The process to write data is as follows:

  1. Data (such as using an INSERT...

Summary


In this chapter, we introduced the three main areas for extending Hive's functionality. We also covered the three types of user-defined functions in Hive, as well as the coding template and deployment steps that guide your coding and deployment practice. Then, we talked about streaming in Hive as a way to plug in your own code, which does not have to be Java code. At the end of the chapter, we discussed the SerDes available in Hive for parsing different data file formats when reading or writing data. After going through this chapter, we should be able to write basic UDFs, plug code into streaming, and use the available SerDes in Hive.

In the next chapter, we'll talk about security considerations for Hive.

