Using Hive non-interactively (Simple)

by Darren Lee | June 2013 | Open Source

So far, we have only used Hive through its interactive console. In this article by Darren Lee, author of Instant Apache Hive Essentials How-to, we will learn how Hive supports use cases such as periodic ETL jobs by rerunning the top athletes query in batch mode from the command line. This support allows us to use external tools to start Hive jobs and capture their output.


Getting ready

Here are the initial steps to be followed:

  1. Create the init.hql file in the current directory. This file sets up the database, specifies our Hive settings, and creates our input table. (A sketch of the input file's row format follows these steps.)

    -- create and select the working database
    create database if not exists ch3;
    use ch3;

    -- let Hive run small jobs locally rather than on the cluster
    set hive.exec.mode.local.auto=true;

    -- input table: one athlete per row, tab-delimited fields,
    -- comma-delimited array elements
    create table if not exists athlete(
    name string,
    id string,
    demonstration_events_competed_in array<string>,
    demonstration_medals_won array<string>,
    country array<string>,
    medals_won array<string>)
    row format delimited
    fields terminated by '\t'
    collection items terminated by ',';

    -- (re)populate the table from the local filesystem
    load data
    local inpath 'data/olympic_athlete.tsv'
    overwrite into table athlete;

  2. Create the script.hql file in the current directory. This file creates our output table and executes a query.

    -- output table for the query results
    create table if not exists top_athletes(
    name string,
    num_medals int);

    -- ${threshold} is substituted from the command line via --hivevar
    insert overwrite table top_athletes
    select name, size(medals_won) as num_medals
    from athlete
    where size(medals_won) >= ${threshold}
    order by num_medals desc, name asc;
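
For reference, each line of the data/olympic_athlete.tsv file loaded in step 1 describes one athlete, with the six fields separated by tabs and the elements of each array separated by commas. A purely hypothetical row (the real data ships with the book's code bundle) would look like the following, with <TAB> standing in for a literal tab character:

    Jane Doe<TAB>jdoe01<TAB>Demo Event<TAB>Demo Gold<TAB>United States<TAB>2000 Gold,2004 Silver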

How to do it...

Follow these steps to complete the task:

  1. We start by running our initialization and query scripts from the command line:

    $ hive -v --hivevar threshold=10 -i init.hql -f script.hql

  2. We write the header of our top_athletes table to a file using the standard cut and paste Unix tools:

    $ hive -S -e "use ch4; describe top_athletes;" | cut -f 1 | paste
    -s - > output.tsv

  3. We write the data from the top_athletes table to our output file by executing explicit SQL statements:

    $ hive -S -e "use ch4; select * from top_athletes ;" >> output.tsv

  4. We can verify the contents of our file by simply using the cat command:

    $ cat output.tsv
    name num_medals
    Michael Phelps 22
    Larissa Latynina 14
    Jenny Thompson 12
    Nikolai Andrianov 12
    Matt Biondi 11
    Ole Einar Bjørndalen 11
    Ryan Lochte 11
    Boris Shakhlin 10
    Takashi Ono 10

How it works...

From the command line, Hive supports three mutually exclusive modes. In addition to the interactive mode that we have used in previous sections, we can pass commands to Hive using two different flags.

Flag      Argument          Effect
-f        filename          Execute the contents of the file
-e        SQL statements    Execute the argument as input
(none)    (none)            Run Hive interactively

For all three of these cases, Hive allows using the -i filename parameter for passing an initialization script. This script will be executed first, then the session will continue with the file contents, explicit statements, or interactive session.

Hive supports using the -i flag multiple times, so we can make our initialization scripts even more reusable by splitting them according to their functionality. For example, we could have one initialization script with common settings used by all jobs running on a particular cluster and a second initialization script for all jobs that use particular table definitions.
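
For instance, a command along the following lines (the file names here are hypothetical, not part of this recipe) would apply cluster-wide settings and shared table definitions before running the job's own script:

$ hive -i cluster_settings.hql -i shared_tables.hql -f script.hql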

In this example, we first use our two files to create the input and output tables and run a query. By using initialization scripts, we can separate the context of our query (the database, data loading, and Hive configuration) from the logic and output. Different processes can share the same context without needing to duplicate the contents of the initialization file across all of their scripts.

We also make our script.hql file reusable with different thresholds through variable substitution. Hive will automatically replace any occurrence of ${variable} with the value passed to the hive command via the --hivevar parameter. The -d and --define parameters are synonyms for --hivevar, and all of these parameters may be specified multiple times if necessary.
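
For instance, each of the following invocations substitutes 10 for ${threshold} in script.hql:

$ hive --hivevar threshold=10 -f script.hql
$ hive --define threshold=10 -f script.hql
$ hive -d threshold=10 -f script.hql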

The -v flag simply puts Hive into verbose mode, so each statement is echoed to the console as it is executed. Combining the verbose flag, variable substitution, and our scripts gives us the first shell command we executed:

$ hive -v --hivevar threshold=10 -i init.hql -f script.hql

We then execute two explicit commands, first to describe the columns of the top_athletes table, and then to output its contents. These are redirected to our output file on the local filesystem.

The -S flag puts Hive into silent mode, so only the output of our queries will be written to the files. This helps us capture only the table contents to our output file.

$ hive -S -e "use ch3; describe top_athletes;" | cut -f 1 | paste -s - >
output.tsv

$ hive -S -e "use ch3; select * from top_athletes ;" >> output.tsv

In Hive versions 0.10.0 and later, we can use the --database ch3 flag instead of specifying use ch3; as part of the query. Alternatively, we could refer to the table by its fully qualified name, ch3.top_athletes.
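
For example, on Hive 0.10.0 or later, the second command could be rewritten as:

$ hive -S --database ch3 -e "select * from top_athletes;" >> output.tsv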

Summary

In this recipe, we learned how Hive supports use cases such as periodic ETL jobs by rerunning the top athletes query in batch mode from the command line.

About the Author

Darren Lee

Darren Lee studied mathematics at Stanford and New York University. He is currently the lead engineer on the data team at Bizo in San Francisco, CA. He has been using Hive since 2009 and occasionally blogs about it at http://dev.bizo.com.
