Using Hive non-interactively (Simple)

by Darren Lee | June 2013 | Open Source

So far, we have only used Hive through its interactive console. In this article by Darren Lee, author of Instant Apache Hive Essentials How-to [Instant], we will learn how Hive supports use cases such as periodic ETL jobs by rerunning the top athletes query in batch mode from the command line. This support allows us to use external tools to start Hive jobs and capture their output.


Getting ready

Here are the initial steps to be followed:

  1. Create the init.hql file in the current directory. This file sets up the database, specifies our Hive settings, and creates our input table.

    create database if not exists ch3;
    use ch3;
    set hive.exec.mode.local.auto=true;
    create table if not exists athlete(
        name string,
        id string,
        demonstration_events_competed_in array<string>,
        demonstration_medals_won array<string>,
        country array<string>,
        medals_won array<string>)
    row format delimited
    fields terminated by '\t'
    collection items terminated by ',';
    load data local inpath 'data/olympic_athlete.tsv' overwrite into table athlete;

  2. Create the script.hql file in the current directory. This file creates our output table and executes a query.

    create table if not exists top_athletes(
        name string,
        num_medals int);
    insert overwrite table top_athletes
    select name, size(medals_won) as num_medals
    from athlete
    where size(medals_won) >= ${threshold}
    order by num_medals desc, name asc;
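If you want to dry-run these scripts without the book's dataset, a one-row stand-in for the input file can be created first. The row below is made up for illustration; its fields are tab-separated and its array columns comma-separated, matching the table definition above.

```shell
# Hypothetical one-row stand-in for data/olympic_athlete.tsv so init.hql
# can be tested without the real dataset. The two empty fields stand in
# for the demonstration_* array columns.
mkdir -p data
printf 'Michael Phelps\tid1\t\t\tUnited States\tgold,gold,silver\n' > data/olympic_athlete.tsv
wc -l < data/olympic_athlete.tsv
# prints: 1
```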

How to do it...

Follow these steps to complete the task:

  1. We start by running our initialization and query scripts from the command line:

    $ hive -v --hivevar threshold=10 -i init.hql -f script.hql

  2. We write the header of our top_athletes table to a file using the standard cut and paste Unix tools:

    $ hive -S -e "use ch3; describe top_athletes;" | cut -f 1 | paste -s - > output.tsv

  3. We write the data from the top_athletes table to our output file by executing explicit SQL statements:

    $ hive -S -e "use ch3; select * from top_athletes;" >> output.tsv

  4. We can verify the contents of our file by simply using the cat command:

    $ cat output.tsv
    name	num_medals
    Michael Phelps	22
    Larissa Latynina	14
    Jenny Thompson	12
    Nikolai Andrianov	12
    Matt Biondi	11
    Ole Einar Bjørndalen	11
    Ryan Lochte	11
    Boris Shakhlin	10
    Takashi Ono	10
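The generated file can also be sanity-checked with standard tools. In this sketch a two-row sample stands in for the real output.tsv, so the snippet runs without a Hive installation:

```shell
# Write a two-row sample in the same format as the generated file,
# then inspect the header and count the data rows.
printf 'name\tnum_medals\nMichael Phelps\t22\n' > output.tsv
head -n 1 output.tsv           # the header row
tail -n +2 output.tsv | wc -l  # number of data rows
```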

How it works...

From the command line, Hive supports three mutually exclusive modes. In addition to the interactive mode, we can pass commands to Hive using two different flags.

Flag     Argument          Effect
-f       filename          Execute the contents of the file
-e       SQL statements    Execute the argument as input
(none)   (none)            Run Hive interactively

For all three of these cases, Hive also accepts the -i filename parameter for passing an initialization script. This script is executed first; the session then continues with the file contents, the explicit statements, or the interactive prompt.

Hive supports using the -i flag multiple times, so we can make our initialization scripts even more reusable by splitting them according to their functionality. For example, we could have one initialization script with common settings used by all jobs running on a particular cluster and a second initialization script for all jobs that use particular table definitions.
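As a sketch of that split (the file names here are hypothetical), the shared settings and the table definitions each get their own initialization file; the final hive invocation is echoed rather than executed, since running it requires a Hive installation:

```shell
# Hypothetical split of init.hql into two reusable pieces.
cat > cluster_settings.hql <<'EOF'
-- settings shared by every job on this cluster
set hive.exec.mode.local.auto=true;
EOF
cat > table_defs.hql <<'EOF'
-- context shared by jobs that use these tables
use ch3;
EOF
# Both files are passed with -i and run in order before script.hql:
echo hive --hivevar threshold=10 -i cluster_settings.hql -i table_defs.hql -f script.hql
```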

In this example, we first use our two files to create the input and output tables and run a query. By using interactive scripts, we can separate the context of our query (the database, data loading, and Hive configuration) from the logic and output. Different processes can share the same context without needing to duplicate the contents of the initialization file across all of their scripts.

We also make our script.hql file reusable with different thresholds through variable substitution. Hive will automatically replace any occurrences of ${variable} with the values passed to the hive command via the --hivevar parameter. The -d and --define parameters are synonyms for --hivevar, and all of these parameters may be specified multiple times if necessary.
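The effect of that substitution can be illustrated with a sed pipeline; Hive performs this replacement itself when --hivevar threshold=10 is passed, so this is only a simulation of what the query looks like after substitution:

```shell
# Simulate Hive's ${variable} substitution (illustration only; Hive
# does this internally for values passed via --hivevar).
template='select name, size(medals_won) as num_medals from athlete where size(medals_won) >= ${threshold}'
threshold=10
echo "$template" | sed "s/\${threshold}/$threshold/"
# prints: select name, size(medals_won) as num_medals from athlete where size(medals_won) >= 10
```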

The -v flag simply puts Hive into verbose mode, so each statement is echoed to the console as it is executed. Combining the verbose flag, variable substitution, and our scripts gives us the first shell command we executed:

$ hive -v --hivevar threshold=10 -i init.hql -f script.hql

We then execute two explicit commands, first to describe the columns of the top_athletes table, and then to output its contents. These are redirected to our output file on the local filesystem.

The -S flag puts Hive into silent mode, so only the output of our queries will be written to the files. This helps us capture only the table contents to our output file.

$ hive -S -e "use ch3; describe top_athletes;" | cut -f 1 | paste -s - > output.tsv
$ hive -S -e "use ch3; select * from top_athletes;" >> output.tsv
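What cut and paste contribute to the first command can be seen by simulating describe's two-column, tab-separated output with printf: cut keeps only the column names, and paste joins them into a single tab-separated header row.

```shell
# Simulated `describe top_athletes` output (name<TAB>type per line),
# reduced to column names and joined into one header line.
printf 'name\tstring\nnum_medals\tint\n' | cut -f 1 | paste -s -
# prints: name and num_medals on one line, separated by a tab
```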

In Hive versions 0.10.0 and later, we can use the --database ch3 flag instead of specifying use ch3; as part of the query. Alternatively, we could refer to the table by its full name, ch3.top_athletes.

Summary

In this article, we learned how Hive supports use cases such as periodic ETL jobs by rerunning the top athletes query in batch mode from the command line.

About the Author


Darren Lee

Darren Lee studied mathematics at Stanford and New York University. He is currently the lead engineer on the data team at Bizo in San Francisco, CA. He has been using Hive since 2009 and occasionally blogs about it at http://dev.bizo.com.
