(For more resources related to this topic, see here.)

Getting ready

Here are the initial steps to be followed:

Create the init.hql file in the current directory. This file sets up the database, specifies our Hive settings, and creates our input table.

create database if not exists ch3 ;
use ch3 ;
set hive.exec.mode.local.auto=true ;
create table if not exists athlete(
name string,
id string,
demonstration_events_competed_in array<string>,
demonstration_medals_won array<string>,
country array<string>,
medals_won array<string>)
row format delimited
fields terminated by 't'
collection items terminated by ',' ;
load data
local inpath 'data/olympic_athlete.tsv'
overwrite into table athlete ;

Create the script.hql in the current directory. This file creates our output table and executes a query.

create table if not exists top_athletes(
name string,
num_medals int) ;
insert overwrite table top_athletes
select name, size(medals_won) as num_medals
from athlete
where size(medals_won) >= ${threshold}
order by num_medals desc, name asc ;

How to do it...

Follow these steps to complete the task:

We start by running our initialization and query scripts from the command line:
```
$ hive -v --hivevar threshold=10 -i init.hql -f script.hql
```
We write the header of our top_athletes table to a file using the standard cut and paste Unix tools
```
$ hive -S -e "use ch4; describe top_athletes;" | cut -f 1 | paste
-s - > output.tsv
```
We write the data from the top_athletes table to our output file by executing explicit SQL statements:
```
$ hive -S -e "use ch4; select * from top_athletes ;" >> output.tsv
```

We can verify the contents of our file by simply using the cat command:

$ cat output.tsv name num_medals Michael Phelps 22 Larissa Latynina 14 Jenny Thompson 12 Nikolai Andrianov 12 Matt Biondi 11 Ole Einar Bjørndalen 11 Ryan Lochte 11 Boris Shakhlin 10 Takashi Ono 10

How it works...

From the command line, Hive supports three mutually exclusive modes. In addition to the interactive mode, we can pass commands to Hive using two different flags.

Flag	Argument	Effect
-f	filename Unlock access to the largest independent learning library in Tech for FREE! Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of. Renews at €18.99/month. Cancel anytime	Execute the contents of the file
-e	SQL statements	Execute the argument as input
(none)	(none)	Run Hive interactively

For all three of these cases, Hive allows using the -i filename parameter for passing an initialization script. This script will be executed first, then the session will continue with the file contents, explicit statements, or interactive session.

Hive supports using the -i flag multiple times, so we can make our initialization scripts even more reusable by splitting them according to their functionality. For example, we could have one initialization script with common settings used by all jobs running on a particular cluster and a second initialization script for all jobs that use particular table definitions.

In this example, we first use our two files to create the input and output tables and run a query. By using interactive scripts, we can separate the context of our query (the database, data loading, and Hive configuration) from the logic and output. Different processes can share the same context without needing to duplicate the contents of the initialization file across all of their scripts.

We also make our script.sql file reusable with different thresholds through variable substitution. Hive will automatically replace any occurrences of ${variable} with the values passed to the hive command via the --hivevar parameter. The -d and --define parameters are synonyms for --hivevar, and all of these parameters may be specified multiple times if necessary.

The -v flag simply puts Hive into verbose mode, so each statement is echoed to the console as it is executed. Combining the verbose flag, variable substitution, and our scripts gives us the first shell command we executed:

$ hive-v --hivevar threshold=10 -iinit.hql -f script.hql

We then execute two explicit commands, first to describe the columns of the top_athletes table, and then to output its contents. These are redirected to our output file on the local filesystem.

The -S flag puts Hive into silent mode, so only the output of our queries will be written to the files. This helps us capture only the table contents to our output file.

$ hive -S -e "use ch3; describe top_athletes;" | cut -f 1 | paste -s - >
output.tsv
$ hive -S -e "use ch3; select * from top_athletes ;" >> output.tsv

In Hive Versions 0.10.0 and later, we can use the -database ch3 flag instead of specifying use ch3; as part of the query. Alternatively, we could refer to the table by its full name ch3. top_athletes.

Summary

In this recipe, we learned how Hive supports uses cases, such as periodic ETL jobs, by rerunning the top athletes query in batch mode from the command line.