Using Hive non-interactively (Simple)


Getting ready

Here are the initial steps to be followed:

  1. Create the init.hql file in the current directory. This file sets up the database, specifies our Hive settings, and creates our input table.

    create database if not exists ch3 ;
    use ch3 ;
    set ;
    create table if not exists athlete(
    name string,
    id string,
    demonstration_events_competed_in array<string>,
    demonstration_medals_won array<string>,
    country array<string>,
    medals_won array<string>)
    row format delimited
    fields terminated by '\t'
    collection items terminated by ',' ;
    load data
    local inpath 'data/olympic_athlete.tsv'
    overwrite into table athlete ;

  2. Create the script.hql file in the current directory. This file creates our output table and executes a query.

    create table if not exists top_athletes(
    name string,
    num_medals int) ;
    insert overwrite table top_athletes
    select name, size(medals_won) as num_medals
    from athlete
    where size(medals_won) >= ${threshold}
    order by num_medals desc, name asc ;

How to do it...

Follow these steps to complete the task:

  1. We start by running our initialization and query scripts from the command line:

    $ hive -v --hivevar threshold=10 -i init.hql -f script.hql

  2. We write the header of our top_athletes table to a file using the standard cut and paste Unix tools:

    $ hive -S -e "use ch3; describe top_athletes;" | cut -f 1 | paste -s - > output.tsv

  3. We write the data from the top_athletes table to our output file by executing explicit SQL statements:

    $ hive -S -e "use ch3; select * from top_athletes ;" >> output.tsv

  4. We can verify the contents of our file by simply using the cat command:

    $ cat output.tsv
    name num_medals
    Michael Phelps 22
    Larissa Latynina 14
    Jenny Thompson 12
    Nikolai Andrianov 12
    Matt Biondi 11
    Ole Einar Bjørndalen 11
    Ryan Lochte 11
    Boris Shakhlin 10
    Takashi Ono 10
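
Beyond reading the file by eye, a quick automated check can be useful. The following is a minimal sketch, not part of the original recipe: it recreates a few of the rows shown above in a scratch file (the variable names sample, threshold, and bad_rows are our own) and uses awk to confirm that every data row meets the threshold and that the counts never increase from one row to the next.

```shell
# Recreate a few rows of output.tsv from step 4 in a scratch file (tab-separated).
sample=$(mktemp)
printf 'name\tnum_medals\n' > "$sample"
printf 'Michael Phelps\t22\nLarissa Latynina\t14\nTakashi Ono\t10\n' >> "$sample"

# Same value we passed to Hive with --hivevar.
threshold=10

# Skip the header (NR > 1) and count rows whose medal count is below the
# threshold or larger than the previous row's count (which would break the
# descending sort order).
bad_rows=$(awk -F '\t' -v min="$threshold" '
    NR > 1 {
        if ($2 < min || (NR > 2 && $2 > prev)) bad++
        prev = $2
    }
    END { print bad + 0 }
' "$sample")

echo "rows violating the threshold or sort order: $bad_rows"
rm -f "$sample"
```

To check the real file, point awk at output.tsv instead of the scratch file.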

How it works...

From the command line, Hive supports three mutually exclusive modes. In addition to the interactive mode that we have used in previous sections, we can pass commands to Hive using two different flags:

    -f filename            Execute the contents of the file
    -e 'SQL statements'    Execute the argument as input
    (no flag)              Run Hive interactively

For all three of these cases, Hive allows using the -i filename parameter for passing an initialization script. This script will be executed first, then the session will continue with the file contents, explicit statements, or interactive session.

Hive supports using the -i flag multiple times, so we can make our initialization scripts even more reusable by splitting them according to their functionality. For example, we could have one initialization script with common settings used by all jobs running on a particular cluster and a second initialization script for all jobs that use particular table definitions.
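
As a rough illustration of that split, the sketch below writes two small initialization scripts and assembles a command line that passes both. The file names cluster.hql and tables.hql and the hive.exec.parallel setting are assumptions chosen for the example; the recipe itself uses a single init.hql. The command is only printed, not executed.

```shell
# Hypothetical file names; the recipe itself uses a single init.hql.
workdir=$(mktemp -d)

# Settings shared by every job on the cluster (example property only).
cat > "$workdir/cluster.hql" <<'EOF'
set hive.exec.parallel=true ;
EOF

# Database selection shared by every job that works with the ch3 tables.
cat > "$workdir/tables.hql" <<'EOF'
create database if not exists ch3 ;
use ch3 ;
EOF

# Each -i flag names one initialization script; -f names the query script.
hive_cmd="hive -i $workdir/cluster.hql -i $workdir/tables.hql -f script.hql"

# Print the command rather than running it, since this is only a sketch.
echo "$hive_cmd"
```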

In this example, we first use our two files to create the input and output tables and run a query. By using initialization scripts, we can separate the context of our query (the database, data loading, and Hive configuration) from the logic and output. Different processes can share the same context without needing to duplicate the contents of the initialization file across all of their scripts.

We also make our script.hql file reusable with different thresholds through variable substitution. Hive automatically replaces any occurrence of ${variable} with the value passed to the hive command via the --hivevar parameter. The -d and --define parameters are synonyms for --hivevar, and all of these parameters may be specified multiple times if necessary.
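
To show how the same script could be reused with several thresholds, here is a small sketch of a shell wrapper. The function name run_top_athletes and the DRY_RUN switch are our own additions for illustration; with DRY_RUN set to 1 (the default here) the command is printed instead of executed, so the sketch can be tried without a Hive installation.

```shell
# Print commands by default; set DRY_RUN=0 to actually invoke hive.
DRY_RUN="${DRY_RUN:-1}"

# Build (and optionally run) the hive command line for a given threshold.
run_top_athletes() {
    threshold="$1"
    # Replace the function's positional parameters with the full command line.
    set -- hive --hivevar "threshold=${threshold}" -i init.hql -f script.hql
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Each pass substitutes a different value for ${threshold} in script.hql.
for t in 5 10 15; do
    run_top_athletes "$t"
done
```

Because script.hql uses insert overwrite, each real run would replace the contents of top_athletes, so the table would hold only the results of the last threshold.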

The -v flag simply puts Hive into verbose mode, so each statement is echoed to the console as it is executed. Combining the verbose flag, variable substitution, and our scripts gives us the first shell command we executed:

    $ hive -v --hivevar threshold=10 -i init.hql -f script.hql

We then execute two explicit commands, first to describe the columns of the top_athletes table, and then to output its contents. These are redirected to our output file on the local filesystem.

The -S flag puts Hive into silent mode, so only the output of our queries will be written to the files. This helps us capture only the table contents to our output file.

    $ hive -S -e "use ch3; describe top_athletes;" | cut -f 1 | paste -s - > output.tsv

    $ hive -S -e "use ch3; select * from top_athletes ;" >> output.tsv
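
To see what the cut and paste stages contribute, the pipeline can be tried on simulated input. This is a sketch under the assumption that describe prints one column per line as tab-separated fields (name, type, comment); the exact layout can vary between Hive versions, and the variable names describe_output and header are ours.

```shell
# Simulated `describe top_athletes` output: one table column per line,
# tab-separated (assumed layout: column name, type, empty comment).
describe_output=$(printf 'name\tstring\t\nnum_medals\tint\t\n')

# cut -f 1 keeps only the first tab-separated field of each line;
# paste -s - joins all of those lines into a single tab-separated line.
header=$(printf '%s\n' "$describe_output" | cut -f 1 | paste -s -)

printf '%s\n' "$header"
```

The resulting single line has the same tab-separated shape as the data rows that the select statement appends afterwards.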

In Hive versions 0.10.0 and later, we can use the --database ch3 flag instead of specifying use ch3; as part of the query. Alternatively, we could refer to the table by its full name, ch3.top_athletes.


In this recipe, we learned how Hive supports use cases such as periodic ETL jobs by rerunning the top athletes query in batch mode from the command line.

You've been reading an excerpt of:

Instant Apache Hive Essentials How-to

Explore Title