In this chapter, we will cover the following recipes:
Deploying Hive on a Hadoop cluster
Deploying Hive Metastore
Installing Hive
Configuring HCatalog
Understanding different components of Hive
Compiling Hive from source
Hive packages
Debugging Hive
Running Hive
Changing configurations at runtime
Hive, an Apache Hadoop ecosystem component, was developed by Facebook to query the data stored in the Hadoop Distributed File System (HDFS). Here, HDFS is the data storage layer of Hadoop: at a very high level, it divides the data into small blocks (128 MB by default) and stores these blocks on different nodes.
Hive provides a SQL-like query language named Hive Query Language (HQL) to access and analyze big data. It is also termed the data warehousing framework of Hadoop and provides various analytical features, such as windowing and partitioning.
Hive is supported by a wide variety of platforms. GNU/Linux and Windows are commonly used as the production environment, whereas Mac OS X is commonly used as the development environment.
In this book, we will assume a GNU/Linux-based installation of Apache Hive for installation and other instructions.
Before installing Hive, the first step is to make sure that a Java SE environment is installed properly. Hive requires Java version 6 or later, which can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
To install Hive, download it from http://hive.apache.org/downloads.html and unpack it. Choose the latest stable version.
By default, Hive is configured to use an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. Out of the box, this points to a local metastore_db directory, as set in the conf/hive-default.xml file. Hive with Derby as the metastore in embedded mode allows at most one user at a time.
The other modes of installation are Hive with local metastore and Hive with remote metastore, which will be discussed later.
Apache Hive is a client-side library that provides a table-like abstraction on top of the data in HDFS for data processing. Hive jobs are converted into a MapReduce plan, which is then submitted to the Hadoop cluster. A Hadoop cluster is a set of nodes or machines with HDFS, MapReduce, and YARN deployed on them. MapReduce works on the distributed data stored in HDFS and processes large datasets in parallel, as compared with traditional processing engines that process the whole task on a single machine and wait hours or days for a single query. Yet Another Resource Negotiator (YARN) is used to manage the RAM and CPU cores of the whole cluster, which are critical for running any process on a node.
The Hive table and database definitions, and their mapping to the data in HDFS, are stored in a metastore. A metastore is a central repository for Hive metadata. A metastore consists of two main components, which are really important for working with Hive. Let's take a look at these components:
Services to which the client connects and queries the metastore
A backing database to store the metadata
In Hive, the metastore (service and RDBMS database) can be configured in one of the following ways:
An embedded metastore
A local metastore
A remote metastore
When we install Hive on a preinstalled Hadoop cluster, Hive, by default, gets the embedded database. This means that we need not configure any database as the Hive metastore. Let's check out what these configurations are and why we call them the embedded, local, and remote metastores.
By default, the metastore service and the Hive service run in the same JVM. Hive needs a database to store metadata. In the default mode, it uses an embedded Derby database stored on the local file system. The embedded mode of Hive has the limitation that only one session can be opened at a time from the same location on a machine, as only one process can acquire a lock on the embedded Derby database files on disk:

An embedded metastore has a single service and a single JVM, and it cannot serve multiple nodes at a time.
To overcome this limitation, a separate RDBMS database runs on the same node. The metastore service and the Hive service still run in the same JVM. This configuration mode is named local metastore. Here, local means that the database runs on the same node as the JVM and the Hive service.
There is one more configuration, where one or more metastore servers run in JVM processes separate from the Hive service and connect to a database on a remote machine. This configuration is named remote metastore.
The Hive service is configured to use a remote metastore by setting hive.metastore.uris to the metastore server URIs, separated by commas. The Hive metastore can be configured using the properties specified in the following sections.
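For example, a Hive client pointing at two remote metastore servers would set the property as follows; the hostnames are placeholders, and 9083 is the default metastore port:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore1.example.com:9083,thrift://metastore2.example.com:9083</value>
</property>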
In the following diagram, the pictorial representation of the metastore and driver is given:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>The directory relative to fs.default.name where managed tables are stored.</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value></value>
  <description>The URIs specifying the remote metastore servers to connect to. If there are multiple remote servers, clients connect in a round-robin fashion.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=hivemetastore;create=true</value>
  <description>The JDBC URL of the database.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>The JDBC driver classname.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>username</value>
  <description>The metastore username to connect with.</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
  <description>The metastore password to connect with.</description>
</property>
We will now take a look at installing Hive along with all the prerequisites.
Let's download the stable version from one of the mirrors:
$ wget http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
This can be achieved in three ways, matching the three metastore modes: with an embedded metastore, a local metastore, or a remote metastore.
Once you have downloaded the Hive tarball, installing and setting up Hive is pretty simple and straightforward. Extract the compressed tar file:
$ tar -xzvf apache-hive-1.2.1-bin.tar.gz
Export the location where Hive is extracted as the environment variable HIVE_HOME:
$ cd apache-hive-1.2.1-bin
$ export HIVE_HOME=$(pwd)
Hive has all its installation scripts in the $HIVE_HOME/bin directory. Export this location to the PATH environment variable so that you can run all the scripts from any location directly on the command line:
$ export PATH=$HIVE_HOME/bin:$PATH
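To confirm that the scripts are picked up from the updated PATH, you can run a quick check; which locates the script and hive --version prints the installed release:
$ which hive
$ hive --version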
Alternatively, if you want to set the Hive path permanently for the user, add the Hive environment variables to the .bashrc or .bash_profile file in the user's home folder (create the file if it does not exist).
Add the following to ~/.bash_profile:
export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin
Here, hduser is the name of the user you are logged in as, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.
Run Hive from a terminal:
hive
Make sure that the Hive node has a connection to the Hadoop cluster; that is, Hive should be installed on one of the Hadoop nodes, or the Hadoop configurations should be available on the node's classpath.
This installation uses the embedded Derby database and stores the data on the local filesystem. Only one Hive session can be open on the node.
If different users try to run the Hive shell at the same time, the second one will get the Failed to start database 'metastore_db' error.
Run Hive queries against the metastore to test the installation:
hive> SHOW TABLES;
hive> CREATE TABLE sales(id INT, product STRING, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
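To exercise the new table further, you can load a tab-delimited file and read it back. The file path below is only an example; any local file with tab-separated id, product, and age fields will do:
hive> LOAD DATA LOCAL INPATH '/tmp/sales.txt' INTO TABLE sales;
hive> SELECT * FROM sales;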
Logs are generated on a per-user basis in the /tmp/<username> folder.
Follow these steps to configure Hive with the local metastore. Here, we are using the MySQL database as a metastore:
Add the following to ~/.bash_profile:
export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin
Here, hduser is the user name, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.
Install a SQL database such as MySQL on the same machine where you want to run Hive.
On Ubuntu, MySQL can be installed by running the following command on the node's terminal:
sudo apt-get install mysql-server
In the case of MySQL, Hive needs the mysql-connector jar. Download the latest mysql-connector jar from http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz and copy it to the lib folder of your Hive home directory.
Create a file, hive-site.xml, in the conf folder of Hive and add the following entries to it:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hduser</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>passwd</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>
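Hive expects the configured MySQL user to exist with privileges on the metastore database. A minimal sketch, assuming the hduser and passwd credentials from the configuration above:
mysql -u root -p
mysql> CREATE USER 'hduser'@'localhost' IDENTIFIED BY 'passwd';
mysql> GRANT ALL PRIVILEGES ON metastore_db.* TO 'hduser'@'localhost';
mysql> FLUSH PRIVILEGES;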
Start the Hive shell:
hive
Note
There is a known "JLine
" jar conflict issue with Hadoop 2.6.0 and Hive 1.2.1. If you are getting the error "unable to load class jline.terminal
," you need to remove the older version of the jline
jar from the yarn
lib folder using the following command:
sudo rm -r $HADOOP_PREFIX/share/hadoop/yarn/lib/jline-0.9.94.jar
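Alternatively, instead of deleting the jar, you can make Hadoop prefer the user classpath (which contains Hive's newer JLine) by exporting the following variable before starting Hive:
export HADOOP_USER_CLASSPATH_FIRST=true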
Follow these steps to configure Hive with a remote metastore.
Download the latest version of Hive from http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz.
Extract the package:
tar -xzvf apache-hive-1.2.1-bin.tar.gz
Add the following to ~/.bash_profile:
sudo nano ~/.bash_profile
export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin
Here, hduser is the user name, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.
Install a SQL database such as MySQL on a remote machine to be used for the metastore.
For Ubuntu, MySQL can be installed with the following command:
sudo apt-get install mysql-server
In the case of MySQL, Hive needs the mysql-connector jar file. Download the latest mysql-connector jar from http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz and copy it to the lib folder of your Hive home directory.
Add the following entries to hive-site.xml:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://<ip_of_remote_host>:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hduser</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>passwd</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>
Start the Hive metastore interface:
bin/hive --service metastore &
Start the Hive shell:
hive
The Hive metastore interface listens on port 9083 by default. You can verify that the service is running with the following command:
netstat -an | grep 9083
Start the Hive shell and make sure that the Hive Data Definition Language and Data Manipulation Language (DDL and DML) operations are working by creating tables in Hive.
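Each Hive client that should use this remote metastore needs hive.metastore.uris set in its own hive-site.xml. The host below mirrors the placeholder used earlier:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<ip_of_remote_host>:9083</value>
</property>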
Note
There is a known "JLine" jar conflict issue with Hadoop 2.6.0 and Hive 1.2.1. If you are getting the error "unable to load class jline.terminal," you need to remove the older version of jline jar from the yarn lib folder using the following command:
sudo rm -r $HADOOP_PREFIX/share/hadoop/yarn/lib/jline-0.9.94.jar
Assuming that Hive has been configured in the remote metastore, let's look into how to install and configure HCatalog.
The HCatalog CLI supports the following command-line options:
-g: The group for newly created tables (for example, hcat -g mygroup)
-p: The permissions for newly created tables (for example, hcat -p rwxr-xr-x)
-f: A file containing DDL commands to execute (for example, hcat -f myscript.hcatalog)
-e: A DDL command passed directly on the command line (for example, hcat -e 'CREATE TABLE mytable(a INT);')
-D: A key-value pair passed to Hive as a Java system property (for example, hcat -Dkey=value)
Besides the Hive metastore, Hive components can be broadly classified as Hive clients and Hive servers. Hive servers provide interfaces to make the metastore available to external applications and check for users' authorization and authentication, while Hive clients are the various applications used to access and execute Hive queries on the Hadoop cluster.
The metastore service starts as a Java process in the backend. You can start the Hive metastore service with the following command:
hive --service metastore &
HiveServer2 is an interface that allows clients to execute Hive queries and get the results. It is based on Thrift RPC and, unlike the original HiveServer, which supported only a single client, it supports multiple concurrent clients. It also provides for the authentication and authorization of users.
The HiveServer2 service also starts as a Java process in the backend. You can start HiveServer2 with the following command:
hive --service hiveserver2 &
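HiveServer2 listens on port 10000 by default. If that port is taken, the Thrift port can be overridden at startup with the standard --hiveconf switch, as in this sketch:
hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10001 &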
The following are the different clients available in Hive to query metastore data or to submit Hive queries to Hive servers.
Hive Command-line Interface (CLI) can be used to run Hive queries in either interactive or batch mode.
To run Hive CLI, use the following command:
$HIVE_HOME/bin/hive
Queries are submitted under the username of the user logged in to the UNIX system.
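For example, a single query can be run in batch mode with the -e option, or a script file can be executed with -f; both are standard Hive CLI flags, and the script path here is only an example:
$HIVE_HOME/bin/hive -e 'SHOW TABLES;'
$HIVE_HOME/bin/hive -f /home/hduser/queries.hql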
In this recipe, we will see how to compile Hive from source.
Apache Hive is an open source framework available for compilation and modification by any user. The Hive source code is a Maven project. The source includes scripts that are executed on a UNIX platform during compilation.
The following prerequisites need to be installed:
UNIX OS: UNIX is preferable for Hive source compilation. Although the source can also be compiled on Windows, you need to comment out the execution of these scripts.
Maven: The following are the steps to configure Maven:
Download the Apache Maven binaries for Linux (.tar.gz) from https://maven.apache.org/download.cgi:
wget http://mirror.olnevhost.net/pub/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
Extract the tar file:
tar -xzvf apache-maven-3.3.3-bin.tar.gz
Create a folder and move the Maven binaries to that folder:
sudo mkdir -p /usr/lib/maven
mv apache-maven-3.3.3-bin /usr/lib/maven/
Open /etc/environment:
sudo nano /etc/environment
Add the following variables for the environment PATH:
export M2_HOME=/usr/lib/maven/apache-maven-3.3.3-bin
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
Use the command source /etc/environment to add the variables to PATH without a restart:
source /etc/environment
Check whether Maven is properly installed:
mvn -version
Follow these steps to compile Hive on a Unix OS:
Download the latest version of the Hive source tar file:
sudo wget http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-src.tar.gz
Extract the source folder:
tar -xzvf apache-hive-1.2.1-src.tar.gz
Move to the Hive directory:
cd apache-hive-1.2.1-src
To import the Hive packages into Eclipse, run the following command:
mvn eclipse:eclipse
To compile Hive with Hadoop 2 binaries, run the following command:
mvn clean install -Phadoop-2,dist
In case you want to skip test execution, run the earlier command with the following switch:
mvn clean install -DskipTests -Phadoop-2,dist
To generate a tarball file from the source code, run the following command:
mvn clean package -DskipTests -Phadoop-2 -Pdist
The following are the various sections included in Hive packages.
Hive source consists of different modules categorized by the features they provide or as a submodule of some other module.
The following is the list of Hive modules and their usage in Hive:
accumulo-handler: Apache Accumulo is a distributed key-value datastore based on Google's Bigtable. This package includes the components responsible for mapping a Hive table to an Accumulo table. AccumuloStorageHandler and AccumuloPredicateHandler are the main classes responsible for mapping tables. For more information, refer to the official integration documentation available at https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration.
ant: This tool is used to build earlier versions of the Hive source. Ant is also needed to configure the Hive Web Interface server.
beeline: A Hive client used to connect with HiveServer2 and run Hive queries.
bin: This package includes scripts to start Hive clients and services.
cli: This is the Hive command-line interface implementation.
common: These are utility classes used by other modules.
conf: This contains the default configurations and user-defined configuration objects.
contrib: This contains Serdes, generic UDFs, and file formats contributed by third parties to Hive.
hbase-handler: This module allows Hive SQL statements to access HBase tables for SELECT and INSERT commands. It also provides interfaces to access HBase and Hive tables for join and union in a single query. More information is available at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.
hcatalog: This is a table management framework that helps other frameworks such as Pig or MapReduce to access the Hive metastore and table schema.
hwi: This module provides an implementation of a web interface to run Hive queries. Also, the WebHCat APIs provide REST APIs to access the Hive metastore.
jdbc: This is a connector that accepts JDBC connections and calls to execute Hive queries on the cluster.
metastore: This is the API that provides access to metastore entities, including database, table, schema, and serdes.
odbc: This module implements the Open Database Connectivity (ODBC) API, enabling ODBC applications to connect and execute queries over Hive.
ql: This module provides an interface to clients that checks for query semantics and provides an implementation for the driver, parser, and query planner.
serde: This module has an implementation of the serializer and deserializer used by Hive to read and write data. It helps in validating and parsing record and field types.
shims: This module transparently intercepts and modifies calls to the Hive API, usually for compatibility purposes, such as across Hadoop versions.
spark-client: This module provides an interface to execute Hive SQL on the Spark framework.
Here, we will take a quick look at the command-line debugging option in Hive.
Hive code can be debugged by assigning a port to Hive and adding socket details to the Hive JVM. To add the debugging configuration to Hive, set the following properties on an OS terminal or add them to the .bash_profile of the user:
export HIVE_DEBUG_PORT=8000
export HIVE_DEBUG="-Xdebug -Xrunjdwp:transport=dt_socket,address=${HIVE_DEBUG_PORT},server=y,suspend=y"
Once a debug port is attached to Hive and Hive server suspension is enabled at startup, the following steps will help you debug Hive queries:
After defining the previously mentioned properties, run the Hive CLI in debug mode:
hive --debug
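Because suspend=y is set, the CLI waits at startup until a debugger attaches to the configured port. As a minimal sketch, assuming the default port 8000 exported earlier, you can attach with jdb from another terminal:
jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=8000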
If you have written your own Test class and want to execute the unit test cases in that class, you need to execute the following command, specifying the class name:
mvn test -Dtest=ClassName
Let's see how to run Hive from the command-line.
Once you have the binaries of Hive either compiled or downloaded, you need to configure a metastore for Hive where it keeps information about different entities. Once that is configured, start Hive metastore and HiveServer2 to access the entities from different clients.
Follow these steps to start different components of Hive on a node:
Run Hive CLI:
$HIVE_HOME/bin/hive
Run HiveServer2 and Beeline:
$HIVE_HOME/bin/hiveserver2
$HIVE_HOME/bin/beeline -u jdbc:hive2://$HiveServer2_HOST:$HiveServer2_PORT
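For example, with HiveServer2 running locally on its default port 10000, a Beeline connection could look like the following; the username is a placeholder:
$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -n hduser -e 'SHOW TABLES;'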
Run HCatalog and start up the HCatalog server:
$HIVE_HOME/hcatalog/sbin/hcat_server.sh
Run the HCatalog CLI:
$HIVE_HOME/hcatalog/bin/hcat
Run WebHCat:
$HIVE_HOME/hcatalog/sbin/webhcat_server.sh
Let's see how we can change various configuration settings at runtime.
Follow these steps to change any of the Hive configuration properties at runtime for a particular session or query:
Configuration for Hive and the underlying MapReduce can be changed at runtime through Beeline or the CLI. The general syntax to set a property is as follows:
SET key=value;
The configuration set is applicable only for that session. If you want to set it permanently, you need to set it in hive-site.xml. The examples are as follows:
beeline> SET mapred.job.tracker=example.host.com:50030;
hive> SET hive.exec.mode.local.auto=false;
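You can also inspect the current value of a property by issuing SET with the key alone, or list all properties and their values with SET -v:
hive> SET hive.exec.mode.local.auto;
hive> SET -v;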