Chapter 7. Data Ingestion and Workflow
In this chapter, we will cover the following topics:
Hive server modes and setup
Using MySQL for Hive metastore
Operating Hive with ZooKeeper
Loading data into Hive
Partitioning and Bucketing in Hive
Hive metastore database
Designing Hive with credential store
Configuring Flume
Configure Oozie and workflows
Firstly, let us understand what Apache Hive is. Apache Hive is a data warehousing infrastructure built on top of Hadoop that lets you query data using SQL. The goal of Hive is to help existing SQL users transition quickly to Hadoop for working with structured data, without worrying about the complexities of the Hadoop framework.
In this chapter, we will configure the various methods of data ingestion. Most readers will be well acquainted with databases and their usage. Databases are also an integral part of Hadoop, and they play a very important role in storing, organizing, and retrieving data quickly.
Hadoop works with three types of database: NoSQL databases, such as HBase; warehousing solutions, such as Hive; and, for storing metadata, traditional RDBMSs, such as MySQL. In this chapter, we will cover the Hive warehouse and its modes of operation, along with other data ingestion tools, such as Flume, and workflow schedulers, such as Oozie.
Hive server modes and setup
In this recipe, we will look at how to set up a Hive server and use it to query data stored in a distributed system.
Apache Hive is a client-side library that provides a warehouse solution, enabling data on HDFS to be represented in a structured format and queried using SQL. The table definitions and mappings are stored in a metastore, which is a combination of a service and a database.
The Hive metastore can run in any of three modes: standalone, local metastore, and remote metastore mode. Standalone or embedded mode is not used in production as it limits the number of connections to just one, and everything runs inside a single JVM.
The Hive driver, the metastore interface, and the database are the three components that make up a Hive connection. The default database is Derby, which is used in standalone mode. In production, an external JDBC-compliant database, such as MySQL, is used in place of Derby, as Derby supports only one client connection, which would not...
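For reference, the embedded-mode defaults in hive-site.xml look roughly like the following sketch (the values shown are the stock Derby defaults, not tuned for any particular cluster):

```xml
<!-- Embedded (standalone) mode: Derby runs inside the same JVM as Hive -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
```

In the recipes that follow, these two properties are what change when we move the metastore to an external database.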
Using MySQL for Hive metastore
The preceding configuration is the embedded mode configuration, which is not production ready and should not be used.
Now, we will enable the metastore to connect to an external MySQL database and scale to multiple connections. This will be the local metastore mode.
The assumption here is that the readers know the basics of MySQL user management and can assign grants.
Clean up the old databases by either dropping the tables from Hive or simply cleaning the warehouse location. Note: this should never be done in production. This is shown in the following screenshot:
Now, firstly, we need to install the MySQL server on any node in the cluster or outside the Hadoop cluster. In our case, we will install it on the master node master1.cyrus.com:
Make sure that the firewall on the master node allows connections to port 3306 from the edge node in the cluster. It is better to allow it to connect...
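A sketch of the MySQL side of this setup follows. The database name hive_db matches the metastore database used later in this chapter; the user name and password are example values you should replace:

```sql
-- Run as the MySQL root user on master1.cyrus.com
CREATE DATABASE hive_db;
CREATE USER 'hive_user'@'%' IDENTIFIED BY 'StrongPassword1!';
GRANT ALL PRIVILEGES ON hive_db.* TO 'hive_user'@'%';
FLUSH PRIVILEGES;
```

In production, restrict the `'%'` host wildcard to the specific edge node hostname, consistent with the firewall rule above.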
Operating Hive with ZooKeeper
It is important to configure high availability in production so that if one hiveserver2 instance fails, the others can respond to client requests. This can be achieved by using the ZooKeeper discovery mechanism to point clients to the active Hive servers.
Secondly, to enable concurrency, it is important to run the table manager, which is a lock manager. Both of these features require setting up a ZooKeeper ensemble and configuring the Hive clients to use it.
To progress through this recipe, we need a ZooKeeper ensemble running. Please refer to Chapter 4, High Availability, for details of how to configure ZooKeeper clients. You must also have completed the previous recipe, Using MySQL for Hive metastore.
Connect to the edge node edge1.cyrus.com in the cluster and switch to the hadoop user.
Modify the hive-site.xml file and enable the table manager by using the following properties. This is for concurrency:
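A minimal sketch of such a configuration is shown below. The hostname follows this chapter's example master node; adjust the quorum list to match your ZooKeeper ensemble:

```xml
<!-- Concurrency via the ZooKeeper-based lock manager -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.lock.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <value>master1.cyrus.com:2181</value>
</property>
<!-- HiveServer2 high availability via ZooKeeper service discovery -->
<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>hiveserver2</value>
</property>
```

With service discovery enabled, clients connect through the ZooKeeper quorum rather than to a single HiveServer2 host.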
Loading data into Hive
In this recipe, we look at how we can import data into Hive and also how we can point it to existing data using an external table.
The data storage formats for Hive can be text, ORC, and Parquet, as well as a few other formats. Each has its advantages in terms of compression, performance, space utilization, and memory overheads.
To progress through this recipe, you must have completed the recipe Using MySQL for Hive metastore. The Hive distribution ships with many example files of each type under $HIVE_HOME/examples.
Connect to the edge node edge1.cyrus.com in the cluster and switch to the hadoop user.
Connect by either using Hive or the beeline client and import the data by creating a table as shown in the following screenshot:
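The statements behind this step resemble the stock Hive example; the path to kv1.txt assumes you run the client from $HIVE_HOME:

```sql
CREATE TABLE pokes (foo INT, bar STRING);

-- Load one of the sample files shipped with Hive into the table
LOAD DATA LOCAL INPATH './examples/files/kv1.txt'
  OVERWRITE INTO TABLE pokes;
```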
Now take a look at the HDFS warehouse location. You will see a file named kv1.txt copied there, as shown in the following screenshot:
Describe the table pokes and look at the data, as shown in the following screenshot. What if you...
Partitioning and Bucketing in Hive
In this recipe, we will look at the concepts of partitioning and bucketing in Hive. Partitioning is a scheme by which we reduce the portion of a dataset that must be scanned to answer a query.
For example, let's say we keep a record of the entire population in a Hive table, and when someone asks for a person's details, we query it based on name and gender. This will take a lot of time, as the query has to go through many records. A better approach would be to partition the table by country and then query only the specific subset of records. In the RDBMS world, this is known as sharding. Partitioning lets you organize the data in a logical fashion and skip all but the data you need.
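For illustration, a partitioned version of the population table described above might be declared as follows (the column names are examples):

```sql
CREATE TABLE population (name STRING, gender STRING, age INT)
PARTITIONED BY (country STRING);

-- Only the country='IN' partition is scanned, not the whole table
SELECT name, gender FROM population WHERE country = 'IN';
```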
Bucketing is based on the principle of hashing: rows are placed into buckets, which are themselves files, according to the hash of a chosen column. In simple terms, it clusters the values of a column into a fixed set of groups so as to reduce the number of partitions. Having a large number of partitions is also a bottleneck...
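A bucketed variant of the same hypothetical table, clustering rows by name into a fixed number of files, might look like this:

```sql
CREATE TABLE population_bucketed (name STRING, gender STRING, age INT)
PARTITIONED BY (country STRING)
CLUSTERED BY (name) INTO 32 BUCKETS;
```

On older Hive versions, set `hive.enforce.bucketing=true` in the session before inserting, so that writes actually produce the declared number of bucket files.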
Hive metastore database
In this recipe, we will look at the MySQL database that is used as a metastore database. It is important to understand how the Hive-managed tables are depicted by metadata, and how the metadata database is queried to find the location of tables and their partitions.
For this recipe, you must have completed the Partitioning and Bucketing in Hive recipe and have a basic understanding of MySQL commands and SQL query syntax.
Connect to the MySQL server from any node in the cluster using the following command:
The username and password can be found in the hive-site.xml file.
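The connection command takes roughly this form; the user name here is an assumption, so substitute the values from hive-site.xml:

```shell
mysql -h master1.cyrus.com -u hive_user -p
```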
Switch to the Hive metastore database, which in our case is hive_db. There are many tables in the database that together constitute the metadata for Hive tables.
The VERSION table stores information about the schema version, as shown in the following screenshot:
The TBLS table stores information about the tables, as shown in the following...
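A few illustrative queries against the metastore schema; the table and column names follow the standard Hive metastore layout:

```sql
SELECT * FROM VERSION;

-- List Hive tables known to the metastore
SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM TBLS;

-- Find the HDFS location backing each table
SELECT t.TBL_NAME, s.LOCATION
FROM TBLS t JOIN SDS s ON t.SD_ID = s.SD_ID;
```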
Designing Hive with credential store
In this recipe, we will be configuring the Hive credential store and using authentication for beeline clients to connect and enforce access control.
This is supported in Hive beeline clients. The authentication here is PAM (Pluggable Authentication Modules), but Hive can have its own credential store.
Make sure that you have completed the Using MySQL for Hive metastore recipe for this section and that you understand basic Linux user management.
Connect to the edge node edge1.cyrus.com and switch to the hadoop user.
Edit the hive-site.xml file and add the following lines to it:
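A sketch of PAM-based authentication for HiveServer2 is shown below. The PAM service names are examples; list the services actually configured under /etc/pam.d on your nodes:

```xml
<property>
  <name>hive.server2.authentication</name>
  <value>PAM</value>
</property>
<property>
  <name>hive.server2.authentication.pam.services</name>
  <value>login,sshd</value>
</property>
```

After restarting HiveServer2, beeline clients must supply a valid OS user and password when connecting.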
Configuring Flume
In this recipe, we will cover how to configure Flume for data ingestion. Flume is a general-purpose tool for consuming streaming data, such as log streams or Twitter feeds.
In any organization, we might have hundreds of web servers serving web pages, and we may need to quickly parse these logs for ads targeting or triggering events. These Apache web server logs can be streamed to Flume, from where they can be constantly uploaded to HDFS for processing.
In simple terms, Flume is a distributed, reliable, and efficient way of collecting and aggregating data into HDFS. It has the concepts of Flume agents, sources, channels, and sinks, which together make a robust system. There can be multiple sources, channels, and output paths, such as a non-HDFS filesystem, the HDFS filesystem, or other consumers downstream.
For this recipe, make sure that you have completed the Hadoop cluster setup recipe and have at least a healthy HDFS. Flume can be installed on any node in the cluster, but it is good...
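To make the pieces concrete, a minimal single-agent configuration tailing a web server log into HDFS might look like this (the agent name, log path, and NameNode URI are all example values):

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail the Apache access log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://master1.cyrus.com:9000/flume/weblogs
a1.sinks.k1.hdfs.fileType = DataStream
```

Such an agent is started with `flume-ng agent -n a1 -c conf -f weblog.conf`, where the agent name must match the prefix used in the file.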
Configure Oozie and workflows
In this recipe, we will configure the Oozie workflow engine and look at some examples of scheduling jobs using the Oozie workflow.
Oozie is a scheduler to manage Hadoop jobs, with the ability to make decisions on the conditions or states of previous jobs or the presence of certain files.
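To give a flavor of what such a workflow looks like, here is a minimal sketch of a workflow.xml with a single shell action (the names and variables are examples):

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="shell-node"/>
  <action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>echo</exec>
      <argument>workflow ran</argument>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Shell action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The start, action, kill, and end nodes form a simple state machine; decision nodes can be added to branch on job status or file presence.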
Make sure that you have completed the Hadoop cluster configuration recipe, with the edge node configured. HDFS and YARN must be configured and healthy before starting this recipe.
Connect to the edge node edge1.cyrus.com and switch to the hadoop user.
Download the Oozie source package, untar it, and build it:
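The download and build steps look roughly like the following; the version number and mirror URL are examples, so pick a release that matches your cluster:

```shell
wget https://archive.apache.org/dist/oozie/4.2.0/oozie-4.2.0.tar.gz
tar -xzf oozie-4.2.0.tar.gz
cd oozie-4.2.0

# Build the distribution; skip tests to save time
bin/mkdistro.sh -DskipTests
```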
Edit the pom.xml file to make it suitable for the Java version and the Hadoop version. Change the fields according to the version of Java used. For Hadoop 2.x, it must be version 2.3.0:
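The relevant fields are sketched below; the property names are assumptions based on the Oozie build, so verify them against your pom.xml:

```xml
<properties>
  <targetJavaVersion>1.7</targetJavaVersion>
  <hadoop.version>2.3.0</hadoop.version>
</properties>
```

Alternatively, the Hadoop version can be overridden at build time by passing `-Dhadoop.version=2.3.0` to mkdistro.sh.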