Chapter 7. Data Ingestion and Workflow

In this chapter, we will cover the following topics:

  • Hive server modes and setup

  • Using MySQL for Hive metastore

  • Operating Hive with ZooKeeper

  • Loading data into Hive

  • Partitioning and Bucketing in Hive

  • Hive metastore database

  • Designing Hive with credential store

  • Configuring Flume

  • Configure Oozie and workflows

Introduction


Firstly, let us understand what Apache Hive is. Apache Hive is a data warehousing infrastructure built on top of Hadoop that lets you query data stored in HDFS using SQL. The goal of Hive was to help existing SQL users transition quickly to Hadoop for working with structured data, without worrying about the complexities of the Hadoop framework.

In this chapter, we will configure the various methods of data ingestion. Most readers will be well acquainted with databases and their usage. Databases are also an integral part of Hadoop, and they play a very important role in storing, organizing, and retrieving data quickly.

Hadoop works with three types of database: NoSQL databases such as HBase, warehousing solutions such as Hive, and traditional RDBMSs such as MySQL for storing metadata. In this chapter, we will cover the Hive warehouse and its modes of operation, along with other data ingestion tools, such as Flume, and workflow schedulers, such as Oozie.

Hive server modes and setup


In this recipe, we will look at how to set up a Hive server and use it to query data stored in a distributed system.

Apache Hive is a client-side library that provides a warehouse solution, enabling data on HDFS to be represented in a structured format and queried using SQL. The table definitions and mappings are stored in a metastore, which is a combination of a service and a database.

The Hive metastore can run in any of three modes: standalone, local metastore, and remote metastore mode. Standalone or embedded mode is not used in production as it limits the number of connections to just one, and everything runs inside a single JVM.

The Hive driver, the metastore interface, and the database are the three components that make up a Hive connection. The default database is Derby, which is used in standalone mode. In production, an external JDBC-compliant database, such as MySQL, is used in place of Derby, as Derby supports only one client connection, which would not...
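
For orientation, the following is a minimal sketch of how a client is pointed at a remote metastore service; the hostname master1.cyrus.com mirrors the node used elsewhere in this chapter, and 9083 is the default metastore Thrift port, so adjust both to your environment:

    <!-- hive-site.xml on the client: remote metastore mode (sketch) -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://master1.cyrus.com:9083</value>
    </property>

The metastore service itself is started on the hosting node with hive --service metastore; the local and remote modes differ mainly in whether the metastore code runs inside the client JVM or as this separate service.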

Using MySQL for Hive metastore


The preceding configuration is the embedded mode configuration, which is not production ready and should not be used.

Now, we will enable the metastore to connect to an external MySQL database and scale to multiple connections. This will be the local metastore mode.

The assumption here is that the readers know the basics of MySQL user management and can assign grants.

How to do it…

  1. Clean up the old databases by either dropping them from Hive or simply cleaning the warehouse location. Note that this should never be done in production. This is shown in the following screenshot:

  2. First, we need to install the MySQL server on any node, either within the Hadoop cluster or outside it. In our case, we will install it on the master node master1.cyrus.com:

    # yum install mysql-server -y
    # /etc/init.d/mysqld start
    # chkconfig mysqld on
  3. Make sure that the firewall on the master node allows connections to port 3306 from the edge node in the cluster. It is better to allow it to connect...
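
For reference, a minimal sketch of the hive-site.xml properties that point the metastore at the MySQL database follows; the database name hive_db and the hadoop user match those used later in this chapter, while the password shown is a placeholder for whatever grants you create in MySQL:

    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://master1.cyrus.com:3306/hive_db?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hadoop</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hadoop_password</value>
    </property>

The MySQL JDBC connector jar must also be available on the Hive classpath, typically by copying it into $HIVE_HOME/lib.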

Operating Hive with ZooKeeper


It is important to configure high availability in production so that if one of the HiveServer2 instances fails, the others can respond to client requests. This can be achieved by using the ZooKeeper discovery mechanism to point clients to the active Hive servers.

Secondly, to enable concurrency, it is important to run the table manager, which is a lock manager. Both these features require setting up the ZooKeeper ensemble and configuring the Hive clients to use it.

Getting ready

To progress through the recipe in this section, we need a ZooKeeper ensemble running. Please refer to Chapter 4, High Availability, for details of how to configure ZooKeeper clients. You must also have completed the previous recipe, Using MySQL for Hive metastore.

How to do it...

  1. Connect to the edge node edge1.cyrus.com in the cluster and switch to the hadoop user.

  2. Modify the hive-site.xml file and enable the table manager by using the properties that follow; this is what provides concurrency (a fuller sketch of these settings appears after this step):

    <property...
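
The property listing above is truncated; as a hedged sketch, the settings typically involved for the lock manager and for HiveServer2 discovery through ZooKeeper look like the following, with the quorum value replaced by your own ZooKeeper hostnames:

    <!-- Lock manager for concurrent access (sketch) -->
    <property>
        <name>hive.support.concurrency</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.lock.manager</name>
        <value>org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager</value>
    </property>
    <property>
        <name>hive.zookeeper.quorum</name>
        <value>zk1.cyrus.com,zk2.cyrus.com,zk3.cyrus.com</value>
    </property>
    <!-- HiveServer2 registration with ZooKeeper for client discovery (sketch) -->
    <property>
        <name>hive.server2.support.dynamic.service.discovery</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.zookeeper.namespace</name>
        <value>hiveserver2</value>
    </property>

With discovery enabled, beeline clients can connect through the ensemble with a JDBC URL of the form jdbc:hive2://zk1.cyrus.com:2181,zk2.cyrus.com:2181,zk3.cyrus.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2.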

Loading data into Hive


In this recipe, we look at how we can import data into Hive and also how we can point it to existing data using an external table.

The data storage formats for Hive can be text, ORC, and Parquet, as well as a few other formats. Each one has its advantages in terms of compression, performance, space utilization, and memory overhead.

Getting ready

To progress through the recipe, you must have completed the Using MySQL for Hive metastore recipe. There are many examples of each data type in the Hive distribution, under $HIVE_HOME/examples.

How to do it...

  1. Connect to the edge node edge1.cyrus.com in the cluster and switch to the hadoop user.

  2. Connect using either the Hive or the beeline client and import the data by creating a table, as shown in the following screenshot (the equivalent statements are sketched after these steps):

  3. Now take a look at the HDFS warehouse location. You will see a file named kv1.txt copied there, as shown in the following screenshot:

  4. Describe the table pokes and look at the data, as shown in the following screenshot. What if you...
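
The screenshots are not reproduced here; as a sketch of what the steps above amount to, the classic pokes example from the Hive documentation looks like this, with the path to kv1.txt adjusted to wherever your Hive installation lives and the external table location being a placeholder:

    -- Create a simple managed table and load the sample file shipped with Hive
    CREATE TABLE pokes (foo INT, bar STRING);
    LOAD DATA LOCAL INPATH '/opt/hive/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

    -- Inspect the table definition and the loaded rows
    DESCRIBE pokes;
    SELECT * FROM pokes LIMIT 5;

    -- An external table instead points Hive at data that already exists on HDFS
    CREATE EXTERNAL TABLE pokes_ext (foo INT, bar STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/pokes_ext';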

Partitioning and Bucketing in Hive


In this recipe, we will look at the Partitioning and Bucketing concepts in Hive. Partitioning is a scheme by which we reduce the amount of data that has to be scanned to answer a query.

For example, let's say we keep a record of the entire population in a Hive table, and when someone asks for a person's details, we query it based on the name and gender. This will take a lot of time, as it has to go through many records. Instead, a better approach would be to partition it by country and then query only the specific subset of records. In the RDBMS world, this is known as sharding. Partitioning lets you organize the data in a logical fashion and skip all but the data you need.

Bucketing is based on hashing: the value of a chosen column is hashed, and each row is placed into one of a fixed number of buckets, which are themselves files. In simple terms, it clusters rows on a column into a fixed set of groups so as to reduce the number of partitions. Having a large number of partitions is also a bottleneck...
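
As a hedged illustration of both ideas using the population example above (the column names and the staging table are hypothetical), the DDL and a partition-aware query might look like this:

    -- Partition by country and bucket rows within each partition by name
    CREATE TABLE population (
        name   STRING,
        gender STRING,
        age    INT
    )
    PARTITIONED BY (country STRING)
    CLUSTERED BY (name) INTO 32 BUCKETS;

    -- Populate one partition from a (hypothetical) staging table;
    -- on Hive releases before 2.0 you may also need: SET hive.enforce.bucketing = true;
    INSERT OVERWRITE TABLE population PARTITION (country='IN')
    SELECT name, gender, age FROM population_staging WHERE country_code = 'IN';

    -- Filtering on the partition column prunes all other partitions' files
    SELECT name, gender FROM population WHERE country = 'IN';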

Hive metastore database


In this recipe, we will look at the MySQL database that is used as the metastore database. It is important to understand how Hive-managed tables are represented in the metadata, and how the metadata database is queried to find the location of tables and their partitions.

Getting ready

For this recipe, you must have completed the Partitioning and Bucketing in Hive recipe and have a basic understanding of MySQL commands and SQL query syntax.

How to do it...

  1. Connect to the MySQL server from any node in the cluster using the following command:

    $ mysql -u hadoop -h master1.cyrus.com -p
    
  2. The username and password can be found in the hive-site.xml file.

  3. Switch to the Hive metastore database, which in our case is hive_db. There are many tables in this database that together constitute the metadata for Hive tables.

  4. The VERSION table stores information about the schema version, as shown in the following screenshot:

  5. The TBLS table stores information about the tables, as shown in the following...
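
The screenshots are omitted here; as an illustrative sketch of how this metadata ties together, the following queries (run inside hive_db) join TBLS with DBS and SDS, the storage-descriptor table that records each table's HDFS location, and then do the same for partitions:

    -- Map every Hive table to its database and HDFS location
    SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE, s.LOCATION
    FROM TBLS t
    JOIN DBS d ON t.DB_ID = d.DB_ID
    JOIN SDS s ON t.SD_ID = s.SD_ID;

    -- Partition directories are tracked in the PARTITIONS table
    SELECT t.TBL_NAME, p.PART_NAME, s.LOCATION
    FROM PARTITIONS p
    JOIN TBLS t ON p.TBL_ID = t.TBL_ID
    JOIN SDS s ON p.SD_ID = s.SD_ID;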

Designing Hive with credential store


In this recipe, we will configure the Hive credential store and set up authentication and access control for beeline clients.

This is supported by the Hive beeline client. By default, the authentication is PAM (pluggable authentication modules), but Hive can have its own credential store.

Getting ready

Make sure that you have completed the Using MySQL for Hive metastore recipe for this section and that you understand basic Linux user management.

How to do it...

  1. Connect to the edge node edge1.cyrus.com and switch to the hadoop user.

  2. Edit the hive-site.xml file and add the following lines to it:

    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
    </property>
    
    <property>
        <name>hive.users.in.admin.role</name>
        <value>root</value>
    </property>
    
    <property>
        <name>hive.security.metastore.authorization.manager</name>...
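
The property listing above is truncated. Separately, and depending on the Hive version, secrets such as the metastore password can be kept out of hive-site.xml altogether by using the Hadoop credential provider; a hedged sketch, with the keystore path being a placeholder, looks like this:

    # Create a JCEKS credential store holding the metastore password (prompts for the value)
    $ hadoop credential create javax.jdo.option.ConnectionPassword \
        -provider jceks://file/home/hadoop/hive.jceks

    <!-- Tell Hive where to find the credential store (hive-site.xml) -->
    <property>
        <name>hadoop.security.credential.provider.path</name>
        <value>jceks://file/home/hadoop/hive.jceks</value>
    </property>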

Configuring Flume


In this recipe, we will cover how to configure Flume for data ingestion. Flume is a general-purpose tool for consuming streaming data, such as log streams or Twitter feeds.

In any organization, we might have hundreds of web servers serving web pages, and we may need to quickly parse their logs for ad targeting or to trigger events. These Apache web server logs can be streamed to Flume, from where they can be continuously uploaded to HDFS for processing.

In simple terms, Flume is a distributed, reliable, and efficient way of collecting and aggregating data into HDFS. It is built around the concepts of agents, sources, channels, and sinks, which together make a robust system. There can be multiple sources, channels, and sinks, and the output can go to HDFS, to a non-HDFS filesystem, or to other consumers downstream.

Getting ready

For this recipe, make sure that you have completed the Hadoop cluster setup recipe and have at least a healthy HDFS. Flume can be installed on any node in the cluster, but it is good...
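
The recipe text above is truncated; as a minimal sketch of the kind of agent definition involved, the flume.conf below tails an Apache access log and delivers it to HDFS (the agent name a1, the log path, and the HDFS URI are placeholders):

    # Name the components of this agent
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: tail the web server access log
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/httpd/access_log
    a1.sources.r1.channels = c1

    # Channel: buffer events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: write events to HDFS as plain text
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://master1.cyrus.com:9000/flume/weblogs
    a1.sinks.k1.hdfs.fileType = DataStream

The agent would then be started with something like flume-ng agent --conf conf --conf-file flume.conf --name a1.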

Configure Oozie and workflows


In this recipe, we will configure the Oozie workflow engine and look at some examples of scheduling jobs using the Oozie workflow.

Oozie is a scheduler for managing Hadoop jobs, with the ability to make decisions based on the conditions or states of previous jobs or on the presence of certain files.

Getting ready

Make sure that you have completed the Hadoop cluster configuration recipe and have the edge node configured. HDFS and YARN must be configured and healthy before starting this recipe.

How to do it...

  1. Connect to the edge node edge1.cyrus.com and switch to the hadoop user.

  2. Download the Oozie source package, untar it, and build it:

    $ tar -xzvf oozie-4.1.0.tar.gz
    $ cd oozie-4.1.0
  3. Edit the pom.xml file to make it suitable for the Java and Hadoop versions in use. Change the fields according to the version of Java used; for Hadoop 2.x, the Hadoop version must be 2.3.0:

    <targetJavaVersion>1.8</targetJavaVersion>
    <hadoop.version>2.3.0</hadoop.version>
    <hbase...
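
The pom.xml snippet above is truncated. As a separate, hedged illustration of what a scheduled job looks like once Oozie is running, a minimal workflow.xml with a single shell action might read as follows; the script name ingest.sh and the ${jobTracker}/${nameNode} parameters come from a job.properties file you supply:

    <workflow-app name="ingest-wf" xmlns="uri:oozie:workflow:0.4">
        <start to="ingest"/>
        <action name="ingest">
            <shell xmlns="uri:oozie:shell-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <exec>ingest.sh</exec>
                <file>ingest.sh</file>
            </shell>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Ingest action failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>

The workflow directory is copied to HDFS and submitted with oozie job -config job.properties -run, where job.properties sets nameNode, jobTracker, and oozie.wf.application.path.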