Chapter 6. Data Import/Export Using Sqoop and Flume

This chapter covers the following topics:

  • Importing data from RDBMS to HDFS using Sqoop

  • Exporting data from HDFS to RDBMS

  • Using query operator in Sqoop import

  • Importing data using Sqoop in compressed format

  • Performing Atomic export using Sqoop

  • Importing data into Hive tables using Sqoop

  • Importing data into HDFS from Mainframes

  • Incremental import using Sqoop

  • Creating and executing Sqoop job

  • Importing data from RDBMS to HBase using Sqoop

  • Importing Twitter data into HDFS using Flume

  • Importing data from Kafka into HDFS using Flume

  • Importing web log data into HDFS using Flume

Introduction


In the previous chapter, we talked about advanced analytics options using Apache Hive. In this chapter, we are going to talk about two very important tools, Sqoop and Flume, which do not perform analytics themselves but help us get data into and out of Hadoop. So, let's try to understand more about these technologies.

Importing data from RDBMS to HDFS using Sqoop


Most organizations use an RDBMS as the primary store for their data. To analyze that data with Hadoop, we first need to import it into HDFS. Sqoop is a tool that helps us achieve this with ease; with just a single command, we can import data into HDFS as required. In this recipe, we are going to see how to import data from MySQL into Hadoop using Sqoop.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector from https://dev.mysql.com/downloads/connector/. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...
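
As an orientation for this recipe, a minimal Sqoop import of a MySQL table into HDFS generally takes the following shape; the database name company, the table employee, and the target directory used here are illustrative assumptions rather than fixed values:

    # Import a single MySQL table into an HDFS directory (one mapper for simplicity)
    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee \
    --target-dir /data/employee \
    -m 1

Sqoop translates this into a MapReduce job; increasing -m (the number of mappers) parallelizes the import, provided the table has a suitable primary key or a --split-by column is supplied.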

Exporting data from HDFS to RDBMS


In the previous recipe, we saw how to import data from MySQL into HDFS. Now it is time to see how to export data from HDFS to an RDBMS using Sqoop. Generally, this is required when you want to keep processed data in an RDBMS so that it can be used by reporting tools.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to export data to MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

In the previous recipe, we imported data from MySQL into HDFS. Now we are going to export this data back into a different MySQL table.

  1. To do this, first we...
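
For reference, a typical export that pushes HDFS files into an existing MySQL table follows the shape below; the table employee_export and the export directory are assumptions for this sketch, and the target table must already exist with a compatible schema:

    # Export HDFS records into an existing MySQL table
    sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee_export \
    --export-dir /data/employee \
    -m 1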

Using query operator in Sqoop import


In the previous examples, we saw how to import and export a complete table. Sometimes you may want to work with only selected data; to achieve this, Sqoop provides a query option in which we can write a SQL query describing exactly what should be imported into HDFS.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

  1. We again log in to MySQL and take a look at the employee table. Let's say we want to import only those employees whose IDs are greater than 20. Then we can either use query...
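
As an illustration, a free-form query import generally looks like the sketch below. When --query is used, Sqoop requires the literal token $CONDITIONS in the WHERE clause (it substitutes the split predicates there) and a --split-by column when more than one mapper is used; the names here are assumptions:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --query 'SELECT * FROM employee WHERE id > 20 AND $CONDITIONS' \
    --split-by id \
    --target-dir /data/employee_query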

Importing data using Sqoop in compressed format


In this recipe, we are going to talk about a very important feature of Sqoop that allows us to compress the imported data. We can also choose the compression algorithm.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

  1. Sqoop allows us to import data in a compressed format; we don't need to install anything beyond the default installation. The following command will import the employee data in a compressed format and will store...
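
A sketch of such a command, assuming the same employee table, could look like the following; --compress (or -z) enables compression with gzip by default, and --compression-codec selects another Hadoop codec such as Snappy:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee \
    --target-dir /data/employee_compressed \
    --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec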

Performing Atomic export using Sqoop


From basic database concepts, we know about atomicity, which means that either the complete job is done or nothing is done at all. Similarly, if you are exporting data using Sqoop to a table that is critical from the application's point of view, and you want to make sure that Sqoop exports either all the data present in HDFS or nothing, this recipe will help. In this recipe, we are going to see how to ensure the atomicity of a data export.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to export data to MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...
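
The key to an atomic export in Sqoop is its staging-table support: rows are first written to a staging table and are moved to the target table in a single transaction only if every mapper succeeds. A minimal sketch, assuming a staging table employee_stage with the same schema as the target table employee_export, looks like this:

    sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee_export \
    --staging-table employee_stage \
    --clear-staging-table \
    --export-dir /data/employee

If any task fails, the target table is left untouched; --clear-staging-table empties the staging table before the export starts.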

Importing data into Hive tables using Sqoop


Till now, we have seen how to import data into HDFS folders. Now it's time to understand how to import data directly into a Hive table.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

Sqoop provides us the facility to import data directly into a Hive table. This saves us the time of creating a Hive table, specifying a matching schema, loading the data into HDFS, and then creating an external Hive table on top of it. We can do this in a couple of commands:

  1. The following first command...
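
For reference, a Sqoop import that creates and loads a Hive table in one step typically looks like the sketch below; the Hive table name employee_hive is an assumption:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee \
    --hive-import \
    --create-hive-table \
    --hive-table employee_hive \
    -m 1

With --hive-import, Sqoop first copies the data into HDFS and then generates and runs the corresponding CREATE TABLE and LOAD DATA statements in Hive.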

Importing data into HDFS from Mainframes


Mainframes have been one of the most heavily used data sources in the financial domain for a long time. Sqoop supports importing datasets from mainframes into HDFS. This is an important recipe for those who are looking to migrate from mainframes to Hadoop-based systems.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path.

How to do it...

Sqoop provides a tool called import-mainframe, with which we can connect to a mainframe host and select the dataset to be imported. The following command connects to a mainframe host with the provided credentials and then imports the specified dataset into the HDFS target directory:

 sqoop import-mainframe --connect <mainframes-host> \
--dataset <...
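
A complete invocation, with placeholder values, generally has the following shape; the dataset name, credentials, and target directory are assumptions for this sketch:

    sqoop import-mainframe \
    --connect <mainframes-host> \
    --dataset <dataset-name> \
    --username <user> \
    --password <password> \
    --target-dir /data/mainframe_import

Each record of the mainframe dataset is imported as a single text field per line in the target directory.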

Incremental import using Sqoop


In the enterprise world, data grows every single day, hour, minute, and second. It's important to import data incrementally so that our analysis is done on up-to-date data. In this recipe, we are going to learn how to import data incrementally.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

  1. To learn about incremental import, we will create a new table in MySQL called newemployee and add some records to it:

    create table newemployee(id INT NOT NULL AUTO_INCREMENT...
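
Once the table has some rows, an incremental append import is driven by three arguments: --incremental, --check-column, and --last-value. A sketch, assuming id is the monotonically increasing column, looks like this:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table newemployee \
    --target-dir /data/newemployee \
    --incremental append \
    --check-column id \
    --last-value 0

At the end of the run, Sqoop prints the value to pass as --last-value next time, so that only rows added since the previous import are fetched.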

Creating and executing Sqoop job


Sqoop provides the facility to save commonly used commands as jobs so that they can be reused easily. In this recipe, we are going to see how to create a job and execute it.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

  1. We will create a Sqoop job for the incremental import that we performed in the last recipe. The following is the command to create a Sqoop job:

    sqoop job \
    --create incremental_import_job \
    -- import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password...
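
Once a job has been created, the same sqoop job tool can list, inspect, and execute it; the job name below is the one created above:

    sqoop job --list
    sqoop job --show incremental_import_job
    sqoop job --exec incremental_import_job

Unless the password has been stored in the metastore, Sqoop prompts for it when the job is executed. For an incremental job, the metastore also remembers the most recent --last-value, which is what makes saved jobs convenient for scheduled incremental imports.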

Importing data from RDBMS to HBase using Sqoop


These days, lots of people want to make use of the power of NoSQL databases. In order to do so, they need to migrate their existing applications from an RDBMS to a NoSQL database such as HBase. In this recipe, we are going to learn how to import data from MySQL into HBase.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

First of all, we need to create a table in MySQL and insert some records into it:

mysql> desc employee;
+-------+-------------+------+----...
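
After the source table is in place, importing it into HBase is usually a single command; the HBase table name, column family, and row key column below are assumptions for this sketch:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee \
    --hbase-table employee \
    --column-family cf \
    --hbase-row-key id \
    --hbase-create-table

Here, --hbase-create-table asks Sqoop to create the HBase table if it does not exist, and each imported row becomes a put keyed by the value of the id column.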

Importing Twitter data into HDFS using Flume


Flume is another tool that helps us import data from various other sources into HDFS. In this recipe, we are going to see how to import Twitter data using Flume. Twitter data is a rich source of information provided by individuals. This data can be used for sentiment analysis of certain products, people, companies, and so on.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Flume installed on it. Here, I am using Flume 1.6.

How to do it...

To import data using Flume, we first need a Twitter account and a set of application credentials. These credentials will be used by the Flume agent to import the data. Flume supports a Twitter source out of the box, so there is nothing more to do than create an account and generate the credentials.

The following is the step-by-step process to generate Twitter authorization tokens:
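
Once the consumer key, consumer secret, access token, and access token secret have been generated, they are plugged into the Flume agent configuration. A minimal sketch using Flume's built-in Twitter source, where the agent name, channel sizing, and HDFS path are assumptions, might look like this:

    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    # Twitter source: credentials come from the Twitter application console
    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <consumer-key>
    TwitterAgent.sources.Twitter.consumerSecret = <consumer-secret>
    TwitterAgent.sources.Twitter.accessToken = <access-token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access-token-secret>

    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000

    # HDFS sink: land the tweets under a dedicated directory
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume/twitter
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream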

Importing data from Kafka into HDFS using Flume


Kafka is one of the most popular message queue systems in use these days. Using Flume, we can listen to Kafka topics and put the message data directly into HDFS. The latest Flume version supports importing data from Kafka out of the box. In this recipe, we are going to learn how to import Kafka messages into HDFS.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Flume installed on it. Here, I am using Flume 1.6. We also need Kafka installed and running on one of the machines; I am using kafka_2.10-0.9.0.0.

How to do it...

  1. To import data from Kafka, you first need to have Kafka running on your machine. The following commands start ZooKeeper and then Kafka:

    bin/zookeeper-server-start.sh config/zookeeper.properties
    bin/kafka-server-start.sh config/server.properties
  2. Next, I create a topic called weblogs, which we will be listening to:

    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication...
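
With the topic in place, a Flume agent can consume from it using the Kafka source and write the messages to HDFS. A hedged sketch for Flume 1.6, where the Kafka source is pointed at ZooKeeper and the agent name, channel sizing, and HDFS path are assumptions, might look like this:

    kafka-agent.sources  = kafka-source-1
    kafka-agent.channels = mem-channel-1
    kafka-agent.sinks    = hdfs-sink-1

    # Kafka source: consume the weblogs topic created above
    kafka-agent.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
    kafka-agent.sources.kafka-source-1.zookeeperConnect = localhost:2181
    kafka-agent.sources.kafka-source-1.topic = weblogs
    kafka-agent.sources.kafka-source-1.groupId = flume
    kafka-agent.sources.kafka-source-1.channels = mem-channel-1

    kafka-agent.channels.mem-channel-1.type = memory
    kafka-agent.channels.mem-channel-1.capacity = 10000

    # HDFS sink: write consumed messages as plain text files
    kafka-agent.sinks.hdfs-sink-1.type = hdfs
    kafka-agent.sinks.hdfs-sink-1.channel = mem-channel-1
    kafka-agent.sinks.hdfs-sink-1.hdfs.path = hdfs://localhost:9000/flume/kafka/weblogs
    kafka-agent.sinks.hdfs-sink-1.hdfs.fileType = DataStream

The agent is then started with flume-ng agent --conf conf --conf-file <config-file> --name kafka-agent.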

Importing web log data into HDFS using Flume


One of the most important use cases of Flume is importing log data into HDFS as and when it is produced. In this recipe, we will run a Flume agent that listens to a log file.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Flume installed on it.

How to do it...

  1. To import data into HDFS from web servers, we have to install a Flume agent on each web server instance. The following is the configuration we have to use for the Flume agent (a sketch of the remaining channel and sink settings follows the snippet):

    flume1.sources  = weblogs-source-1
    flume1.channels = hdfs-channel-1
    flume1.sinks    = hdfs-sink-1
    
    # For each source, channel, and sink, set
    # standard properties.
    flume1.sources.weblogs-source-1.type = exec
    flume1.sources.weblogs-source-1.command = tail -f /path/to/log/file.log
    flume1.sources.weblogs-source-1.batchSize = 100
    flume1.sources.weblogs-source-1.channels = hdfs-channel-1
    
    flume1.channels.hdfs-channel-1.type...
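
The channel and sink sections of an agent like this commonly look like the following sketch; the channel sizing, HDFS path, and roll interval are assumptions:

    flume1.channels.hdfs-channel-1.type = memory
    flume1.channels.hdfs-channel-1.capacity = 10000
    flume1.channels.hdfs-channel-1.transactionCapacity = 1000

    # HDFS sink: write the tailed log lines as plain text, rolling files every 5 minutes
    flume1.sinks.hdfs-sink-1.type = hdfs
    flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
    flume1.sinks.hdfs-sink-1.hdfs.path = hdfs://localhost:9000/flume/weblogs
    flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
    flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
    flume1.sinks.hdfs-sink-1.hdfs.rollInterval = 300

The agent can then be started on the web server with flume-ng agent --conf conf --conf-file <config-file> --name flume1.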

Author

Tanmay Deshpande

Tanmay Deshpande is a Hadoop and big data evangelist. He currently works with Schlumberger as a Big Data Architect in Pune, India. He is interested in a wide range of technologies, such as Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, cloud computing, and so on. He has vast experience in application development in various domains, such as oil and gas, finance, telecom, manufacturing, security, and retail. He enjoys solving machine-learning problems and spends his time reading anything that he can get his hands on. He has a great interest in open source technologies and has been promoting them through his talks. Before Schlumberger, he worked with Symantec, Lumiata, and Infosys. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. He regularly blogs on his website http://hadooptutorials.co.in. You can connect with him on LinkedIn at https://www.linkedin.com/in/deshpandetanmay/. He has also authored Mastering DynamoDB, published in August 2014, DynamoDB Cookbook, published in September 2015, Hadoop Real-World Solutions Cookbook - Second Edition, published in March 2016, Hadoop: Data Processing and Modelling, published in August 2016, and Hadoop Blueprints, published in September 2016, all by Packt Publishing.