Chapter 6. Data Import/Export Using Sqoop and Flume

This chapter covers the following topics:

  • Importing data from RDBMS to HDFS using Sqoop

  • Exporting data from HDFS to RDBMS

  • Using query operator in Sqoop import

  • Importing data using Sqoop in compressed format

  • Performing Atomic export using Sqoop

  • Importing data into Hive tables using Sqoop

  • Importing data into HDFS from Mainframes

  • Incremental import using Sqoop

  • Creating and executing Sqoop job

  • Importing data from RDBMS to HBase using Sqoop

  • Importing Twitter data into HDFS using Flume

  • Importing data from Kafka into HDFS using Flume

  • Importing web log data into HDFS using Flume

Introduction


In the previous chapter, we talked about advanced analytics options using Apache Hive. In this chapter, we are going to talk about two very important tools, Sqoop and Flume, which do not perform analytics themselves but help us get data into and out of Hadoop. So, let's try to understand more about these technologies.

Importing data from RDBMS to HDFS using Sqoop


Most organizations use an RDBMS as the primary store for their data. To analyze that data with Hadoop, we first need to import it into HDFS. Sqoop is a tool that helps us achieve this with ease; with just a single command, we can import data into HDFS as required. In this recipe, we are going to see how to import data from MySQL into Hadoop using Sqoop.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector from https://dev.mysql.com/downloads/connector/. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...
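
As an orientation for this recipe, a minimal Sqoop import of a MySQL table into HDFS generally takes the following shape; the database name company, the table employee, and the target directory used here are illustrative assumptions rather than fixed values:

    # Import a single MySQL table into an HDFS directory (one mapper for simplicity)
    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee \
    --target-dir /data/employee \
    -m 1

Sqoop translates this into a MapReduce job; increasing -m (the number of mappers) parallelizes the import, provided the table has a suitable primary key or a --split-by column is supplied.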

Exporting data from HDFS to RDBMS


In the previous recipe, we saw how to import data from MySQL into HDFS. Now it is time to see how to export data from HDFS to an RDBMS using Sqoop. Generally, this is required when you want to keep processed data in an RDBMS so that it can be used by reporting tools.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to export data to MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

In the previous recipe, we imported data from MySQL into HDFS. Now we are going to export this data back into a different MySQL table.

  1. To do this, first we...
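
For reference, a typical export that pushes HDFS files into an existing MySQL table follows the shape below; the table employee_export and the export directory are assumptions for this sketch, and the target table must already exist with a compatible schema:

    # Export HDFS records into an existing MySQL table
    sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee_export \
    --export-dir /data/employee \
    -m 1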

Using query operator in Sqoop import


In the previous examples, we saw how to import and export a complete table. Sometimes you may want to work with only selected data; to achieve this, Sqoop provides a query option in which we can write a SQL query describing exactly what should be imported into HDFS.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

  1. We again log in to MySQL and take a look at the employee table. Let's say we want to import only those employees whose IDs are greater than 20. Then we can either use query...
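
As an illustration, a free-form query import generally looks like the sketch below. When --query is used, Sqoop requires the literal token $CONDITIONS in the WHERE clause (it substitutes the split predicates there) and a --split-by column when more than one mapper is used; the names here are assumptions:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --query 'SELECT * FROM employee WHERE id > 20 AND $CONDITIONS' \
    --split-by id \
    --target-dir /data/employee_query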

Importing data using Sqoop in compressed format


In this recipe, we are going to talk about a very important feature of Sqoop that allows us to compress the imported data. We can also choose the compression algorithm.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

  1. Sqoop allows us to import data in a compressed format; we don't need to install anything beyond the default installation. The following command will import the employee data in a compressed format and will store...
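
A sketch of such a command, assuming the same employee table, could look like the following; --compress (or -z) enables compression with gzip by default, and --compression-codec selects another Hadoop codec such as Snappy:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee \
    --target-dir /data/employee_compressed \
    --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec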

Performing Atomic export using Sqoop


From basic database concepts, we know about atomicity, which means that either the complete job is done or nothing is done at all. Similarly, if you are exporting data using Sqoop to a table that is critical from the application's point of view, and you want to make sure that Sqoop exports either all the data present in HDFS or nothing, this recipe will help. In this recipe, we are going to see how to ensure the atomicity of a data export.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to export data to MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...
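
The key to an atomic export in Sqoop is its staging-table support: rows are first written to a staging table and are moved to the target table in a single transaction only if every mapper succeeds. A minimal sketch, assuming a staging table employee_stage with the same schema as the target table employee_export, looks like this:

    sqoop export \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee_export \
    --staging-table employee_stage \
    --clear-staging-table \
    --export-dir /data/employee

If any task fails, the target table is left untouched; --clear-staging-table empties the staging table before the export starts.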

Importing data into Hive tables using Sqoop


Till now, we have seen how to import data into HDFS folders. Now it's time to understand how to import data directly into a Hive table.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

Sqoop provides us the facility to import data directly into a Hive table. This saves us the time of creating a Hive table, specifying a matching schema, loading the data into HDFS, and then creating an external Hive table on top of it. We can do this in a couple of commands:

  1. The following first command...
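
For reference, a Sqoop import that creates and loads a Hive table in one step typically looks like the sketch below; the Hive table name employee_hive is an assumption:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee \
    --hive-import \
    --create-hive-table \
    --hive-table employee_hive \
    -m 1

With --hive-import, Sqoop first copies the data into HDFS and then generates and runs the corresponding CREATE TABLE and LOAD DATA statements in Hive.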

Importing data into HDFS from Mainframes


Mainframes have been one of the most heavily used data sources in the financial domain for a long time. Sqoop supports importing datasets from mainframes into HDFS. This is an important recipe for those who are looking to migrate from mainframes to Hadoop-based systems.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path.

How to do it...

Sqoop provides a tool called import-mainframe, with which we can connect to a mainframe host and select the dataset to be imported. The following command connects to a mainframe host with the provided credentials and then imports the specified dataset into the HDFS target directory:

 sqoop import-mainframe --connect <mainframes-host> \
--dataset <...
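
A complete invocation, with placeholder values, generally has the following shape; the dataset name, credentials, and target directory are assumptions for this sketch:

    sqoop import-mainframe \
    --connect <mainframes-host> \
    --dataset <dataset-name> \
    --username <user> \
    --password <password> \
    --target-dir /data/mainframe_import

Each record of the mainframe dataset is imported as a single text field per line in the target directory.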

Incremental import using Sqoop


In the enterprise world, data grows every single day, hour, minute, and second. It's important to import data incrementally so that our analysis is done on up-to-date data. In this recipe, we are going to learn how to import data incrementally.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

  1. To learn about incremental import, we will create a new table in MySQL called newemployee and add some records to it:

    create table newemployee(id INT NOT NULL AUTO_INCREMENT...
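
Once the table has some rows, an incremental append import is driven by three arguments: --incremental, --check-column, and --last-value. A sketch, assuming id is the monotonically increasing column, looks like this:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table newemployee \
    --target-dir /data/newemployee \
    --incremental append \
    --check-column id \
    --last-value 0

At the end of the run, Sqoop prints the value to pass as --last-value next time, so that only rows added since the previous import are fetched.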

Creating and executing Sqoop job


Sqoop provides the facility to save commonly used commands as jobs so that they can be reused easily. In this recipe, we are going to see how to create a job and execute it.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

  1. We will create a Sqoop job for the incremental import that we performed in the last recipe. The following is the command to create a Sqoop job:

    sqoop job \
    --create incremental_import_job \
    -- import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password...
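
Once a job has been created, the same sqoop job tool can list, inspect, and execute it; the job name below is the one created above:

    sqoop job --list
    sqoop job --show incremental_import_job
    sqoop job --exec incremental_import_job

Unless the password has been stored in the metastore, Sqoop prompts for it when the job is executed. For an incremental job, the metastore also remembers the most recent --last-value, which is what makes saved jobs convenient for scheduled incremental imports.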

Importing data from RDBMS to HBase using Sqoop


These days, lots of people want to make use of the power of NoSQL databases. In order to do so, they need to migrate their existing applications from an RDBMS to a NoSQL database such as HBase. In this recipe, we are going to learn how to import data from MySQL into HBase.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Sqoop installed on it. Here, I am using Sqoop 1.4.6. We also need a MySQL database to be reachable on the network. Installing Sqoop is easy: download the Sqoop tarball, extract it, and add it to the system path. As we are going to import data from MySQL, we also need to download the MySQL connector. Based on your MySQL version, download the right connector JAR and copy it into the lib directory of the Sqoop installation.

How to do it...

First of all, we need to create a table in MySQL and insert some records into it:

mysql> desc employee;
+-------+-------------+------+----...
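
After the source table is in place, importing it into HBase is usually a single command; the HBase table name, column family, and row key column below are assumptions for this sketch:

    sqoop import \
    --connect jdbc:mysql://localhost/company \
    --username root \
    --password <password> \
    --table employee \
    --hbase-table employee \
    --column-family cf \
    --hbase-row-key id \
    --hbase-create-table

Here, --hbase-create-table asks Sqoop to create the HBase table if it does not exist, and each imported row becomes a put keyed by the value of the id column.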

Importing Twitter data into HDFS using Flume


Flume is another tool that helps us import data from various other sources into HDFS. In this recipe, we are going to see how to import Twitter data using Flume. Twitter data is a rich source of information provided by individuals. This data can be used for sentiment analysis of certain products, people, companies, and so on.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Flume installed on it. Here, I am using Flume 1.6.

How to do it...

To import data using Flume, we first need a Twitter account and a set of application credentials. These credentials will be used by the Flume agent to import the data. Flume supports a Twitter source out of the box, so there is nothing more to do than create an account and generate the credentials.

The following is the step-by-step process to generate Twitter authorization tokens:
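
Once the consumer key, consumer secret, access token, and access token secret have been generated, they are plugged into the Flume agent configuration. A minimal sketch using Flume's built-in Twitter source, where the agent name, channel sizing, and HDFS path are assumptions, might look like this:

    TwitterAgent.sources  = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks    = HDFS

    # Twitter source: credentials come from the Twitter application console
    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <consumer-key>
    TwitterAgent.sources.Twitter.consumerSecret = <consumer-secret>
    TwitterAgent.sources.Twitter.accessToken = <access-token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access-token-secret>

    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000

    # HDFS sink: land the tweets under a dedicated directory
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume/twitter
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream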

Importing data from Kafka into HDFS using Flume


Kafka is one of the most popular message queue systems in use these days. Using Flume, we can listen to Kafka topics and put the message data directly into HDFS. The latest Flume version supports importing data from Kafka out of the box. In this recipe, we are going to learn how to import Kafka messages into HDFS.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Flume installed on it. Here, I am using Flume 1.6. We also need Kafka installed and running on one of the machines; I am using kafka_2.10-0.9.0.0.

How to do it...

  1. To import data from Kafka, you first need to have Kafka running on your machine. The following commands start ZooKeeper and then Kafka:

    bin/zookeeper-server-start.sh config/zookeeper.properties
    bin/kafka-server-start.sh config/server.properties
  2. Next, I create a topic called weblogs, which we will be listening to:

    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication...
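
With the topic in place, a Flume agent can consume from it using the Kafka source and write the messages to HDFS. A hedged sketch for Flume 1.6, where the Kafka source is pointed at ZooKeeper and the agent name, channel sizing, and HDFS path are assumptions, might look like this:

    kafka-agent.sources  = kafka-source-1
    kafka-agent.channels = mem-channel-1
    kafka-agent.sinks    = hdfs-sink-1

    # Kafka source: consume the weblogs topic created above
    kafka-agent.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
    kafka-agent.sources.kafka-source-1.zookeeperConnect = localhost:2181
    kafka-agent.sources.kafka-source-1.topic = weblogs
    kafka-agent.sources.kafka-source-1.groupId = flume
    kafka-agent.sources.kafka-source-1.channels = mem-channel-1

    kafka-agent.channels.mem-channel-1.type = memory
    kafka-agent.channels.mem-channel-1.capacity = 10000

    # HDFS sink: write consumed messages as plain text files
    kafka-agent.sinks.hdfs-sink-1.type = hdfs
    kafka-agent.sinks.hdfs-sink-1.channel = mem-channel-1
    kafka-agent.sinks.hdfs-sink-1.hdfs.path = hdfs://localhost:9000/flume/kafka/weblogs
    kafka-agent.sinks.hdfs-sink-1.hdfs.fileType = DataStream

The agent is then started with flume-ng agent --conf conf --conf-file <config-file> --name kafka-agent.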

Importing web log data into HDFS using Flume


One of the most important use cases of Flume is importing log data into HDFS as and when it is produced. In this recipe, we will run a Flume agent that listens to a log file.

Getting ready

To perform this recipe, you should have a running Hadoop cluster with the latest version of Flume installed on it.

How to do it...

  1. To import data into HDFS from web servers, we have to install a Flume agent on each web server instance. The following is the configuration we have to use for the Flume agent (a sketch of the remaining channel and sink settings follows the snippet):

    flume1.sources  = weblogs-source-1
    flume1.channels = hdfs-channel-1
    flume1.sinks    = hdfs-sink-1
    
    # For each source, channel, and sink, set
    # standard properties.
    flume1.sources.weblogs-source-1.type = exec
    flume1.sources.weblogs-source-1.command = tail -f /path/to/log/file.log
    flume1.sources.weblogs-source-1.batchSize = 100
    flume1.sources.weblogs-source-1.channels = hdfs-channel-1
    
    flume1.channels.hdfs-channel-1.type...
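
The channel and sink sections of an agent like this commonly look like the following sketch; the channel sizing, HDFS path, and roll interval are assumptions:

    flume1.channels.hdfs-channel-1.type = memory
    flume1.channels.hdfs-channel-1.capacity = 10000
    flume1.channels.hdfs-channel-1.transactionCapacity = 1000

    # HDFS sink: write the tailed log lines as plain text, rolling files every 5 minutes
    flume1.sinks.hdfs-sink-1.type = hdfs
    flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
    flume1.sinks.hdfs-sink-1.hdfs.path = hdfs://localhost:9000/flume/weblogs
    flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
    flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
    flume1.sinks.hdfs-sink-1.hdfs.rollInterval = 300

The agent can then be started on the web server with flume-ng agent --conf conf --conf-file <config-file> --name flume1.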

Author

Tanmay Deshpande

Tanmay Deshpande is a Hadoop and big data evangelist. He currently works with Schlumberger as a Big Data Architect in Pune, India. He is interested in a wide range of technologies, such as Hadoop, Hive, Pig, NoSQL databases, Mahout, Sqoop, Java, cloud computing, and so on. He has vast experience in application development in various domains, such as oil and gas, finance, telecom, manufacturing, security, and retail. He enjoys solving machine-learning problems and spends his time reading anything that he can get his hands on. He has a great interest in open source technologies and has been promoting them through his talks. Before Schlumberger, he worked with Symantec, Lumiata, and Infosys. Through his innovative thinking and dynamic leadership, he has successfully completed various projects. He regularly blogs on his website http://hadooptutorials.co.in. You can connect with him on LinkedIn at https://www.linkedin.com/in/deshpandetanmay/. He has also authored Mastering DynamoDB, published in August 2014, DynamoDB Cookbook, published in September 2015, Hadoop Real-World Solutions Cookbook - Second Edition, published in March 2016, Hadoop: Data Processing and Modelling, published in August 2016, and Hadoop Blueprints, published in September 2016, all by Packt Publishing.