
You're reading from  Data Lake for Enterprises

Product type: Book
Published in: May 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787281349
Edition: 1st Edition
Authors (3):
Vivek Mishra

Vivek Mishra is an IT professional with more than nine years of experience in technologies such as Java, J2EE, Hibernate, SCA4J, Mule, Spring, Cassandra, HBase, MongoDB, Redis, Hive, and Hadoop. He has contributed to open source projects such as Apache Cassandra and is the lead committer for Kundera (a JPA 2.0-compliant object-datastore mapping library for NoSQL datastores such as Cassandra, HBase, MongoDB, and Redis). In his previous roles, Mr. Mishra has enjoyed long-lasting partnerships with some of the most recognizable names in the SCM, banking, and finance industries, employing industry-standard, full software life cycle methodologies such as Agile and Scrum. He is currently employed with Impetus Infotech Pvt. Ltd. He has spoken at CloudCamp and the NASSCOM Big Data seminar, and is an active blogger who can be followed at mevivs.wordpress.com.

Tomcy John

Tomcy John lives in Dubai (United Arab Emirates), hailing from Kerala (India), and is an enterprise Java specialist with a degree in Engineering (B Tech) and over 14 years of experience in several industries. He's currently working as principal architect at Emirates Group IT, in their core architecture team. Prior to this, he worked with Oracle Corporation and Ernst & Young. His main specialization is in building enterprise-grade applications, and he acts as chief mentor and evangelist to facilitate incorporating new technologies as corporate standards in the organization. Outside of his work, Tomcy works very closely with young developers and engineers as a mentor and speaks at various forums as a technical evangelist on many topics ranging from web and middleware all the way to various persistence stores.

Pankaj Misra

Pankaj Misra has been a technology evangelist, holding a bachelor's degree in engineering, with over 16 years of experience across multiple business domains and technologies. He has been working with Emirates Group IT since 2015, and has worked with various other organizations in the past. He specializes in architecting and building multi-stack solutions and implementations. He has also been a speaker at technology forums in India and has built products with scale-out architecture that support high-volume, near-real-time data processing and near-real-time analytics.


Chapter 5. Data Acquisition of Batch Data using Apache Sqoop

Now that we have discussed some of the essential elements of a data lake in the context of the Lambda Architecture, the complete story of a data lake must begin with capturing data from the source systems, a process we refer to as data acquisition.

Data can be acquired from various systems, in which it may exist in various forms. Each of these data formats needs a specific way of handling so that the data can be acquired from the source system and put to use within the boundaries of the data lake.

In this chapter, we will specifically look at acquiring data from relational data sources, such as a Relational Database Management System (RDBMS), and discuss specific patterns for doing so. When it comes to capturing data from relational data sources, Apache Sqoop is one of the primary frameworks; it has been widely used because it is part of the Hadoop ecosystem and has been very dominant...

Context in data lake - data acquisition


The process of inducting data from various source systems is called data acquisition. In our data lake, we have a layer defined (in fact, the first one) whose sole responsibility is to take care of this.

The main technology that we see doing this job of inducting data into our data lake is Apache Sqoop. The following sections of this chapter cover Sqoop in detail so that you get a clear picture of this technology, as well as of the data acquisition layer itself.

Data acquisition layer

In Chapter 2, Comprehensive Concepts of a Data Lake, you got a glimpse of the data acquisition layer. This layer's responsibility is to gather data from various source systems and induct it into the data lake. The following figure will refresh your memory and give you a good pictorial view of this layer:

Figure 01: Data lake - data acquisition layer

The acquisition layer should be able to handle the following:

  • Bulk data: Bulk data in the...

Why Apache Sqoop


Apache Sqoop is one of the most commonly used tools for data transfer to and from Apache Hadoop.

In the data acquisition layer, we have chosen Apache Sqoop as the main technology. There are multiple options that could be used in this layer, and other technologies could be swapped in place of Sqoop; these options are discussed to some extent in the last section of this chapter.

Apache Sqoop is one of the main technologies used to transfer data between Hadoop and structured data stores such as RDBMSes, traditional data warehouses, and NoSQL data stores. Hadoop finds it very hard to talk to these traditional stores directly, and Sqoop makes that integration easy.

Sqoop handles the bulk transfer of data from these stores very efficiently, which is the reason it was chosen as the technology for this layer.

Sqoop also integrates easily with Hadoop-based systems such as Apache Oozie, Apache HBase, and Apache Hive.
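
As a rough, hedged sketch of the Hive integration (not the exact commands used later in this book), a table can be pulled straight into a Hive table with the --hive-import switch. The connection string, credentials, and table names below are illustrative placeholders:

    # Import an RDBMS table and register it as a Hive table in one step
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user -P \
      --table customers \
      --hive-import \
      --hive-table customers

Sqoop first stages the data in HDFS and then creates and loads the Hive table, which is what makes it convenient to chain with Oozie workflows or downstream Hive processing.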

Apache Oozie is a server...

Workings of Sqoop


For your data lake, you will definitely have to ingest data from traditional applications and data sources. The ingested data, being big, will have to land in the Hadoop store. Apache Sqoop is one technology that lets you ingest data from these traditional enterprise data stores into Hadoop with ease.

SQL to Hadoop == SQOOP

The figure below (Figure 03) shows the basic workings of Apache Sqoop. It provides tools to import data from an RDBMS into the Hadoop filesystem, as well as tools to export data from the Hadoop filesystem back to an RDBMS.

Figure 03: Basic workings of Sqoop

In our use case, we will be importing the data stored in an RDBMS (PostgreSQL) into the Hadoop Distributed File System (HDFS). We will not be looking at Sqoop's export capability in detail, but we will briefly cover that aspect in this chapter as well, so that you have a good understanding of the different capabilities of this tool.
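
To make the direction of data flow concrete, a minimal import invocation could look like the following sketch. The database name, table, and target directory are illustrative placeholders, not the exact values used in this book's use case:

    # Pull a PostgreSQL table into HDFS; in Sqoop terms this is an "import"
    # because the data flows into HDFS
    sqoop import \
      --connect jdbc:postgresql://localhost:5432/sourcedb \
      --username sqoop_user -P \
      --table customer \
      --target-dir /data/lake/raw/customer \
      -m 1

The -m 1 option runs a single map task; for larger tables the number of mappers is typically increased, as touched on later in this chapter.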

As of writing this book, Sqoop has two variations (flavours) called by its major...

Sqoop connectors


A Sqoop connector allows a Sqoop job to:

  • Connect to the desired database system (for both import and export)
  • Extract data from the database system (import), and
  • Load data into the database system (export)

Apache Sqoop can be extended through plugin code that specializes in data transfer with a particular database system. This capability is part of Sqoop's extension framework and can be added to any installation of Sqoop. Sqoop 1 has this capability, and Sqoop 2 extends this aspect even further and adds many new features (the earlier comparison section covered this). Sqoop 2 offers better integration through well-defined connector APIs.
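
As a hedged illustration of how the connector and driver surface to the user in Sqoop 1, both are selected from the command line: the JDBC connect string lets Sqoop choose a matching connector, while --driver can name an explicit JDBC driver class (which makes Sqoop fall back to its generic JDBC connector). Host, database, and table names below are placeholders:

    # The jdbc:postgresql:// scheme lets Sqoop pick an appropriate connector;
    # --driver forces a specific JDBC driver class and the generic connector
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --driver org.postgresql.Driver \
      --username sqoop_user -P \
      --table orders \
      --target-dir /data/lake/raw/orders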

When Sqoop is invoked to transfer data, two components come into play, namely:

  • Driver: JDBC is the main mechanism by which Sqoop connects to an RDBMS. In the context of Sqoop, the driver refers to the JDBC driver. JDBC is a specification shipped with the Java Development Kit (JDK) consisting of various...

Sqoop support for HDFS


Sqoop is natively built for HDFS import and export; however, architecturally it can support other source and target data stores for data imports and exports. In fact, the convention for the words import and export is entirely with respect to HDFS: data coming into HDFS is an import, and data going out of HDFS is an export. Sqoop also supports incremental imports and exports by using an additional attribute/field to track database increments.
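
A hedged sketch of such an incremental import, assuming an auto-incrementing id column serves as the tracking field (connection details and the last-value are placeholders):

    # Only rows with id greater than the previous high-water mark are imported
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user -P \
      --table orders \
      --target-dir /data/lake/raw/orders \
      --incremental append \
      --check-column id \
      --last-value 100000

Sqoop prints the new high-water mark at the end of the run, which can be fed into the next invocation (or managed automatically with a saved Sqoop job).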

Sqoop also supports a number of file formats for optimized storage, such as Apache Avro, ORC, and Parquet. Both Parquet and Avro have been very popular file formats for HDFS, while ORC offers better performance and compression. As a trade-off, however, Parquet and Avro are the relatively more preferred formats because of their maintainability and the recent enhancements to these formats in HDFS, such as support for multi-value fields and search patterns.
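
As a small sketch (with placeholder connection details), the storage format is chosen with a command-line switch in Sqoop 1:

    # Write the imported data as Avro data files instead of plain text
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user -P \
      --table customer \
      --target-dir /data/lake/raw/customer_avro \
      --as-avrodatafile
    # --as-parquetfile would produce Parquet output instead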

Avro is a remote procedure call and data serialization framework developed...

Sqoop working example


We will be using the Google Cloud Platform for running the whole use case covered in this book. Screenshots and code are presented with this in mind, so that by the end of the book the reader has a fully functioning data lake in the cloud, which can gradually be connected to the real databases existing in the enterprise.

As this is the first chapter dealing with installation and code, it installs certain software, tools, technologies, and libraries that will be referred to in subsequent chapters. In the context of Sqoop alone, some of these installations and commands wouldn't be required, but they are needed to run everything in the cloud on a clean node with nothing installed on it.

These examples have been prepared and tested on CentOS 7, and this will be our platform for all the examples covered in this book.

Installation and Configuration

For all the installations discussed in this book, we are following some basic conventions...
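
As a rough, hedged sketch of the kind of setup this section walks through (the Sqoop version, paths, and JDBC driver jar name are assumptions for illustration, not necessarily the exact ones used in this book):

    # Unpack a Sqoop 1.4.x binary distribution and wire it into the environment
    tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt
    export SQOOP_HOME=/opt/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
    export PATH=$PATH:$SQOOP_HOME/bin
    # Sqoop expects the Hadoop installation to be discoverable as well
    export HADOOP_COMMON_HOME=/opt/hadoop
    export HADOOP_MAPRED_HOME=/opt/hadoop
    # Place the PostgreSQL JDBC driver where Sqoop can load it
    cp postgresql-42.x.jar $SQOOP_HOME/lib/
    # Sanity check
    sqoop version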

When to use Sqoop


Apache Sqoop can be employed for many of the data transfer requirements in a data lake that has HDFS as the main storage for incoming data from various systems. The following bullet points give some of the cases where Apache Sqoop makes the most sense:

  • For regular batch and micro-batch transfers of data between an RDBMS and Hadoop (HDFS/Hive/HBase), use Apache Sqoop. Apache Sqoop is one of the main and most widely used technologies in the data acquisition layer.
  • For transferring data from NoSQL data stores like MongoDB and Cassandra into the Hadoop filesystem.
  • For enterprises with a good number of applications whose stores are RDBMS-based, Sqoop is the best option to transfer data into a data lake.
  • Hadoop is the de facto standard for storing massive data. Sqoop allows you to transfer data from a traditional database into HDFS with ease.
  • Use Sqoop when performance is required, as it is able to split and parallelize data transfer (see the sketch after this list).
  • Sqoop has a concept of connectors and, if your enterprise...
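
As a minimal sketch of the parallelism point above, the degree of parallelism and the column used to split the work are controlled per job; the table and column names here are placeholders:

    # Four parallel map tasks, with the input range split on the primary key
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user -P \
      --table transactions \
      --target-dir /data/lake/raw/transactions \
      --split-by transaction_id \
      -m 4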

When not to use Sqoop


Sqoop is the best suited tool when your data lives in database systems such as Oracle, MySQL, PostgreSQL, or Teradata, but it is not a good fit for event-driven data handling. For event-driven data, it's more apt to go for Apache Flume (Chapter 7, Messaging Layer with Apache Kafka, covers Flume in detail) rather than Sqoop. To summarize, the following are the cases where Sqoop should not be used:

  • For event-driven data.
  • For handling and transferring data that is streamed from various business applications, for example, data streamed using JMS from a source system.
  • For handling real-time data, as opposed to regular bulk/batch data and micro-batches.
  • For handling data in the form of log files generated by the various web servers on which the business application is hosted.
  • If the source data store should not be put under pressure while a Sqoop job is being executed, it's better to avoid Sqoop. Also, if the bulk/batch loads have high volumes of data, the pressure that they would put on...

Real-time Sqooping: a possibility?


For real-time data ingestion, we don't think Sqoop is the right choice. But for near real-time ingestion (intervals of not less than 5 minutes, there being no particular reason for choosing 5 minutes), Sqoop could be used to transfer data. Since these runs are more frequent, the data volume should also be such that Sqoop can handle it and complete each run before the next execution starts.
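
One hedged way to approximate near real-time transfers is a saved Sqoop job using incremental append, triggered on a schedule; the job name, connection details, password file, and interval below are illustrative assumptions:

    # Define a reusable incremental job; a saved job remembers the last-value
    # between runs, so each execution picks up only new rows
    sqoop job --create orders_incremental -- import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user \
      --password-file /user/sqoop/.pg.password \
      --table orders \
      --target-dir /data/lake/raw/orders \
      --incremental append \
      --check-column id \
      --last-value 0

    # Trigger it every 5 minutes, for example from cron or an Oozie coordinator:
    # */5 * * * * sqoop job --exec orders_incremental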

Other options


For the bulk/batch transfer of data from an RDBMS to the Hadoop filesystem, there aren't many options in the open source world. However, there are other possible ways to transfer data from an RDBMS to Hadoop, and this section gives you, the reader, some of those options so that they can be evaluated against enterprise demands and brought into the data lake if found suitable.

Native big data connectors

Most of the popular databases have connectors with which data can be extracted and loaded onto the Hadoop filesystem. For example, if your RDBMS is Oracle, Oracle provides a suite of products that integrate the Oracle database with Apache Hadoop. The figure below (Figure 25) shows the full suite of Oracle Big Data connector products and what they do (details taken from www.oracle.com).

Figure 25: Oracle Big Data connector suite of products

Similar to Oracle, the MySQL RDBMS has MySQL Applier, its native big data connector, which can be used...

Summary


In this chapter, we started introducing, or rather mapping, technologies onto the various data lake layers, beginning with the data acquisition layer. We started with the layer's definition, then listed the reasons for choosing Sqoop by detailing both its advantages and disadvantages. We then covered Sqoop and its architecture in detail, including its two important versions, namely version 1 and version 2. After this theoretical section, we delved into the actual workings of Sqoop by walking through the setup required to run it, and then dove deep into our SCV use case and what we are achieving with Sqoop.

After reading this chapter, you should have a clear understanding of the data acquisition layer in our data lake architecture. You should also have in-depth knowledge of Apache Sqoop and the reasons for choosing it as the technology of choice for implementation. You would...
