
You're reading from  Data Lake for Enterprises

Product type: Book
Published in: May 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781787281349
Edition: 1st Edition
Authors (3):
Vivek Mishra

Vivek Mishra is an IT professional with more than nine years of experience in technologies such as Java, J2EE, Hibernate, SCA4J, Mule, Spring, Cassandra, HBase, MongoDB, Redis, Hive, and Hadoop. He has contributed to open source projects such as Apache Cassandra and is the lead committer for Kundera (a JPA 2.0-compliant object-datastore mapping library for NoSQL datastores such as Cassandra, HBase, MongoDB, and Redis). In his previous roles, Mr. Mishra has enjoyed long-lasting partnerships with some of the most recognizable names in the SCM, banking, and finance industries, employing industry-standard, full software life cycle methodologies such as Agile and Scrum. He is currently employed with Impetus Infotech Pvt. Ltd. He has spoken at CloudCamp and the NASSCOM Big Data seminar, and is an active blogger who can be followed at mevivs.wordpress.com.

Tomcy John

Tomcy John lives in Dubai (United Arab Emirates), hailing from Kerala (India), and is an enterprise Java specialist with a degree in Engineering (B Tech) and over 14 years of experience in several industries. He's currently working as principal architect at Emirates Group IT, in their core architecture team. Prior to this, he worked with Oracle Corporation and Ernst & Young. His main specialization is in building enterprise-grade applications, and he acts as chief mentor and evangelist to facilitate incorporating new technologies as corporate standards in the organization. Outside of his work, Tomcy works very closely with young developers and engineers as a mentor and speaks at various forums as a technical evangelist on many topics ranging from web and middleware all the way to various persistence stores.

Pankaj Misra

Pankaj Misra has been a technology evangelist, holding a bachelor's degree in engineering, with over 16 years of experience across multiple business domains and technologies. He has been working with Emirates Group IT since 2015, and has worked with various other organizations in the past. He specializes in architecting and building multi-stack solutions and implementations. He has also been a speaker at technology forums in India and has built products with scale-out architecture that support high-volume, near-real-time data processing and near-real-time analytics.


Chapter 5. Data Acquisition of Batch Data using Apache Sqoop

Now that we have discussed some of the essential elements of a data lake in the context of the Lambda Architecture, the complete story of a data lake must begin with capturing data from the source systems, a process we refer to as data acquisition.

Data can be acquired from various systems, in which it may exist in various forms. Each of these data formats needs a specific way of handling so that the data can be acquired from the source system and put to use within the boundaries of the data lake.

In this chapter, we will specifically look at acquiring data from relational data sources, such as a Relational Database Management System (RDBMS), and discuss specific patterns for doing so. When it comes to capturing data from relational data sources, Apache Sqoop is one of the primary frameworks; it has been widely used because it is part of the Hadoop ecosystem and has been very dominant...

Context in data lake - data acquisition


The process of inducting data from various source systems is called data acquisition. In our data lake, we have a layer defined (in fact, the first one) whose sole responsibility is to take care of this.

The main technology that we see doing this job of inducting data into our data lake is Apache Sqoop. The following sections of this chapter cover Sqoop in detail so that you get a clear picture of this technology, as well as of the data acquisition layer itself.

Data acquisition layer

In Chapter 2, Comprehensive Concepts of a Data Lake, you got a glimpse of the data acquisition layer. This layer's responsibility is to gather data from various source systems and induct it into the data lake. The following figure will refresh your memory and give you a good pictorial view of this layer:

Figure 01: Data lake - data acquisition layer

The acquisition layer should be able to handle the following:

  • Bulk data: Bulk data in the...

Why Apache Sqoop


Apache Sqoop is one of the most commonly used tools for data transfer to and from Apache Hadoop.

In the data acquisition layer, we have chosen Apache Sqoop as the main technology. There are multiple options that could be used in this layer, and other technologies could be swapped in place of Sqoop; these options are discussed to some extent in the last section of this chapter.

Apache Sqoop is one of the main technologies used to transfer data between Hadoop and structured data stores such as RDBMSes, traditional data warehouses, and NoSQL data stores. Hadoop finds it very hard to talk to these traditional stores directly, and Sqoop makes that integration easy.

Sqoop handles the bulk transfer of data from these stores very efficiently, which is the reason it was chosen as the technology for this layer.

Sqoop also integrates easily with Hadoop-based systems such as Apache Oozie, Apache HBase, and Apache Hive.
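
As a rough, hedged sketch of the Hive integration (not the exact commands used later in this book), a table can be pulled straight into a Hive table with the --hive-import switch. The connection string, credentials, and table names below are illustrative placeholders:

    # Import an RDBMS table and register it as a Hive table in one step
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user -P \
      --table customers \
      --hive-import \
      --hive-table customers

Sqoop first stages the data in HDFS and then creates and loads the Hive table, which is what makes it convenient to chain with Oozie workflows or downstream Hive processing.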

Apache Oozie is a server...

Workings of Sqoop


For your data lake, you will definitely have to ingest data from traditional applications and data sources. The ingested data, being big, will have to land in the Hadoop store. Apache Sqoop is one technology that lets you ingest data from these traditional enterprise data stores into Hadoop with ease.

SQL to Hadoop == SQOOP

The figure below (Figure 03) shows the basic workings of Apache Sqoop. It provides tools to import data from an RDBMS into the Hadoop filesystem, as well as tools to export data from the Hadoop filesystem back to an RDBMS.

Figure 03: Basic workings of Sqoop

In our use case, we will be importing the data stored in an RDBMS (PostgreSQL) into the Hadoop Distributed File System (HDFS). We will not be looking at Sqoop's export capability in detail, but we will briefly cover that aspect in this chapter as well, so that you have a good understanding of the different capabilities of this tool.
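
To make the direction of data flow concrete, a minimal import invocation could look like the following sketch. The database name, table, and target directory are illustrative placeholders, not the exact values used in this book's use case:

    # Pull a PostgreSQL table into HDFS; in Sqoop terms this is an "import"
    # because the data flows into HDFS
    sqoop import \
      --connect jdbc:postgresql://localhost:5432/sourcedb \
      --username sqoop_user -P \
      --table customer \
      --target-dir /data/lake/raw/customer \
      -m 1

The -m 1 option runs a single map task; for larger tables the number of mappers is typically increased, as touched on later in this chapter.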

As of writing this book, Sqoop has two variations (flavours) called by its major...

Sqoop connectors


A Sqoop connector allows a Sqoop job to:

  • Connect to the desired database system (for both import and export)
  • Extract data from the database system (import), and
  • Load data into the database system (export)

Apache Sqoop can be extended through plugin code that specializes in data transfer with a particular database system. This capability is part of Sqoop's extension framework and can be added to any installation of Sqoop. Sqoop 1 has this capability, and Sqoop 2 extends this aspect even further and adds many new features (the earlier comparison section covered this). Sqoop 2 offers better integration through well-defined connector APIs.
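
As a hedged illustration of how the connector and driver surface to the user in Sqoop 1, both are selected from the command line: the JDBC connect string lets Sqoop choose a matching connector, while --driver can name an explicit JDBC driver class (which makes Sqoop fall back to its generic JDBC connector). Host, database, and table names below are placeholders:

    # The jdbc:postgresql:// scheme lets Sqoop pick an appropriate connector;
    # --driver forces a specific JDBC driver class and the generic connector
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --driver org.postgresql.Driver \
      --username sqoop_user -P \
      --table orders \
      --target-dir /data/lake/raw/orders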

When Sqoop is invoked to transfer data, two components come into play, namely:

  • Driver: JDBC is the main mechanism by which Sqoop connects to an RDBMS. In the context of Sqoop, the driver refers to the JDBC driver. JDBC is a specification shipped with the Java Development Kit (JDK) consisting of various...

Sqoop support for HDFS


Sqoop is natively built for HDFS import and export; however, architecturally it can support other source and target data stores for data imports and exports. In fact, the convention for the words import and export is entirely with respect to HDFS: data coming into HDFS is an import, and data going out of HDFS is an export. Sqoop also supports incremental imports and exports by using an additional attribute/field to track database increments.
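
A hedged sketch of such an incremental import, assuming an auto-incrementing id column serves as the tracking field (connection details and the last-value are placeholders):

    # Only rows with id greater than the previous high-water mark are imported
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user -P \
      --table orders \
      --target-dir /data/lake/raw/orders \
      --incremental append \
      --check-column id \
      --last-value 100000

Sqoop prints the new high-water mark at the end of the run, which can be fed into the next invocation (or managed automatically with a saved Sqoop job).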

Sqoop also supports a number of file formats for optimized storage, such as Apache Avro, ORC, and Parquet. Both Parquet and Avro have been very popular file formats for HDFS, while ORC offers better performance and compression. As a trade-off, however, Parquet and Avro are the relatively more preferred formats because of their maintainability and the recent enhancements to these formats in HDFS, such as support for multi-value fields and search patterns.
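
As a small sketch (with placeholder connection details), the storage format is chosen with a command-line switch in Sqoop 1:

    # Write the imported data as Avro data files instead of plain text
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user -P \
      --table customer \
      --target-dir /data/lake/raw/customer_avro \
      --as-avrodatafile
    # --as-parquetfile would produce Parquet output instead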

Avro is a remote procedure call and data serialization framework developed...

Sqoop working example


We will be using the Google Cloud Platform for running the whole use case covered in this book. Screenshots and code are presented with this in mind, so that by the end of the book the reader has a fully functioning data lake in the cloud, which can gradually be connected to the real databases existing in the enterprise.

As this is the first chapter dealing with installation and code, it installs certain software, tools, technologies, and libraries that will be referred to in subsequent chapters. In the context of Sqoop alone, some of these installations and commands wouldn't be required, but they are needed to run everything in the cloud on a clean node with nothing installed on it.

These examples have been prepared and tested on CentOS 7, and this will be our platform for all the examples covered in this book.

Installation and Configuration

For all the installations discussed in this book, we are following some basic conventions...
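
As a rough, hedged sketch of the kind of setup this section walks through (the Sqoop version, paths, and JDBC driver jar name are assumptions for illustration, not necessarily the exact ones used in this book):

    # Unpack a Sqoop 1.4.x binary distribution and wire it into the environment
    tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt
    export SQOOP_HOME=/opt/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
    export PATH=$PATH:$SQOOP_HOME/bin
    # Sqoop expects the Hadoop installation to be discoverable as well
    export HADOOP_COMMON_HOME=/opt/hadoop
    export HADOOP_MAPRED_HOME=/opt/hadoop
    # Place the PostgreSQL JDBC driver where Sqoop can load it
    cp postgresql-42.x.jar $SQOOP_HOME/lib/
    # Sanity check
    sqoop version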

When to use Sqoop


Apache Sqoop can be employed for many of the data transfer requirements in a data lake that has HDFS as the main storage for incoming data from various systems. The following bullet points give some of the cases where Apache Sqoop makes the most sense:

  • For regular batch and micro-batch transfers of data between an RDBMS and Hadoop (HDFS/Hive/HBase), use Apache Sqoop. Apache Sqoop is one of the main and most widely used technologies in the data acquisition layer.
  • For transferring data from NoSQL data stores like MongoDB and Cassandra into the Hadoop filesystem.
  • For enterprises with a good number of applications whose stores are RDBMS-based, Sqoop is the best option to transfer data into a data lake.
  • Hadoop is the de facto standard for storing massive data. Sqoop allows you to transfer data from a traditional database into HDFS with ease.
  • Use Sqoop when performance is required, as it is able to split and parallelize data transfer (see the sketch after this list).
  • Sqoop has a concept of connectors and, if your enterprise...
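
As a minimal sketch of the parallelism point above, the degree of parallelism and the column used to split the work are controlled per job; the table and column names here are placeholders:

    # Four parallel map tasks, with the input range split on the primary key
    sqoop import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user -P \
      --table transactions \
      --target-dir /data/lake/raw/transactions \
      --split-by transaction_id \
      -m 4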

When not to use Sqoop


Sqoop is the best suited tool when your data lives in database systems such as Oracle, MySQL, PostgreSQL, or Teradata, but it is not a good fit for event-driven data handling. For event-driven data, it's more apt to go for Apache Flume (Chapter 7, Messaging Layer with Apache Kafka, covers Flume in detail) rather than Sqoop. To summarize, the following are the cases where Sqoop should not be used:

  • For event-driven data.
  • For handling and transferring data that is streamed from various business applications, for example, data streamed using JMS from a source system.
  • For handling real-time data, as opposed to regular bulk/batch data and micro-batches.
  • For handling data in the form of log files generated by the various web servers on which the business application is hosted.
  • If the source data store should not be put under pressure while a Sqoop job is being executed, it's better to avoid Sqoop. Also, if the bulk/batch loads have high volumes of data, the pressure that they would put on...

Real-time Sqooping: a possibility?


For real-time data ingestion, we don't think Sqoop is the right choice. But for near real-time ingestion (intervals of not less than 5 minutes, there being no particular reason for choosing 5 minutes), Sqoop could be used to transfer data. Since these runs are more frequent, the data volume should also be such that Sqoop can handle it and complete each run before the next execution starts.
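
One hedged way to approximate near real-time transfers is a saved Sqoop job using incremental append, triggered on a schedule; the job name, connection details, password file, and interval below are illustrative assumptions:

    # Define a reusable incremental job; a saved job remembers the last-value
    # between runs, so each execution picks up only new rows
    sqoop job --create orders_incremental -- import \
      --connect jdbc:postgresql://db-host:5432/sourcedb \
      --username sqoop_user \
      --password-file /user/sqoop/.pg.password \
      --table orders \
      --target-dir /data/lake/raw/orders \
      --incremental append \
      --check-column id \
      --last-value 0

    # Trigger it every 5 minutes, for example from cron or an Oozie coordinator:
    # */5 * * * * sqoop job --exec orders_incremental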

Other options


For the bulk/batch transfer of data from an RDBMS to the Hadoop filesystem, there aren't many options in the open source world. However, there are other possible ways to transfer data from an RDBMS to Hadoop, and this section gives you, the reader, some of those options so that they can be evaluated against enterprise demands and brought into the data lake if found suitable.

Native big data connectors

Most of the popular databases have connectors with which data can be extracted and loaded onto the Hadoop filesystem. For example, if your RDBMS is Oracle, Oracle provides a suite of products that integrate the Oracle database with Apache Hadoop. The figure below (Figure 25) shows the full suite of Oracle Big Data connector products and what they do (details taken from www.oracle.com).

Figure 25: Oracle Big Data connector suite of products

Similar to Oracle, the MySQL RDBMS has MySQL Applier, its native big data connector, which can be used...

Summary


In this chapter, we started introducing, or rather mapping, technologies onto the various data lake layers, beginning with the data acquisition layer. We started with the layer's definition, then listed the reasons for choosing Sqoop by detailing both its advantages and disadvantages. We then covered Sqoop and its architecture in detail, including its two important versions, namely version 1 and version 2. After this theoretical section, we delved into the actual workings of Sqoop by walking through the setup required to run it, and then dove deep into our SCV use case and what we are achieving with Sqoop.

After reading this chapter, you should have a clear understanding of the data acquisition layer in our data lake architecture. You should also have in-depth knowledge of Apache Sqoop and the reasons for choosing it as the technology of choice for implementation. You would...
