Building ETL Pipelines Using Kafka

In the previous chapter, we learned about Confluent Platform. We covered its architecture in detail and discussed its components. You also learned how to export data from Kafka to HDFS using different tools; we went through Camus, Gobblin, Flume, and Kafka Connect to cover different ways of bringing data into HDFS. We also recommend you try all the tools discussed in the last chapter to understand how they work. Now we will look at creating an ETL pipeline using these tools and take a closer look at Kafka Connect use cases and examples.

In this chapter, we will cover Kafka Connect in detail. The following are the topics we will cover:

  • Use of Kafka in the ETL pipeline
  • Introduction to Kafka Connect
  • Kafka Connect architecture
  • Deep dive into Kafka Connect
  • Introductory example of Kafka Connect
  • Common use cases
...

Considerations for using Kafka in ETL pipelines

ETL is the process of Extracting, Transforming, and Loading data into a target system; each step is explained next. It is the approach a large number of organizations follow to build their data pipelines.

  • Extraction: Extraction is the process of ingesting data from the source system and making it available for further processing. Any prebuilt tool can be used to extract data from the source system. For example, to extract server logs or Twitter data you can use Apache Flume, and to extract data from a database you can use any JDBC-based application or build your own. Whichever application is used for extraction, it should not affect the performance of the source system in any manner.

  • Transformation: Transformation refers to processing extracted data and converting it into some meaningful...

Introducing Kafka Connect

Kafka Connect is used to copy data into and out of Kafka. There are already a lot of tools available to move data from one system to another, and you will find many use cases where you want to run both real-time and batch analytics on the same data. Data can come from different sources but may finally land in the same category or type.

We may want to bring this data into Kafka topics and then pass it to a real-time processing engine or store it for batch processing. If you look closely at the following figure, you can see the different processes involved:

Figure: Kafka Connect

Let's look into each component in detail:

  • Ingestion in Kafka: Data is inserted into Kafka topics from different sources, and most of the time the types of sources are common. For example, you may want to insert server logs into Kafka topics (a minimal sketch of this follows below), or insert all records from the database...
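
As an illustration of log ingestion (this sketch is not from the book), a file of server logs can be piped into a topic with the console producer that ships with Apache Kafka. It assumes a broker running on localhost:9092 and an existing topic; the topic name and log file path are hypothetical:

# Pipe an existing log file into a Kafka topic line by line.
# Broker address, topic name, and file path are placeholders.
bin/kafka-console-producer.sh \
  --broker-list localhost:9092 \
  --topic server-logs < /var/log/myapp/server.log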

Deep dive into Kafka Connect

Let's get into the architecture of Kafka Connect. The following figure gives a good idea of Kafka Connect:

Figure: Kafka Connect architecture

Kafka Connect has three major models in its design:

  • Connector: A Connector is configured by defining the Connector class and its configuration. The Connector class is defined based on the source or target of the data, which means it will be different for a database source and a file source. This is then followed by setting up the configuration for these classes; for example, the configuration for a database source could be the IP address of the database, the username and password to connect to it, and so on (a sample configuration is sketched after this list). The Connector creates a set of tasks, which are actually responsible for copying data from the source or copying data to the target. Connectors are of two types:
    • Source Connector: This is responsible for ingesting...
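
Here is a minimal sketch of what defining a Connector by its class and configuration looks like in practice, using the FileStreamSource connector that ships with Apache Kafka; the connector name, file, and topic below are placeholder values, not an example taken from the book:

# Write a standalone connector configuration: the Connector class plus its settings.
cat > file-source.properties <<'EOF'
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test
EOF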

Introductory examples of using Kafka Connect

Kafka Connect provides us with various Connectors, and we can use the Connectors based on our use case requirements. It also provides an API that can be used to build your own Connector. We will go through a few basic examples in this section. We have tested the code on an Ubuntu machine. Download the Confluent Platform tar file from the Confluent website:

  • Import or Source Connector: This is used to ingest data from the source system into Kafka. There are already a few inbuilt Connectors available in the Confluent Platform.
  • Export or Sink Connector: This is used to export data from Kafka topics to external systems. Let's look at a few Connectors available for real use cases.
  • JDBC Source Connector: The JDBC Connector can be used to pull data from any JDBC-supported system to Kafka.

Let's see how to use it:

  1. Install sqlite...
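
To give a feel for how these steps fit together, here is a rough end-to-end sketch rather than the book's exact commands; the database name, table, column, topic prefix, and the property-file path passed to connect-standalone are placeholders that depend on your Confluent Platform layout:

# 1. Create a small SQLite database and table to act as the source.
sqlite3 test.db "CREATE TABLE accounts(id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);"
sqlite3 test.db "INSERT INTO accounts(name) VALUES('alice');"

# 2. Describe the JDBC source connector: the Connector class plus connection settings.
cat > jdbc-source.properties <<'EOF'
name=test-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:sqlite:test.db
mode=incrementing
incrementing.column.name=id
topic.prefix=test-sqlite-
EOF

# 3. Run Kafka Connect in standalone mode with this connector configuration.
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties jdbc-source.properties

Rows inserted into the table after the worker starts should then appear on a topic named with the configured prefix plus the table name (here, test-sqlite-accounts).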

Kafka Connect common use cases

You have learned about Kafka Connect in detail. We know Kafka Connect is used for copying data in and out of Kafka.

Let's understand a few common use cases of Kafka Connect:

  • Copying data to HDFS: Users want to copy data from Kafka topics to HDFS for various reasons. Some want to copy it to HDFS just to take a backup of historical data, while others may want to copy it there for batch processing. There are already many open source tools available, such as Camus, Gobblin, Flume, and so on, but maintaining, installing, and running these jobs takes more effort than Kafka Connect requires. Kafka Connect copies data from topics in parallel and is capable of scaling up further if required (a sample sink configuration is sketched after this list).
  • Replication: Replicating Kafka topics from one cluster to another cluster is also a popular feature offered by Kafka Connect. You may want to replicate...
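
As a sketch of the HDFS use case (placeholder values, not taken from the book), the Confluent HDFS sink connector is configured in the same style as the source connectors shown earlier:

# Drain records from the listed Kafka topics into HDFS, writing a file every flush.size records.
cat > hdfs-sink.properties <<'EOF'
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=server-logs
hdfs.url=hdfs://localhost:9000
flush.size=1000
EOF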

Summary 

In this chapter, we learned about Kafka Connect in detail. We also learned how we can use Kafka in an ETL pipeline. We covered examples of the JDBC import and export Connectors to give you a brief idea of how they work. We expect you to run these examples yourself to get more insight into what happens when you run Connectors.

In the next chapter, you will learn about Kafka Streams in detail, and we will also see how we can use the Kafka Streams API to build our own streaming applications. We will explore the Kafka Streams API in detail and focus on its advantages.
