Building ETL Pipelines Using Kafka

In the previous chapter, we learned about Confluent Platform. We covered its architecture in detail and discussed its components. You also learned how to export data from Kafka to HDFS using different tools; we went through Camus, Gobblin, Flume, and Kafka Connect to cover different ways of bringing data into HDFS. We also recommend you try all the tools discussed in the last chapter to understand how they work. Now we will look at creating an ETL pipeline using these tools and take a closer look at Kafka Connect use cases and examples.

In this chapter, we will cover Kafka Connect in detail. The following are the topics we will cover:

  • Use of Kafka in the ETL pipeline
  • Introduction to Kafka Connect
  • Kafka Connect architecture
  • Deep dive into Kafka Connect
  • Introductory example of Kafka Connect
  • Common use cases
...

Considerations for using Kafka in ETL pipelines

ETL is the process of Extracting, Transforming, and Loading data into a target system; each step is explained next. It is the approach a large number of organizations follow to build their data pipelines.

  • Extraction: Extraction is the process of ingesting data from the source system and making it available for further processing. Any prebuilt tool can be used to extract data from the source system. For example, to extract server logs or Twitter data you can use Apache Flume, and to extract data from a database you can use any JDBC-based application or build your own. Whichever application is used for extraction, it should not affect the performance of the source system in any manner.

  • Transformation: Transformation refers to processing extracted data and converting it into some meaningful...

Introducing Kafka Connect

Kafka Connect is used to copy data into and out of Kafka. There are already a lot of tools available to move data from one system to another, and you will find many use cases where you want to run both real-time and batch analytics on the same data. Data can come from different sources but may finally land in the same category or type.

We may want to bring this data into Kafka topics and then pass it to a real-time processing engine or store it for batch processing. If you look closely at the following figure, you can see the different processes involved:

Figure: Kafka Connect

Let's look into each component in detail:

  • Ingestion in Kafka: Data is inserted into Kafka topics from different sources, and most of the time the types of sources are common. For example, you may want to insert server logs into Kafka topics (a minimal sketch of this follows below), or insert all records from the database...
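
As an illustration of log ingestion (this sketch is not from the book), a file of server logs can be piped into a topic with the console producer that ships with Apache Kafka. It assumes a broker running on localhost:9092 and an existing topic; the topic name and log file path are hypothetical:

# Pipe an existing log file into a Kafka topic line by line.
# Broker address, topic name, and file path are placeholders.
bin/kafka-console-producer.sh \
  --broker-list localhost:9092 \
  --topic server-logs < /var/log/myapp/server.log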

Deep dive into Kafka Connect

Let's get into the architecture of Kafka Connect. The following figure gives a good idea of Kafka Connect:

Figure: Kafka Connect architecture

Kafka Connect has three major models in its design:

  • Connector: A Connector is configured by defining the Connector class and its configuration. The Connector class is defined based on the source or target of the data, which means it will be different for a database source and a file source. This is then followed by setting up the configuration for these classes; for example, the configuration for a database source could be the IP address of the database, the username and password to connect to it, and so on (a sample configuration is sketched after this list). The Connector creates a set of tasks, which are actually responsible for copying data from the source or copying data to the target. Connectors are of two types:
    • Source Connector: This is responsible for ingesting...
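
Here is a minimal sketch of what defining a Connector by its class and configuration looks like in practice, using the FileStreamSource connector that ships with Apache Kafka; the connector name, file, and topic below are placeholder values, not an example taken from the book:

# Write a standalone connector configuration: the Connector class plus its settings.
cat > file-source.properties <<'EOF'
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test
EOF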

Introductory examples of using Kafka Connect

Kafka Connect provides us with various Connectors, and we can use the Connectors based on our use case requirements. It also provides an API that can be used to build your own Connector. We will go through a few basic examples in this section. We have tested the code on an Ubuntu machine. Download the Confluent Platform tar file from the Confluent website:

  • Import or Source Connector: This is used to ingest data from the source system into Kafka. There are already a few inbuilt Connectors available in the Confluent Platform.
  • Export or Sink Connector: This is used to export data from Kafka topics to external systems. Let's look at a few Connectors available for real use cases.
  • JDBC Source Connector: The JDBC Connector can be used to pull data from any JDBC-supported system to Kafka.

Let's see how to use it:

  1. Install sqlite...
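
To give a feel for how these steps fit together, here is a rough end-to-end sketch rather than the book's exact commands; the database name, table, column, topic prefix, and the property-file path passed to connect-standalone are placeholders that depend on your Confluent Platform layout:

# 1. Create a small SQLite database and table to act as the source.
sqlite3 test.db "CREATE TABLE accounts(id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);"
sqlite3 test.db "INSERT INTO accounts(name) VALUES('alice');"

# 2. Describe the JDBC source connector: the Connector class plus connection settings.
cat > jdbc-source.properties <<'EOF'
name=test-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:sqlite:test.db
mode=incrementing
incrementing.column.name=id
topic.prefix=test-sqlite-
EOF

# 3. Run Kafka Connect in standalone mode with this connector configuration.
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties jdbc-source.properties

Rows inserted into the table after the worker starts should then appear on a topic named with the configured prefix plus the table name (here, test-sqlite-accounts).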

Kafka Connect common use cases

You have learned about Kafka Connect in detail. We know Kafka Connect is used for copying data in and out of Kafka.

Let's understand a few common use cases of Kafka Connect:

  • Copying data to HDFS: Users want to copy data from Kafka topics to HDFS for various reasons. Some want to copy it to HDFS just to take a backup of historical data, while others may want to copy it there for batch processing. There are already many open source tools available, such as Camus, Gobblin, Flume, and so on, but maintaining, installing, and running these jobs takes more effort than Kafka Connect requires. Kafka Connect copies data from topics in parallel and is capable of scaling up further if required (a sample sink configuration is sketched after this list).
  • Replication: Replicating Kafka topics from one cluster to another cluster is also a popular feature offered by Kafka Connect. You may want to replicate...
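
As a sketch of the HDFS use case (placeholder values, not taken from the book), the Confluent HDFS sink connector is configured in the same style as the source connectors shown earlier:

# Drain records from the listed Kafka topics into HDFS, writing a file every flush.size records.
cat > hdfs-sink.properties <<'EOF'
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=server-logs
hdfs.url=hdfs://localhost:9000
flush.size=1000
EOF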

Summary 

In this chapter, we learned about Kafka Connect in detail. We also learned how we can use Kafka in an ETL pipeline. We covered examples of the JDBC import and export Connectors to give you a brief idea of how they work. We expect you to run these examples yourself to get more insight into what happens when you run Connectors.

In the next chapter, you will learn about Kafka Streams in detail, and we will also see how we can use the Kafka Streams API to build our own streaming applications. We will explore the Kafka Streams API in detail and focus on its advantages.
