Streaming Data with Kafka

There are several streaming platforms on the market, but Apache Kafka is the front-runner. Like Spark, Kafka is an open source project, but it focuses on being a distributed messaging system. Kafka is used for several applications, including microservices and data engineering. Confluent is the largest contributor to Apache Kafka and offers several products in the ecosystem, such as hosted Kafka, Schema Registry, Kafka Connect, and the Kafka REST API, among others. We will go through several areas of Confluent's Kafka platform, focusing on data processing and movement.

In this chapter, we will cover the following main topics:

  • Kafka architecture
  • Setting up Confluent Kafka
  • Kafka streams
  • Schema Registry
  • Spark and Kafka
  • Kafka Connect

Technical requirements

The tooling used in this chapter is tied to the tech stack chosen for the book. All vendors should offer a free trial account.

I will use the following:

  • Databricks
  • Confluent Kafka

Setting up your environment

Before we begin our chapter, let’s take the time to set up our working environment.

Python, AWS, and Databricks

As with many other chapters, this one assumes you have a working Python 3.6+ release installed in your development environment. We will also assume you have an AWS account and have set up Databricks with it.

Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything has been installed correctly. If this command produces the tool version, then everything is working correctly:

databricks --version

Now, let’s set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will ask for the host created for your Databricks instance and the created token:

databricks configure --token

We can determine whether the CLI is...
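
One simple way to check, assuming the token configured above is valid, is to run any command that talks to your workspace and confirm it returns without an authentication error, for example:

databricks workspace ls /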

Confluent Kafka

Here, we will look at getting a free account and setting up an initial cluster. Currently, Confluent offers $400 of credit to all new accounts, so we should ideally have no costs for our labs, but we will look for cost savings as we move forward.

Signing up

The signup page is https://www.confluent.io/get-started/. It offers several authentication options, including Gmail and GitHub, and asks for personal information such as your name and your company’s name.

You will be presented with the cluster creation page.

Figure 5.1: The cluster creation page

Here, we will use the Basic cluster, which is free at the time of writing.

The next screen allows us to choose a cloud platform and region.

Figure 5.2: Choosing the platform to use

Here, I chose AWS, Northern Virginia, and a single zone. This is a good choice for our labs, but a production system would need a more robust configuration...

Kafka architecture

Kafka is an open source distributed streaming platform designed to scale to impressive levels. Kafka can store as much data as you have storage for, but it shouldn’t be used as a database. Kafka’s core architecture is composed of five main concepts: topics, brokers, partitions, producers, and consumers.
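
As a minimal sketch of how these pieces fit together, the following snippet uses the confluent_kafka AdminClient to create a topic split into several partitions; the topic name, broker address, and credentials are illustrative placeholders, not values from the book:

from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder connection settings for a Confluent Cloud cluster
admin = AdminClient({
    "bootstrap.servers": "<host>:<port>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api_key>",
    "sasl.password": "<api_secret>",
})

# A topic is split into partitions, which the brokers spread across the cluster
new_topic = NewTopic("chapter_5_example", num_partitions=3, replication_factor=3)
futures = admin.create_topics([new_topic])
futures["chapter_5_example"].result()  # raises an exception if creation failed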

Topics

For a developer, a Kafka topic is the most important concept to understand. Topics are where data is “stored.” Topics hold data as events, which means each record has a key and a value. Keys in this context are not related to a database key that defines uniqueness, but they can be used for organizational purposes. The value is the data itself, which can be in a few different formats, such as strings, JSON, Avro, and Protobuf. When your data is written to Kafka, it will also carry metadata; the most important piece is the timestamp.

Working with Kafka can be confusing because the data isn’t stored in a database...
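
To make the key/value idea concrete, here is a minimal producer sketch using the confluent_kafka package; the topic name, event contents, and connection settings are illustrative placeholders:

import json

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "<host>:<port>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api_key>",
    "sasl.password": "<api_secret>",
})

# The key is used for partitioning and organization; the value carries the data
event_key = "user_42"
event_value = json.dumps({"action": "login", "device": "mobile"})

producer.produce("chapter_5", key=event_key, value=event_value)
producer.flush()  # block until the broker has acknowledged the event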

Schema Registry

Kafka guarantees the delivery of events sent from producers, but it does not attempt to guarantee quality. Kafka assumes that your applications can coordinate quality data between consumers and producers. On the surface, this seems reasonable and easy to accomplish. The reality is that even in ideal situations, this type of assumed coordination is unrealistic. This type of problem is common among data producers and consumers; the solution is to enforce a data contract.

The general rule of thumb is garbage in, garbage out. Confluent Schema Registry is an attempt at building contracts for your data schemas in Kafka. It is a layer that sits in front of Kafka and acts as its gatekeeper: events can’t be produced to a topic unless Schema Registry first gives its blessing, and consumers can know exactly what they will get by checking Schema Registry first.

This process happens behind the scenes, and the Confluent...
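
As a rough illustration of how a client consults the registry, the following sketch uses the confluent_kafka SchemaRegistryClient (the same class imported in this chapter’s lab) to look up the latest schema registered for a topic’s value; the URL, credentials, and subject name are placeholders:

from confluent_kafka.schema_registry import SchemaRegistryClient

schema_registry = SchemaRegistryClient({
    "url": "<schema_registry_url>",
    "basic.auth.user.info": "<schema_api_key>:<schema_api_secret>",
})

# By convention, a topic's value schema is registered under the "<topic>-value" subject
latest = schema_registry.get_latest_version("chapter_5-value")
print(latest.schema_id)
print(latest.schema.schema_str)  # the Avro/JSON/Protobuf schema definition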

Kafka Connect

Confluent noticed that many tasks are almost cookie-cutter and could be shared across projects and clients. One example might be listening to a topic and copying it into a Snowflake table. Confluent created Connect to allow engineers to define very generic tasks and deploy them to a cluster that runs them. If you need a source or sink connector that doesn’t exist, you can build your own. Connect is a very robust platform that can be a major pipeline component. Connect also offers some basic transformations, which can be very useful. Confluent Cloud offers a hosted Connect cluster in the same way that it offers Schema Registry and Kafka Core.

To access Connect in Confluent Cloud, we will navigate to our cluster and then look for the Connectors option in the menu bar.

Figure 5.10: The Connectors option in the menu bar

Next, we will have the option to add a new connector. We can set up a simple connector that creates fake data, as sketched below.
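
The UI presents these settings as a form, but conceptually a connector is just a small configuration. The following dictionary is an illustrative sketch of the kind of settings the fake-data (Datagen) source connector takes; the exact field names and values may differ slightly from what Confluent Cloud shows:

# Illustrative Datagen source connector settings (field names are approximate)
datagen_config = {
    "connector.class": "DatagenSource",   # Confluent Cloud's fake-data source
    "name": "chapter_5_datagen",
    "kafka.topic": "chapter_5",           # topic the generated events are written to
    "output.data.format": "AVRO",         # ties the connector to Schema Registry
    "quickstart": "USERS",                # which sample dataset to generate
    "tasks.max": "1",
    "kafka.api.key": "<kafka_api_key>",
    "kafka.api.secret": "<kafka_api_secret>",
}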

...

Spark and Kafka

Spark has a long history of supporting Kafka with both streaming and batch processing. Here, we will go over some of the structured streaming Kafka-related APIs.

Here, we have a streaming read of a Kafka cluster. It will return a streaming DataFrame:

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "<host>:<port>,<host>:<port>") \
  .option("subscribe", "<topic>") \
  .load()
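
The resulting DataFrame exposes Kafka’s fields as columns (key, value, topic, partition, offset, timestamp, and so on), with key and value arriving as binary. A common first step, sketched below, is to cast them to strings before any further parsing:

from pyspark.sql.functions import col

# key and value arrive as binary; cast them for downstream parsing
decoded_df = df.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp"),
)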

Conversely, if you want to do a true batch process, you can also read from Kafka. Keep in mind that we have already covered techniques that create a streaming context but run it in a batch style to avoid rereading messages:

df = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "<host>:<port>, <host>:<port>")\
  ...

Practical lab

In these practical labs, we will create common tasks that involve integrating Kafka and Spark to build Delta tables:

  1. Connect to Confluent Kafka using Spark. Use Schema Registry in a Spark job that ingests data from a topic and writes it to a Delta table.
  2. Use the Delta Sink Connector to finalize the table.

It is your task to write a Spark job that ingests events from a Confluent Kafka topic and writes them out to a Delta table.

Solution

We will get started as follows:

  1. First, we will import our libraries. Also, note that you must install the confluent-kafka package first using pip:
    from confluent_kafka.schema_registry import SchemaRegistryClient
    import ssl
    from pyspark.sql.functions import from_json, udf, col, expr
    from pyspark.sql.types import StringType
  2. Here, we set up our variables for our initial Kafka connection:
    kafka_cluster = "see confluent website"
    kafka_api_key = "see confluent website"
    kafka_api_secret = "see confluent website"
    kafka_topic = "chapter_5"
    boostrap_server = "see confluent website"
    schema_reg_url = "see confluent website"
    schema_api_key = "see confluent website"
    schema_api_secret = "see confluent website"
  3. We will create a UDF to convert the value into a string:
    binary_to_string = udf(lambda x: str(int.from_bytes(x, byteorder='big')), StringType())
  4. Now, we will...
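
As a rough, hypothetical sketch of where these pieces could lead (this is not the book’s own solution), a job like this typically reads the topic as a stream using the credentials defined above, keeps or parses the value, and writes the result to a Delta table. The checkpoint path and table name below are illustrative placeholders:

# Hypothetical continuation: read the topic as a stream from Confluent Cloud
raw_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", boostrap_server)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{kafka_api_key}" password="{kafka_api_secret}";',
    )
    .option("subscribe", kafka_topic)
    .load()
)

# Keep the raw payload as a string; decoding it against the registered schema
# (for example, with from_json for JSON-encoded values) would be the next step
bronze_df = raw_df.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp"),
)

(bronze_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/chapter_5_checkpoint")
    .toTable("chapter_5_bronze"))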

Summary

So, as our Confluent Kafka chapter comes to a close, let’s reflect on where we have been and what we have done. We reviewed the fundamentals of Kafka’s architecture and how to set up Confluent Kafka. We looked at writing producers and consumers and working with Schema Registry and Connect. Lastly, we looked at integrating with Spark and Delta Lake. Kafka is an essential component of streaming data, and streaming data has become an in-demand skill and an important technique. We will delve deep into machine learning operations (MLOps) and several other AI technologies as we advance.
