Streaming Data with Kafka

There are several streaming platforms on the market, but Apache Kafka is the front-runner. Like Spark, Kafka is an open source project, but it focuses on being a distributed messaging system. Kafka is used for several applications, including microservices and data engineering. Confluent is the largest contributor to Apache Kafka and offers several products in the ecosystem, such as hosted Kafka, Schema Registry, Kafka Connect, and the Kafka REST API, among others. We will go through several areas of Confluent's Kafka platform, focusing on data processing and movement.

In this chapter, we will cover the following main topics:

  • Kafka architecture
  • Setting up Confluent Kafka
  • Kafka streams
  • Schema Registry
  • Spark and Kafka
  • Kafka Connect

Technical requirements

The tooling used in this chapter is tied to the tech stack chosen for the book. All vendors should offer a free trial account.

I will use the following:

  • Databricks
  • Confluent Kafka

Setting up your environment

Before we begin our chapter, let’s take the time to set up our working environment.

Python, AWS, and Databricks

As with many other chapters, this one assumes you have a working Python 3.6+ release installed in your development environment. We will also assume you have an AWS account and have set up Databricks with it.

Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let’s validate that everything has been installed correctly. If this command produces the tool version, then everything is working correctly:

databricks --version

Now, let’s set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will ask for the host created for your Databricks instance and the created token:

databricks configure --token

We can determine whether the CLI is...
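
One simple way to check, assuming the token configured above is valid, is to run any command that talks to your workspace and confirm it returns without an authentication error, for example:

databricks workspace ls /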

Confluent Kafka

Here, we will look at getting a free account and setting up an initial cluster. Currently, Confluent offers $400 of credit to all new accounts, so we should ideally have no costs for our labs, but we will look for cost savings as we move forward.

Signing up

The signup page is https://www.confluent.io/get-started/. It offers several authentication options, including Gmail and GitHub, and asks for personal information such as your name and your company’s name.

You will be presented with the cluster creation page.

Figure 5.1: The cluster creation page

Here, we will use the Basic cluster, which is free at the time of writing.

The next screen allows us to choose a cloud platform and region.

Figure 5.2: Choosing the platform to use

Here, I chose AWS, Northern Virginia, and a single zone. This is a good choice for our labs, but a production system would need a more robust configuration...

Kafka architecture

Kafka is an open source distributed streaming platform designed to scale to impressive levels. Kafka can store as much data as you have storage for, but it shouldn’t be used as a database. Kafka’s core architecture is composed of five main concepts: topics, brokers, partitions, producers, and consumers.
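
As a minimal sketch of how these pieces fit together, the following snippet uses the confluent_kafka AdminClient to create a topic split into several partitions; the topic name, broker address, and credentials are illustrative placeholders, not values from the book:

from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder connection settings for a Confluent Cloud cluster
admin = AdminClient({
    "bootstrap.servers": "<host>:<port>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api_key>",
    "sasl.password": "<api_secret>",
})

# A topic is split into partitions, which the brokers spread across the cluster
new_topic = NewTopic("chapter_5_example", num_partitions=3, replication_factor=3)
futures = admin.create_topics([new_topic])
futures["chapter_5_example"].result()  # raises an exception if creation failed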

Topics

For a developer, a Kafka topic is the most important concept to understand. Topics are where data is “stored.” Topics hold data as events, which means each record has a key and a value. Keys in this context are not related to a database key that defines uniqueness, but they can be used for organizational purposes. The value is the data itself, which can be in a few different formats, such as strings, JSON, Avro, and Protobuf. When your data is written to Kafka, it will also carry metadata; the most important piece is the timestamp.

Working with Kafka can be confusing because the data isn’t stored in a database...
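
To make the key/value idea concrete, here is a minimal producer sketch using the confluent_kafka package; the topic name, event contents, and connection settings are illustrative placeholders:

import json

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "<host>:<port>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api_key>",
    "sasl.password": "<api_secret>",
})

# The key is used for partitioning and organization; the value carries the data
event_key = "user_42"
event_value = json.dumps({"action": "login", "device": "mobile"})

producer.produce("chapter_5", key=event_key, value=event_value)
producer.flush()  # block until the broker has acknowledged the event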

Schema Registry

Kafka guarantees the delivery of events sent from producers, but it does not attempt to guarantee quality. Kafka assumes that your applications can coordinate quality data between consumers and producers. On the surface, this seems reasonable and easy to accomplish. The reality is that even in ideal situations, this type of assumed coordination is unrealistic. This type of problem is common among data producers and consumers; the solution is to enforce a data contract.

The general rule of thumb is garbage in, garbage out. Confluent Schema Registry is an attempt at building contracts for your data schemas in Kafka. It is a layer that sits in front of Kafka and acts as its gatekeeper: events can’t be produced to a topic unless Schema Registry first gives its blessing, and consumers can know exactly what they will get by checking Schema Registry first.

This process happens behind the scenes, and the Confluent...
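
As a rough illustration of how a client consults the registry, the following sketch uses the confluent_kafka SchemaRegistryClient (the same class imported in this chapter’s lab) to look up the latest schema registered for a topic’s value; the URL, credentials, and subject name are placeholders:

from confluent_kafka.schema_registry import SchemaRegistryClient

schema_registry = SchemaRegistryClient({
    "url": "<schema_registry_url>",
    "basic.auth.user.info": "<schema_api_key>:<schema_api_secret>",
})

# By convention, a topic's value schema is registered under the "<topic>-value" subject
latest = schema_registry.get_latest_version("chapter_5-value")
print(latest.schema_id)
print(latest.schema.schema_str)  # the Avro/JSON/Protobuf schema definition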

Kafka Connect

Confluent noticed that many tasks are almost cookie-cutter and could be shared across projects and clients. One example might be listening to a topic and copying it into a Snowflake table. Confluent created Connect to allow engineers to define very generic tasks and deploy them to a cluster that runs them. If you need a source or sink connector that doesn’t exist, you can build your own. Connect is a very robust platform that can be a major pipeline component. Connect also offers some basic transformations, which can be very useful. Confluent Cloud offers a hosted Connect cluster in the same way that it offers Schema Registry and Kafka Core.

To access Connect in Confluent Cloud, we will navigate to our cluster and then look for the Connectors option in the menu bar.

Figure 5.10: The Connectors option in the menu bar

Next, we will have the option to add a new connector. We can set up a simple connector that creates fake data, as sketched below.
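
The UI presents these settings as a form, but conceptually a connector is just a small configuration. The following dictionary is an illustrative sketch of the kind of settings the fake-data (Datagen) source connector takes; the exact field names and values may differ slightly from what Confluent Cloud shows:

# Illustrative Datagen source connector settings (field names are approximate)
datagen_config = {
    "connector.class": "DatagenSource",   # Confluent Cloud's fake-data source
    "name": "chapter_5_datagen",
    "kafka.topic": "chapter_5",           # topic the generated events are written to
    "output.data.format": "AVRO",         # ties the connector to Schema Registry
    "quickstart": "USERS",                # which sample dataset to generate
    "tasks.max": "1",
    "kafka.api.key": "<kafka_api_key>",
    "kafka.api.secret": "<kafka_api_secret>",
}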

...

Spark and Kafka

Spark has a long history of supporting Kafka with both streaming and batch processing. Here, we will go over some of the structured streaming Kafka-related APIs.

Here, we have a streaming read of a Kafka cluster. It will return a streaming DataFrame:

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "<host>:<port>,<host>:<port>") \
  .option("subscribe", "<topic>") \
  .load()
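
The resulting DataFrame exposes Kafka’s fields as columns (key, value, topic, partition, offset, timestamp, and so on), with key and value arriving as binary. A common first step, sketched below, is to cast them to strings before any further parsing:

from pyspark.sql.functions import col

# key and value arrive as binary; cast them for downstream parsing
decoded_df = df.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp"),
)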

Conversely, if you want to do a true batch process, you can also read from Kafka. Keep in mind that we have already covered techniques that create a streaming context but run it in a batch style to avoid rereading messages:

df = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "<host>:<port>, <host>:<port>")\
  ...

Practical lab

In these practical labs, we will create common tasks that involve integrating Kafka and Spark to build Delta tables:

  1. Connect to Confluent Kafka using Spark. Use Schema Registry in a Spark job that ingests data from a topic and writes it to a Delta table.
  2. Use the Delta Sink Connector to finalize the table.

It is your task to write a Spark job that ingests events from a Confluent Kafka topic and writes them out to a Delta table.

Solution

We will get started as follows:

  1. First, we will import our libraries. Also, note that you must install the confluent-kafka package first using pip:
    from confluent_kafka.schema_registry import SchemaRegistryClient
    import ssl
    from pyspark.sql.functions import from_json, udf, col, expr
    from pyspark.sql.types import StringType
  2. Here, we set up our variables for our initial Kafka connection:
    kafka_cluster = "see confluent website"
    kafka_api_key = "see confluent website"
    kafka_api_secret = "see confluent website"
    kafka_topic = "chapter_5"
    boostrap_server = "see confluent website"
    schema_reg_url = "see confluent website"
    schema_api_key = "see confluent website"
    schema_api_secret = "see confluent website"
  3. We will create a UDF to convert the value into a string:
    binary_to_string = udf(lambda x: str(int.from_bytes(x, byteorder='big')), StringType())
  4. Now, we will...
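
As a rough, hypothetical sketch of where these pieces could lead (this is not the book’s own solution), a job like this typically reads the topic as a stream using the credentials defined above, keeps or parses the value, and writes the result to a Delta table. The checkpoint path and table name below are illustrative placeholders:

# Hypothetical continuation: read the topic as a stream from Confluent Cloud
raw_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", boostrap_server)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{kafka_api_key}" password="{kafka_api_secret}";',
    )
    .option("subscribe", kafka_topic)
    .load()
)

# Keep the raw payload as a string; decoding it against the registered schema
# (for example, with from_json for JSON-encoded values) would be the next step
bronze_df = raw_df.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp"),
)

(bronze_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/chapter_5_checkpoint")
    .toTable("chapter_5_bronze"))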

Summary

So, as our Confluent Kafka chapter comes to a close, let’s reflect on where we have been and what we have done. We reviewed the fundamentals of Kafka’s architecture and how to set up Confluent Kafka. We looked at writing producers and consumers and working with Schema Registry and Connect. Lastly, we looked at integrating with Spark and Delta Lake. Kafka is an essential component of streaming data, and streaming data has become an in-demand skill and an important technique. We will delve deep into machine learning operations (MLOps) and several other AI technologies as we advance.
