Chapter 13: Streaming Data with Apache Kafka

Apache Kafka opens up the world of real-time data streams. While there are fundamental differences between stream processing and batch processing, the way you build data pipelines will be very similar. Understanding those differences will allow you to build data pipelines that take them into account.

In this chapter, we're going to cover the following main topics:

  • Understanding logs
  • Understanding how Kafka uses logs
  • Building data pipelines with Kafka and NiFi
  • Differentiating stream processing from batch processing
  • Producing and consuming with Python

Understanding logs

If you have written code, you may be familiar with software logs. Software developers use logging to write application output to a text file, recording the different events that happen within the software. They then use these logs to help debug any issues that arise. In Python, you have probably implemented code similar to the following:

import logging

# Log everything (level=0) to python-log.log, overwriting the file on each run
logging.basicConfig(level=0, filename='python-log.log', filemode='w', format='%(levelname)s - %(message)s')
logging.debug('Attempted to divide by zero')
logging.warning('User left field blank in the form')
logging.error("Couldn't find specified file")

The preceding code is a basic logging example that logs different levels – debug, warning, and error – to a file named python-log.log. The code will produce the following output:

DEBUG - Attempted to divide by zero
WARNING - User left field blank in the form
ERROR - Couldn't find specified file

Understanding how Kafka uses logs

Kafka maintains logs that are written to by producers and read by consumers. The following sections will explain topics, consumers, and producers.

Topics

Apache Kafka uses logs to store data as records. Logs in Kafka are called topics. A topic is like a table in a database. In the previous chapter, you tested your Kafka cluster by creating a topic named dataengineering. The topic is saved to disk as a log file. A topic can be a single log, but usually it is scaled horizontally into partitions. Each partition is a log file that can be stored on another server. In a topic with partitions, the message order guarantee no longer applies to the topic as a whole, but only to each partition. The following diagram shows a topic split into three partitions:

Figure 13.2 – A Kafka topic with three partitions

The preceding topic – Transactions – has three partitions labeled P1, P2, and P3. Within each partition, the...
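To see what per-partition ordering means in practice, the following is a minimal sketch using the confluent-kafka library introduced later in this chapter. It assumes a local broker at localhost:9092 and a topic named transactions with three partitions (the topic name mirrors the figure, and the account keys are made up for illustration). Messages that share a key always land in the same partition, so their relative order is preserved even though ordering across the whole topic is not:

import json
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

# Two streams of events for two hypothetical accounts. Using the account ID as the
# message key sends all of an account's events to the same partition, so the
# per-partition order guarantee keeps each account's events in sequence.
events = [('account-1', 'deposit'), ('account-2', 'deposit'),
          ('account-1', 'withdrawal'), ('account-2', 'withdrawal')]

for account, action in events:
    producer.produce('transactions', key=account,
                     value=json.dumps({'account': account, 'action': action}))

producer.flush()  # wait for the broker to acknowledge all messages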

Building data pipelines with Kafka and NiFi

To build a data pipeline with Apache Kafka, you will need to create a producer, since we do not have any production Kafka clusters to connect to. With the producer running, you can read the data much like you would any other file or database.

The Kafka producer

The Kafka producer will take advantage of the production data pipeline from Chapter 11, Project – Building a Production Data Pipeline. The producer data pipeline will do little more than send the data to the Kafka topic. The following screenshot shows the completed producer data pipeline:

Figure 13.7 – The NiFi data pipeline

To create the data pipeline, perform the following steps:

  1. Open a terminal. You need to create the topic before you can send messages to it in NiFi. Enter the following command:
    bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 3 --topic users

    The preceding command is slightly different...

Differentiating stream processing from batch processing

While the processing tools don't change whether you are processing streams or batches, there are two things you should keep in mind when processing streams – unbounded data and time.

Data can be bounded or unbounded. Bounded data has an end, whereas unbounded data is constantly created and is possibly infinite. Bounded data is last year's sales of widgets. Unbounded data is a traffic sensor counting cars and recording their speeds on the highway.

Why is this important in building data pipelines? Because with bounded data, you will know everything about the data. You can see it all at once. You can query it, put it in a staging environment, and then run Great Expectations on it to get a sense of the ranges, values, or other metrics to use in validation as you process your data.
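As an illustration of that bounded-data workflow, the following is a minimal sketch using the pandas-based Great Expectations API covered earlier in the book; the sales figures, column name, and validation range here are made up for the example:

import pandas as pd
import great_expectations as ge

# A bounded dataset: last year's widget sales are complete, so you can see it all at once
sales = pd.DataFrame({'widgets_sold': [120, 95, 143, 88, 101]})

# Wrap the DataFrame so expectations can be run against it
df = ge.from_pandas(sales)

# Because the data is bounded, you can profile it and derive validation ranges up front
result = df.expect_column_values_to_be_between('widgets_sold', min_value=0, max_value=200)
print(result)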

With unbounded data, it is streaming in and you don't know what the next piece of data will look like. This doesn't mean...

Producing and consuming with Python

You can create producers and consumers for Kafka using Python. There are multiple Kafka Python libraries – Kafka-Python, PyKafka, and Confluent Python Kafka. In this section, I will use Confluent Python Kafka, but if you want to use an open source, community-based library, you can use Kafka-Python. The principles and structure of the Python programs will be the same no matter which library you choose.

To install the library, you can use pip. The following command will install it:

pip3 install confluent-kafka

Once the library has finished installing, you can use it by importing it into your applications. The following sections will walk through writing a producer and consumer.
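As a quick check that the install succeeded, you can import the library and print its version (a small sketch, not taken from the book's text):

import confluent_kafka

# Prints the client version tuple, confirming the library is importable
print(confluent_kafka.version())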

Writing a Kafka producer in Python

To write a producer in Python, you will create a producer, send data, and listen for acknowledgements. In the previous examples, you used Faker to create fake data about people. You will use it again to generate the data...
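The rest of the producer code falls outside this excerpt, but the following is a minimal sketch of what such a producer could look like, assuming the confluent-kafka library installed above, the users topic created earlier, a broker on localhost:9092, and Faker for the fake people. The exact fields generated here are illustrative, not necessarily the book's schema:

import json
from confluent_kafka import Producer
from faker import Faker

fake = Faker()

# Connect to the local broker used throughout this chapter
producer = Producer({'bootstrap.servers': 'localhost:9092'})

def receipt(err, msg):
    # Delivery callback: called once per message when the broker acknowledges it
    if err is not None:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}')

for _ in range(10):
    # Generate a fake person, as in the earlier Faker examples
    data = {'name': fake.name(), 'age': fake.random_int(min=18, max=80),
            'street': fake.street_address(), 'city': fake.city()}
    producer.produce('users', value=json.dumps(data), callback=receipt)
    producer.poll(0)  # serve the delivery callbacks

producer.flush()  # wait for all outstanding messages to be acknowledged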

Summary

In this chapter, you learned the basics of Apache Kafka – from what a log is and how Kafka uses it, to partitions, producers, and consumers. You learned how Apache NiFi can create producers and consumers with a single processor. The chapter took a quick detour to explain how streaming data is unbounded and how time and windowing work with streams. These are important considerations when working with streaming data and can result in errors if you assume you have all the data at one time. Lastly, you learned how to use Confluent Python Kafka to write basic producers and consumers in Python.

Equipped with these skills, the next chapter will show you how to build a real-time data pipeline.
