Chapter 13: Streaming Data with Apache Kafka

Apache Kafka opens up the world of real-time data streams. While there are fundamental differences between stream processing and batch processing, the way you build data pipelines will be very similar. Understanding those differences will allow you to build data pipelines that take them into account.

In this chapter, we're going to cover the following main topics:

  • Understanding logs
  • Understanding how Kafka uses logs
  • Building data pipelines with Kafka and NiFi
  • Differentiating stream processing from batch processing
  • Producing and consuming with Python

Understanding logs

If you have written code, you may be familiar with software logs. Software developers use logging to write application output to a text file, recording the different events that happen within the software. They then use these logs to help debug any issues that arise. In Python, you have probably implemented code similar to the following:

import logging

# Log everything (level=0) to python-log.log, overwriting the file on each run
logging.basicConfig(level=0, filename='python-log.log', filemode='w', format='%(levelname)s - %(message)s')
logging.debug('Attempted to divide by zero')
logging.warning('User left field blank in the form')
logging.error("Couldn't find specified file")

The preceding code is a basic logging example that logs different levels – debug, warning, and error – to a file named python-log.log. The code will produce the following output:

DEBUG - Attempted to divide by zero
WARNING - User left field blank in the form
ERROR - Couldn't find specified file

Understanding how Kafka uses logs

Kafka maintains logs that are written to by producers and read by consumers. The following sections will explain topics, consumers, and producers.

Topics

Apache Kafka uses logs to store data as records. Logs in Kafka are called topics. A topic is like a table in a database. In the previous chapter, you tested your Kafka cluster by creating a topic named dataengineering. The topic is saved to disk as a log file. A topic can be a single log, but usually it is scaled horizontally into partitions. Each partition is a log file that can be stored on another server. In a topic with partitions, the message order guarantee no longer applies to the topic as a whole, but only to each partition. The following diagram shows a topic split into three partitions:

Figure 13.2 – A Kafka topic with three partitions

The preceding topic – Transactions – has three partitions labeled P1, P2, and P3. Within each partition, the...
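To see what per-partition ordering means in practice, the following is a minimal sketch using the confluent-kafka library introduced later in this chapter. It assumes a local broker at localhost:9092 and a topic named transactions with three partitions (the topic name mirrors the figure, and the account keys are made up for illustration). Messages that share a key always land in the same partition, so their relative order is preserved even though ordering across the whole topic is not:

import json
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

# Two streams of events for two hypothetical accounts. Using the account ID as the
# message key sends all of an account's events to the same partition, so the
# per-partition order guarantee keeps each account's events in sequence.
events = [('account-1', 'deposit'), ('account-2', 'deposit'),
          ('account-1', 'withdrawal'), ('account-2', 'withdrawal')]

for account, action in events:
    producer.produce('transactions', key=account,
                     value=json.dumps({'account': account, 'action': action}))

producer.flush()  # wait for the broker to acknowledge all messages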

Building data pipelines with Kafka and NiFi

To build a data pipeline with Apache Kafka, you will need to create a producer, since we do not have any production Kafka clusters to connect to. With the producer running, you can read the data much like you would any other file or database.

The Kafka producer

The Kafka producer will take advantage of the production data pipeline from Chapter 11, Project – Building a Production Data Pipeline. The producer data pipeline will do little more than send the data to the Kafka topic. The following screenshot shows the completed producer data pipeline:

Figure 13.7 – The NiFi data pipeline

To create the data pipeline, perform the following steps:

  1. Open a terminal. You need to create the topic before you can send messages to it in NiFi. Enter the following command:
    bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 3 --topic users

    The preceding command is slightly different...

Differentiating stream processing from batch processing

While the processing tools don't change whether you are processing streams or batches, there are two things you should keep in mind when processing streams – unbounded data and time.

Data can be bounded or unbounded. Bounded data has an end, whereas unbounded data is constantly created and is possibly infinite. Bounded data is last year's sales of widgets. Unbounded data is a traffic sensor counting cars and recording their speeds on the highway.

Why is this important in building data pipelines? Because with bounded data, you will know everything about the data. You can see it all at once. You can query it, put it in a staging environment, and then run Great Expectations on it to get a sense of the ranges, values, or other metrics to use in validation as you process your data.
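As an illustration of that bounded-data workflow, the following is a minimal sketch using the pandas-based Great Expectations API covered earlier in the book; the sales figures, column name, and validation range here are made up for the example:

import pandas as pd
import great_expectations as ge

# A bounded dataset: last year's widget sales are complete, so you can see it all at once
sales = pd.DataFrame({'widgets_sold': [120, 95, 143, 88, 101]})

# Wrap the DataFrame so expectations can be run against it
df = ge.from_pandas(sales)

# Because the data is bounded, you can profile it and derive validation ranges up front
result = df.expect_column_values_to_be_between('widgets_sold', min_value=0, max_value=200)
print(result)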

With unbounded data, it is streaming in and you don't know what the next piece of data will look like. This doesn't mean...

Producing and consuming with Python

You can create producers and consumers for Kafka using Python. There are multiple Kafka Python libraries – Kafka-Python, PyKafka, and Confluent Python Kafka. In this section, I will use Confluent Python Kafka, but if you want to use an open source, community-based library, you can use Kafka-Python. The principles and structure of the Python programs will be the same no matter which library you choose.

To install the library, you can use pip. The following command will install it:

pip3 install confluent-kafka

Once the library has finished installing, you can use it by importing it into your applications. The following sections will walk through writing a producer and consumer.
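As a quick check that the install succeeded, you can import the library and print its version (a small sketch, not taken from the book's text):

import confluent_kafka

# Prints the client version tuple, confirming the library is importable
print(confluent_kafka.version())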

Writing a Kafka producer in Python

To write a producer in Python, you will create a producer, send data, and listen for acknowledgements. In the previous examples, you used Faker to create fake data about people. You will use it again to generate the data...
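The rest of the producer code falls outside this excerpt, but the following is a minimal sketch of what such a producer could look like, assuming the confluent-kafka library installed above, the users topic created earlier, a broker on localhost:9092, and Faker for the fake people. The exact fields generated here are illustrative, not necessarily the book's schema:

import json
from confluent_kafka import Producer
from faker import Faker

fake = Faker()

# Connect to the local broker used throughout this chapter
producer = Producer({'bootstrap.servers': 'localhost:9092'})

def receipt(err, msg):
    # Delivery callback: called once per message when the broker acknowledges it
    if err is not None:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}')

for _ in range(10):
    # Generate a fake person, as in the earlier Faker examples
    data = {'name': fake.name(), 'age': fake.random_int(min=18, max=80),
            'street': fake.street_address(), 'city': fake.city()}
    producer.produce('users', value=json.dumps(data), callback=receipt)
    producer.poll(0)  # serve the delivery callbacks

producer.flush()  # wait for all outstanding messages to be acknowledged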

Summary

In this chapter, you learned the basics of Apache Kafka – from what a log is and how Kafka uses it, to partitions, producers, and consumers. You learned how Apache NiFi can create producers and consumers with a single processor. The chapter took a quick detour to explain how streaming data is unbounded and how time and windowing work with streams. These are important considerations when working with streaming data and can result in errors if you assume you have all the data at one time. Lastly, you learned how to use Confluent Python Kafka to write basic producers and consumers in Python.

Equipped with these skills, the next chapter will show you how to build a real-time data pipeline.
