Modern Data Architectures with Python
By Brian Lipp. Published by Packt, September 2023. ISBN-13: 9781801070492. 318 pages, 1st edition.

Table of Contents

Preface
Part 1: Fundamental Data Knowledge
Chapter 1: Modern Data Processing Architecture
Chapter 2: Understanding Data Analytics
Part 2: Data Engineering Toolset
Chapter 3: Apache Spark Deep Dive
Chapter 4: Batch and Stream Data Processing Using PySpark
Chapter 5: Streaming Data with Kafka
Part 3: Modernizing the Data Platform
Chapter 6: MLOps
Chapter 7: Data and Information Visualization
Chapter 8: Integrating Continuous Integration into Your Workflow
Chapter 9: Orchestrating Your Data Workflows
Part 4: Hands-on Project
Chapter 10: Data Governance
Chapter 11: Building out the Groundwork
Chapter 12: Completing Our Project
Index
Other Books You May Enjoy

Batch and Stream Data Processing Using PySpark

When setting up your architecture, you decided whether to support batch processing, streaming, or both. This chapter will go through the ins and outs of batch and stream processing with Apache Spark using Python. Spark can be your go-to tool for moving and processing data at scale. We will also look at DataFrames and how to use them in both types of data processing.

In this chapter, we’re going to cover the following main topics:

  • Batch processing
  • Working with schemas
  • User-defined functions (UDFs)
  • Stream processing

Technical requirements

The tooling that will be used in this chapter is tied to the tech stack that was chosen for the book. All vendors should offer a free trial account.

I will be using the following:

  • Databricks
  • AWS or Azure

Setting up your environment

Before we begin this chapter, let’s take some time to set up our working environment.

Python, AWS, and Databricks

As in the previous chapters, this chapter assumes you have a working version of Python 3.6+ installed in your development environment. We will also assume you have set up an AWS account and have set up Databricks with that AWS account.

Databricks CLI

The first step is to install the databricks-cli tool using the pip Python package manager:

pip install databricks-cli

Let's validate that everything has been installed correctly. If the following command prints the tool's version, everything is working:

databricks --version

Now, let's set up authentication. First, go into the Databricks UI and generate a personal access token. The following command will prompt for the host of your Databricks instance and the token you just created:

databricks configure --token
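If the command succeeds, the CLI stores your credentials in a local profile file, typically ~/.databrickscfg. As a rough illustration (the host and token below are placeholders, not real values), it looks something like this:

[DEFAULT]
host = https://<your-instance>.cloud.databricks.com
token = <your-personal-access-token>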

We can determine...

Batch processing

Batch processing is the most common form of data processing, and for most companies, it is their bread-and-butter approach to data. Batch processing is data processing done at a "triggered" pace; the trigger may be manual or based on a schedule. Streaming, on the other hand, attempts to trigger processing as quickly as possible as new data arrives, which in Spark takes the form of micro-batch processing. Streaming can exist in different ways on different systems. In Spark, streaming is designed to look and work like batch processing but without the need to constantly trigger the job.

In this section, we will set up some fake data for our examples using the Faker Python library. Faker is used purely for example purposes and isn't essential to the learning process; if you prefer an alternative way to generate data, please feel free to use that instead:

from faker import Faker
import pandas as pd
import random
fake = Faker()
def generate_data...
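The preceding snippet is cut off; the following is only a minimal sketch of what such a generator might look like, assuming hypothetical fields and a pandas return value rather than the book's exact code:

from faker import Faker
import pandas as pd
import random

fake = Faker()

def generate_data(num):
    # Build `num` rows of fake user records; the field choices here are illustrative.
    rows = [{"name": fake.name(),
             "address": fake.address(),
             "city": fake.city(),
             "state": fake.state(),
             "balance": round(random.uniform(0, 10_000), 2)}
            for _ in range(num)]
    return pd.DataFrame(rows)

# Example usage: 100 fake rows as a pandas DataFrame.
sample_df = generate_data(100)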

Spark schemas

Spark only supports schema on read and write, so you will likely find it necessary to define your schema manually. Spark has many data types. Once you know how to represent schemas, it becomes rather easy to create data structures.

One thing to keep in mind is that when you define a schema in Spark, you must also set its nullability. When a column is allowed to have nulls, we set it to True; by doing this, no errors will be thrown by Spark when a null or empty field is present. When we define a StructField, we set three main components: the name, the data type, and the nullability. When we set the nullability to False, Spark will throw an error when null data is added to the DataFrame. It can be useful to limit nulls when defining the schema, but keep in mind that throwing an error isn't always the ideal reaction at every stage of a data pipeline.
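As a short, hedged illustration of those three components (the column names here are hypothetical, not from the book):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Each StructField takes a name, a data type, and a nullability flag.
user_schema = StructType([
    StructField("name", StringType(), True),      # nulls allowed
    StructField("age", IntegerType(), True),      # nulls allowed
    StructField("user_id", StringType(), False),  # declared non-nullable
])

# The schema can then be applied on read, for example:
# df = spark.read.schema(user_schema).json("dbfs:/tmp/users.json")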

When working with data pipelines, the discussion about dynamic schema and static schema will often come...

The UDF

One very powerful tool to consider using with Spark is the UDF (user-defined function). UDFs are custom per-row transformations written in native Python that run in parallel on your data. The obvious question is, why not use only UDFs? After all, they are more flexible. The answer is that there is a hierarchy of tools you should prefer, for speed reasons; speed is a significant consideration and should not be ignored. Ideally, you should get the most bang for your buck using the Python DataFrame API and its native functions and methods. DataFrames go through many optimizations, so they are ideally suited for semi-structured and structured data. The methods and functions Spark provides are also heavily optimized and designed for the most common data processing tasks. If you find a case where you just can't do what is required with the native functions and methods, you may be forced to write UDFs. UDFs are slower because Spark can't optimize them. They take your native language code and serialize it into...
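To make the trade-off concrete, here is a minimal UDF sketch (the transformation itself is hypothetical); the equivalent built-in function, upper(), stays inside Spark's optimized engine and will generally be faster:

from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

# A custom per-row transformation written in native Python.
@udf(returnType=StringType())
def shout(value):
    return value.upper() + "!" if value else None

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# The UDF runs row by row and can't be optimized by Catalyst.
df.select(shout(col("name")).alias("loud_name")).show()

# The built-in equivalent is optimized by Spark.
df.select(upper(col("name")).alias("loud_name")).show()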

Stream processing

Streaming is a very useful mode of processing data, but it can come with a large amount of complexity. One thing a purist must consider is that Spark doesn't do "pure streaming": Spark does micro-batch data processing. It will load whatever the new messages are and run a batch process on them in a continuous loop while checking for new data. A pure streaming data processing engine such as Apache Flink, by contrast, processes each new piece of data as it arrives. So, as a simple example, let's say there are 100 new messages in a Kafka queue; Spark would process all of them in one micro-batch, whereas Flink would process each message separately.

Spark Structured Streaming is a DataFrame API for streaming, built on the Spark SQL engine; it supersedes the older RDD-based Spark Streaming (DStream) API, much as the DataFrame API supersedes working directly with RDDs. Streaming DataFrames are optimized just like normal DataFrames, so I suggest always using Structured Streaming over the legacy Spark Streaming API. Also, Spark...
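As a small, self-contained sketch of the Structured Streaming API, using Spark's built-in rate source and console sink purely for illustration (neither appears in the book's example):

# Read from a built-in test source that emits `rowsPerSecond` rows per second.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Write each micro-batch to the console, triggered every 10 seconds.
query = (stream_df.writeStream
         .format("console")
         .outputMode("append")
         .trigger(processingTime="10 seconds")
         .start())

# query.stop()  # stop the continuous micro-batch loop when finished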

Practical lab

Your team has been given a new data source that delivers Parquet files to DBFS. These files could arrive every minute or once a day; the rate and speed will vary. This data must be updated once every hour if any new data has been delivered.

Setup

Let’s set up our environment and create some fake data using Python.

Setting up folders

The following code can be run in a notebook. Here, I am using the shell magic to accomplish this:

%sh
rm -rf /dbfs/tmp/chapter_4_lab_test_data
rm -rf /dbfs/tmp/chapter_4_lab_bronze
rm -rf /dbfs/tmp/chapter_4_lab_silver
rm -rf /dbfs/tmp/chapter_4_lab_gold

Creating fake data

Use the following code to create fake data for our problems:

from faker import Faker
fake = Faker()
def generate_data(num):
    row = [{"name":fake.name(),
           "address":fake.address(),
           "city"...

Solution

Now, let’s explore the solutions to the aforementioned problems.

Solution 1

Here, we’re setting up the initial read of our test data. We are dynamically loading the schema from the file. In test cases, this is fine, but in production workloads, this is a very bad practice. I recommend statically writing out your schema:

location = "dbfs:/tmp/chapter_4_lab_test_data"
fmt = "parquet"
schema = spark.read.format(fmt).load(location).schema
users = spark.readStream.schema(schema).format(fmt).load(location)

Now, the code will populate a bronze table from the data being loaded, writing in append mode:

bronze_schema = users.schema
bronze_location = "dbfs:/tmp/chapter_4_lab_bronze"
checkpoint_location = f"{bronze_location}/_checkpoint"
output_mode = "append"
bronze_query = users.writeStream.format("delta").trigger(once=True).option("checkpointLocation", bronze_location...
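The call above is cut off; assuming it continues with the checkpoint location, output mode, and bronze path defined just before it, the complete write might look roughly like this:

# A sketch of the complete bronze write, using the variables defined above.
bronze_query = (users.writeStream
                .format("delta")
                .trigger(once=True)
                .option("checkpointLocation", checkpoint_location)
                .outputMode(output_mode)
                .start(bronze_location))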

Summary

We have come a long way, friends! In this chapter, we embarked on a comprehensive journey through the world of data processing in Apache Spark with Python, exploring both batch processing and streaming data processing techniques and uncovering the strengths and nuances of each approach.

The chapter began with a deep dive into batch processing, where data is processed in fixed-sized chunks. We learned how to work with DataFrames in Spark, perform transformations and actions, and leverage optimizations for efficient data processing.

Moving on to the fascinating realm of stream processing, we learned about the nuances of Spark Structured Streaming, which enables the continuous processing of real-time data streams. Understanding the distinction between micro-batch processing and true streaming clarified how Spark processes streaming data effectively. This chapter highlighted the importance of defining schemas and...
