Data Ingestion with Python Cookbook

Product type: Book
Published: May 2023
Publisher: Packt
ISBN-13: 9781837632602
Pages: 414
Edition: 1st
Author: Gláucia Esppenchutz

Table of Contents (17 chapters)

Preface
Part 1: Fundamentals of Data Ingestion
Chapter 1: Introduction to Data Ingestion
Chapter 2: Principals of Data Access – Accessing Your Data
Chapter 3: Data Discovery – Understanding Our Data before Ingesting It
Chapter 4: Reading CSV and JSON Files and Solving Problems
Chapter 5: Ingesting Data from Structured and Unstructured Databases
Chapter 6: Using PySpark with Defined and Non-Defined Schemas
Chapter 7: Ingesting Analytical Data
Part 2: Structuring the Ingestion Pipeline
Chapter 8: Designing Monitored Data Workflows
Chapter 9: Putting Everything Together with Airflow
Chapter 10: Logging and Monitoring Your Data Ingest in Airflow
Chapter 11: Automating Your Data Ingestion Pipelines
Chapter 12: Using Data Observability for Debugging, Error Handling, and Preventing Downtime
Index
Other Books You May Enjoy

Designing Monitored Data Workflows

Logging is a good practice that allows developers to debug faster and maintain applications or systems more effectively. There is no strict rule about where to insert logs, but it is important to know how to avoid spamming your monitoring or alerting tool. Creating many unnecessary log messages obscures the moments when something significant happens. That’s why it is crucial to understand the best practices for inserting logs into code.

This chapter will show how to create efficient and well-formatted logs for data pipelines using Python and PySpark, with practical examples that can be applied to real-world projects.

In this chapter, we have the following recipes:

  • Inserting logs
  • Using log-level types
  • Creating standardized logs
  • Monitoring our data ingest file size
  • Logging based on data
  • Retrieving SparkSession metrics

Technical requirements

You can find the code from this chapter in the GitHub repository at https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

Inserting logs

As mentioned in the introduction to this chapter, adding logging to your applications is essential for debugging and for making improvements later on. However, creating many log messages without necessity can generate confusion or even cause us to miss crucial alerts. Knowing which kind of message to show, and when, is therefore indispensable.

This recipe will cover how to create helpful log messages using Python and when to insert them.

Getting ready

We will use only Python code. Make sure you have Python version 3.7 or above. You can use the following command to check it on your command-line interface (CLI):

$ python3 --version
Python 3.8.10

The following code execution can be done in a Python shell or a Jupyter notebook.

How to do it…

To perform this exercise, we will write a function that reads and returns the first line of a CSV file, applying logging best practices. Here is how we do it:

  1. First, let’s import the...
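
A minimal sketch of the whole function, assuming the standard logging and csv modules (the function name read_first_line and the log messages are illustrative, not the book's exact code), could look like this:

import csv
import logging

# Show INFO and above on the console
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def read_first_line(path):
    """Return the first row of a CSV file, logging progress and failures."""
    logger.info("Reading file: %s", path)
    try:
        with open(path, newline="") as file:
            first_row = next(csv.reader(file))
    except FileNotFoundError:
        logger.error("File not found: %s", path)
        raise
    except StopIteration:
        logger.warning("File is empty: %s", path)
        return None
    logger.info("First row read successfully: %d columns", len(first_row))
    return first_row

Note that the function logs one message per meaningful event (start, success, and each failure mode) rather than one per row, which keeps the output informative without spamming the monitoring tool.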

Using log-level types

Now that we have seen how and where to insert logs, let’s understand log types, or levels. Each log level has its own degree of relevance within a system. For instance, the console output does not show debug messages by default.

We already covered how to set log levels using PySpark in the Inserting formatted SparkSession logs to facilitate your work recipe in Chapter 6. Now we will do the same using only Python. This recipe aims to show how to set the logging level at the beginning of your script and insert messages at different levels throughout your code, creating a hierarchy of priority for your logs. With this, you can build a structured script that allows you or your team to monitor executions and identify errors.

Getting ready

We will use only Python code. Make sure you have Python version 3.7 or above. You can use the following command on your CLI to check your version:

$ python3 --version
Python 3.8.10

The following code execution can be...
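
A minimal sketch of what this recipe builds toward (the logger name and messages are illustrative) is setting the level once at the top of the script and then logging at each level:

import logging

# Set the minimum level once, at the beginning of the script;
# messages below this level (here, DEBUG) are suppressed
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s:%(name)s:%(message)s",
)
logger = logging.getLogger("ingest")

logger.debug("Row-by-row details")        # hidden at INFO level
logger.info("Ingestion started")
logger.warning("Schema column count changed")
logger.error("Could not connect to the source database")
logger.critical("Aborting the pipeline")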

Creating standardized logs

Now that we know the best practices for inserting logs and using log levels, we can add more relevant information to our logs to help us monitor our code. Information such as the date and time or the module or function executed helps us determine where an issue occurred or where improvements are required.

Creating standardized formatting for application logs or (in our case) data pipeline logs makes the debugging process more manageable, and there are various ways to do this. One way is to create .ini or .conf files that hold the configuration for how logs are formatted, which can then be applied across our Python code.

In this recipe, we will learn how to create a configuration file that will dictate how the logs will be formatted across the code and shown in the execution output.

Getting ready

Let’s use the same code as in the previous Using log-level types recipe, but with some improvements!

You can use the following...
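
One minimal sketch of this approach uses the standard logging.config.fileConfig format; the file name logging.ini and the exact formatter fields below are assumptions, not the book's exact configuration:

from pathlib import Path
import logging
import logging.config

# Hypothetical logging.ini; the sections follow the standard
# fileConfig layout (loggers, handlers, formatters)
CONFIG = """\
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=standard

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=standard
args=(sys.stdout,)

[formatter_standard]
format=%(asctime)s %(module)s %(funcName)s %(levelname)s: %(message)s
"""

Path("logging.ini").write_text(CONFIG)
logging.config.fileConfig("logging.ini")

logger = logging.getLogger(__name__)
logger.info("Every message now carries a timestamp, module, and function")

Because the format lives in one file, every module that loads it emits logs with the same structure, which is what makes cross-pipeline debugging manageable.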

Monitoring our data ingest file size

When ingesting data, we can track a few items to ensure the incoming information is what we expect. One of the most important of these is the size of the data we are ingesting, which can mean file size or the size of chunks of streaming data.

Logging the size of incoming data enables intelligent and efficient monitoring. If at some point that size diverges from what is expected, we can take action to investigate and resolve the issue.

In this recipe, we will create simple Python code that logs the size of ingested files, which is very valuable for data monitoring.

Getting ready

We will use only Python code. Make sure you have Python version 3.7 or above. You can use the following command on your CLI to check your version:

$ python3 --version
Python 3.8.10

The following code execution can be done using a Python shell or a Jupyter notebook.

How to do it…

This exercise will create...
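
A minimal sketch of such a size check, assuming os.path.getsize and hypothetical size bounds (the constants and logger name below are illustrative), could look like this:

import logging
import os

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("ingest.size")

# Hypothetical bounds; tune them to what your pipeline normally receives
MIN_EXPECTED_BYTES = 1_000
MAX_EXPECTED_BYTES = 50_000_000

def log_file_size(path):
    """Log the size of an ingested file and warn when it is out of range."""
    size = os.path.getsize(path)
    logger.info("file=%s size_bytes=%d", path, size)
    if not MIN_EXPECTED_BYTES <= size <= MAX_EXPECTED_BYTES:
        logger.warning("file=%s size_bytes=%d is outside the expected range",
                       path, size)
    return size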

Logging based on data

As mentioned in the Monitoring our data ingest file size recipe, logging our ingestions is a good practice in the data field. There are several ways to explore our ingestion logs to increase the process’s reliability and our confidence in it. In this recipe, we will start to get into the data operations (DataOps) field, where the goal is to track the behavior of data from its source until it reaches its final destination.

This recipe will explore other metrics we can track to create a reliable data pipeline.

Getting ready

For this exercise, let’s imagine we have two simple data ingests, one from a database and another from an API. Since this is a straightforward pipeline, let’s visualize it with the following diagram:

Figure 8.16 – Data ingestion phases

With this in mind, let’s explore the instances we can log to make monitoring efficient.

How to do it…

Let’s define...
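
A minimal sketch of logging per-source metrics for the two ingests (the metric names, logger name, and sample payloads are illustrative) could look like this:

import logging
from datetime import datetime, timezone

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("ingest.metrics")

def log_ingest_metrics(source_name, rows):
    """Log per-source metrics for one batch of ingested rows."""
    logger.info(
        "source=%s rows=%d finished_at=%s",
        source_name, len(rows),
        datetime.now(timezone.utc).isoformat(),
    )
    if not rows:
        logger.warning("source=%s returned no rows", source_name)

# Sample payloads standing in for the database and API ingests
log_ingest_metrics("database", [{"id": 1}, {"id": 2}])
log_ingest_metrics("api", [])

Emitting the same key=value fields for every source makes it easy for an alerting tool to compare the database and API ingests and flag the one that diverges.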

Retrieving SparkSession metrics

Until now, we have created our own logs to provide more information and make monitoring more useful. Logging allows us to build customized metrics based on the needs of our pipeline and code. However, we can also take advantage of the built-in metrics that frameworks and programming languages provide.

When we create a SparkSession, Spark provides a web UI with useful metrics for monitoring our pipelines. The following recipe shows you how to access and retrieve metric information from a SparkSession and use it as a tool when ingesting or processing a DataFrame.

Getting ready

You can execute this recipe using the PySpark command line or a Jupyter notebook.

Before exploring the Spark UI metrics, let’s create a simple SparkSession using the following code:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
      .master("local[1]") \
      ...
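
As a hedged sketch, a completed builder might look like the following; the application name metrics-demo is a placeholder, spark.sparkContext.uiWebUrl returns the address of the session’s web UI (typically http://localhost:4040), and the small job at the end simply gives the UI some metrics to display:

from pyspark.sql import SparkSession

# "metrics-demo" is a placeholder application name
spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("metrics-demo")
    .getOrCreate()
)

# The web UI exposes job, stage, storage, and executor metrics
# for this session
print(spark.sparkContext.uiWebUrl)

# Run a small job so the UI has metrics to display
spark.range(1_000_000).selectExpr("sum(id)").show()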