Data Ingestion with Python Cookbook

Product type: Book
Published: May 2023
Publisher: Packt
ISBN-13: 9781837632602
Pages: 414
Edition: 1st
Author: Gláucia Esppenchutz

Table of Contents (17 chapters)

Preface
Part 1: Fundamentals of Data Ingestion
Chapter 1: Introduction to Data Ingestion
Chapter 2: Principals of Data Access – Accessing Your Data
Chapter 3: Data Discovery – Understanding Our Data before Ingesting It
Chapter 4: Reading CSV and JSON Files and Solving Problems
Chapter 5: Ingesting Data from Structured and Unstructured Databases
Chapter 6: Using PySpark with Defined and Non-Defined Schemas
Chapter 7: Ingesting Analytical Data
Part 2: Structuring the Ingestion Pipeline
Chapter 8: Designing Monitored Data Workflows
Chapter 9: Putting Everything Together with Airflow
Chapter 10: Logging and Monitoring Your Data Ingest in Airflow
Chapter 11: Automating Your Data Ingestion Pipelines
Chapter 12: Using Data Observability for Debugging, Error Handling, and Preventing Downtime
Index
Other Books You May Enjoy

Designing Monitored Data Workflows

Logging is a good practice that allows developers to debug faster and maintain applications or systems more effectively. There is no strict rule about where to insert logs, but it is important to know how to avoid spamming your monitoring or alerting tool. Creating many unnecessary log messages obscures the moments when something significant happens. That’s why it is crucial to understand the best practices for inserting logs into code.

This chapter will show how to create efficient and well-formatted logs for data pipelines using Python and PySpark, with practical examples that can be applied to real-world projects.

In this chapter, we have the following recipes:

  • Inserting logs
  • Using log-level types
  • Creating standardized logs
  • Monitoring our data ingest file size
  • Logging based on data
  • Retrieving SparkSession metrics

Technical requirements

You can find the code from this chapter in the GitHub repository at https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

Inserting logs

As mentioned in the introduction to this chapter, adding logging to your applications is essential for debugging and for making improvements later on. However, creating many log messages without necessity can generate confusion or even cause us to miss crucial alerts. Knowing which kind of message to show, and when, is therefore indispensable.

This recipe will cover how to create helpful log messages using Python and when to insert them.

Getting ready

We will use only Python code. Make sure you have Python version 3.7 or above. You can use the following command to check it on your command-line interface (CLI):

$ python3 --version
Python 3.8.10

The following code execution can be done in a Python shell or a Jupyter notebook.

How to do it…

To perform this exercise, we will write a function that reads and returns the first line of a CSV file, applying logging best practices. Here is how we do it:

  1. First, let’s import the...
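
A minimal sketch of the whole function, assuming the standard logging and csv modules (the function name read_first_line and the log messages are illustrative, not the book's exact code), could look like this:

import csv
import logging

# Show INFO and above on the console
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def read_first_line(path):
    """Return the first row of a CSV file, logging progress and failures."""
    logger.info("Reading file: %s", path)
    try:
        with open(path, newline="") as file:
            first_row = next(csv.reader(file))
    except FileNotFoundError:
        logger.error("File not found: %s", path)
        raise
    except StopIteration:
        logger.warning("File is empty: %s", path)
        return None
    logger.info("First row read successfully: %d columns", len(first_row))
    return first_row

Note that the function logs one message per meaningful event (start, success, and each failure mode) rather than one per row, which keeps the output informative without spamming the monitoring tool.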

Using log-level types

Now that we have seen how and where to insert logs, let’s understand log types, or levels. Each log level has its own degree of relevance within a system. For instance, the console output does not show debug messages by default.

We already covered how to set log levels using PySpark in the Inserting formatted SparkSession logs to facilitate your work recipe in Chapter 6. Now we will do the same using only Python. This recipe aims to show how to set the logging level at the beginning of your script and insert messages at different levels throughout your code, creating a hierarchy of priority for your logs. With this, you can build a structured script that allows you or your team to monitor executions and identify errors.

Getting ready

We will use only Python code. Make sure you have Python version 3.7 or above. You can use the following command on your CLI to check your version:

$ python3 --version
Python 3.8.10

The following code execution can be...
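
A minimal sketch of what this recipe builds toward (the logger name and messages are illustrative) is setting the level once at the top of the script and then logging at each level:

import logging

# Set the minimum level once, at the beginning of the script;
# messages below this level (here, DEBUG) are suppressed
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s:%(name)s:%(message)s",
)
logger = logging.getLogger("ingest")

logger.debug("Row-by-row details")        # hidden at INFO level
logger.info("Ingestion started")
logger.warning("Schema column count changed")
logger.error("Could not connect to the source database")
logger.critical("Aborting the pipeline")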

Creating standardized logs

Now that we know the best practices for inserting logs and using log levels, we can add more relevant information to our logs to help us monitor our code. Information such as the date and time or the module or function executed helps us determine where an issue occurred or where improvements are required.

Creating standardized formatting for application logs or (in our case) data pipeline logs makes the debugging process more manageable, and there are various ways to do this. One way is to create .ini or .conf files that hold the configuration for how logs are formatted, which can then be applied across our Python code.

In this recipe, we will learn how to create a configuration file that will dictate how the logs will be formatted across the code and shown in the execution output.

Getting ready

Let’s use the same code as in the previous Using log-level types recipe, but with some improvements!

You can use the following...
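
One minimal sketch of this approach uses the standard logging.config.fileConfig format; the file name logging.ini and the exact formatter fields below are assumptions, not the book's exact configuration:

from pathlib import Path
import logging
import logging.config

# Hypothetical logging.ini; the sections follow the standard
# fileConfig layout (loggers, handlers, formatters)
CONFIG = """\
[loggers]
keys=root

[handlers]
keys=console

[formatters]
keys=standard

[logger_root]
level=INFO
handlers=console

[handler_console]
class=StreamHandler
level=INFO
formatter=standard
args=(sys.stdout,)

[formatter_standard]
format=%(asctime)s %(module)s %(funcName)s %(levelname)s: %(message)s
"""

Path("logging.ini").write_text(CONFIG)
logging.config.fileConfig("logging.ini")

logger = logging.getLogger(__name__)
logger.info("Every message now carries a timestamp, module, and function")

Because the format lives in one file, every module that loads it emits logs with the same structure, which is what makes cross-pipeline debugging manageable.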

Monitoring our data ingest file size

When ingesting data, we can track a few items to ensure the incoming information is what we expect. One of the most important of these is the size of the data we are ingesting, which can mean file size or the size of chunks of streaming data.

Logging the size of incoming data enables intelligent and efficient monitoring. If at some point that size diverges from what is expected, we can take action to investigate and resolve the issue.

In this recipe, we will create simple Python code that logs the size of ingested files, which is very valuable for data monitoring.

Getting ready

We will use only Python code. Make sure you have Python version 3.7 or above. You can use the following command on your CLI to check your version:

$ python3 --version
Python 3.8.10

The following code execution can be done using a Python shell or a Jupyter notebook.

How to do it…

This exercise will create...
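
A minimal sketch of such a size check, assuming os.path.getsize and hypothetical size bounds (the constants and logger name below are illustrative), could look like this:

import logging
import os

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("ingest.size")

# Hypothetical bounds; tune them to what your pipeline normally receives
MIN_EXPECTED_BYTES = 1_000
MAX_EXPECTED_BYTES = 50_000_000

def log_file_size(path):
    """Log the size of an ingested file and warn when it is out of range."""
    size = os.path.getsize(path)
    logger.info("file=%s size_bytes=%d", path, size)
    if not MIN_EXPECTED_BYTES <= size <= MAX_EXPECTED_BYTES:
        logger.warning("file=%s size_bytes=%d is outside the expected range",
                       path, size)
    return size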

Logging based on data

As mentioned in the Monitoring our data ingest file size recipe, logging our ingestions is a good practice in the data field. There are several ways to explore our ingestion logs to increase the process’s reliability and our confidence in it. In this recipe, we will start to get into the data operations (DataOps) field, where the goal is to track the behavior of data from its source until it reaches its final destination.

This recipe will explore other metrics we can track to create a reliable data pipeline.

Getting ready

For this exercise, let’s imagine we have two simple data ingests, one from a database and another from an API. Since this is a straightforward pipeline, let’s visualize it with the following diagram:

Figure 8.16 – Data ingestion phases

With this in mind, let’s explore the instances we can log to make monitoring efficient.

How to do it…

Let’s define...
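
A minimal sketch of logging per-source metrics for the two ingests (the metric names, logger name, and sample payloads are illustrative) could look like this:

import logging
from datetime import datetime, timezone

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("ingest.metrics")

def log_ingest_metrics(source_name, rows):
    """Log per-source metrics for one batch of ingested rows."""
    logger.info(
        "source=%s rows=%d finished_at=%s",
        source_name, len(rows),
        datetime.now(timezone.utc).isoformat(),
    )
    if not rows:
        logger.warning("source=%s returned no rows", source_name)

# Sample payloads standing in for the database and API ingests
log_ingest_metrics("database", [{"id": 1}, {"id": 2}])
log_ingest_metrics("api", [])

Emitting the same key=value fields for every source makes it easy for an alerting tool to compare the database and API ingests and flag the one that diverges.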

Retrieving SparkSession metrics

Until now, we have created our own logs to provide more information and make monitoring more useful. Logging allows us to build customized metrics based on the needs of our pipeline and code. However, we can also take advantage of the built-in metrics that frameworks and programming languages provide.

When we create a SparkSession, Spark provides a web UI with useful metrics for monitoring our pipelines. The following recipe shows you how to access and retrieve metric information from a SparkSession and use it as a tool when ingesting or processing a DataFrame.

Getting ready

You can execute this recipe using the PySpark command line or a Jupyter notebook.

Before exploring the Spark UI metrics, let’s create a simple SparkSession using the following code:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
      .master("local[1]") \
      ...
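
As a hedged sketch, a completed builder might look like the following; the application name metrics-demo is a placeholder, spark.sparkContext.uiWebUrl returns the address of the session’s web UI (typically http://localhost:4040), and the small job at the end simply gives the UI some metrics to display:

from pyspark.sql import SparkSession

# "metrics-demo" is a placeholder application name
spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("metrics-demo")
    .getOrCreate()
)

# The web UI exposes job, stage, storage, and executor metrics
# for this session
print(spark.sparkContext.uiWebUrl)

# Run a small job so the UI has metrics to display
spark.range(1_000_000).selectExpr("sum(id)").show()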