Automating Your Data Ingestion Pipelines

Data sources are frequently updated, which means our data lake must be updated too. With multiple sources or projects, however, triggering data pipelines manually becomes impractical. Data pipeline automation makes ingesting and processing data mechanical, removing the need for a human to trigger each run. Configuring automation properly streamlines the data flow and improves data quality, reducing errors and inconsistencies.

In this chapter, we will cover how to automate data ingestion pipelines in Airflow, along with two essential topics in data engineering, data replication and historical data ingestion, as well as the best practices for each.

In this chapter, we will cover the following recipes:

  • Scheduling daily ingestions
  • Scheduling historical data ingestion
  • Scheduling data replication
  • Setting up the schedule_interval parameter
  • Solving scheduling errors

Technical requirements

You can find the code from this chapter in the GitHub repository at https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook/tree/main/Chapter_11.

Installing and running Airflow

This chapter requires that Airflow is installed on your local machine. You can install it directly on your Operating System (OS) or use a Docker image. For more information, refer to the Configuring Docker for Airflow recipe in Chapter 1.

After following the steps described in Chapter 1, ensure your Airflow instance runs correctly. You can do that by checking the Airflow UI at http://localhost:8080.
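You can also check from the terminal (a suggestion of mine, not a step from the book): Airflow 2's webserver exposes a /health endpoint that reports the status of the metadatabase and the scheduler:

$ curl http://localhost:8080/health

A healthy instance returns a small JSON document with a "healthy" status for both components.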

If you are using a Docker container (as I am) to host your Airflow application, you can check its status in the terminal with the following command:

$ docker ps

Here is the status of the container:

Figure 11.1 – Airflow containers running

Alternatively, you can check the container status in Docker Desktop:

Figure 11.2 – Docker Desktop showing Airflow running containers

Scheduling daily ingestions

Data constantly changes in our dynamic world, with new information being added every day and even every second. Therefore, it is crucial to regularly update our data lake to reflect the latest scenarios and information.

Managing multiple projects or pipelines concurrently and manually triggering them while integrating new data from various sources can be daunting. To alleviate this issue, we can rely on schedulers, and Airflow provides a straightforward solution for this purpose.

In this recipe, we will create a simple Directed Acyclic Graph (DAG) in Airflow and explore how to use its parameters to schedule a pipeline to run daily.

Getting ready

Please refer to the Technical requirements section, as this recipe uses the same technology described there.

In this exercise, we will create a simple DAG. The structure of your Airflow folder should look like the following:

Figure 11.3 – daily_ingestion_dag DAG folder structure
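The full steps come in the book's How to do it… section; as a minimal, hedged sketch of the kind of DAG this recipe builds (the dag_id matches the folder in Figure 11.3, but the task and callable names are illustrative assumptions, not the book's exact code), a daily schedule looks like this:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_data():
    # Placeholder for the actual ingestion logic
    print("Ingesting today's data...")

# Arguments applied to every task in the DAG
default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_ingestion_dag",
    default_args=default_args,
    start_date=datetime(2023, 5, 1),  # first schedulable date
    schedule_interval="@daily",       # trigger once per day
    catchup=False,                    # skip runs for past dates
) as dag:
    PythonOperator(task_id="ingest_data", python_callable=ingest_data)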

Scheduling historical data ingestion

Historical data is data accumulated over a period of time, and it is vital for data-driven organizations, providing valuable insights that support decision-making. For example, a sales company can use historical data from previous marketing campaigns to see how they influenced the sales of a specific product over the years.

This exercise will show how to create a scheduler in Airflow to ingest historical data, covering the best practices and common concerns related to this process.

Getting ready

Please refer to the Technical requirements section, as this recipe uses the same technology described there.

In this exercise, we will create a simple DAG inside our DAGs folder. The structure of your Airflow folder should look like the following:

Figure 11.6 – historical_data_dag folder structure in your local Airflow directory

How to do it…

...
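The steps themselves are omitted in this excerpt. As a rough sketch of the usual Airflow pattern for historical loads (the task name and callable are my assumptions; the dag_id matches the folder in Figure 11.6), a start_date in the past combined with catchup=True makes the scheduler create one run per interval between start_date and today, so each run can ingest its own slice of the history:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_partition(ds=None):
    # Airflow injects 'ds', the run's logical date (e.g., "2023-01-15"),
    # so each backfilled run processes only its own day of history
    print(f"Ingesting historical data for {ds}")

with DAG(
    dag_id="historical_data_dag",
    start_date=datetime(2023, 1, 1),  # beginning of the historical window
    schedule_interval="@daily",
    catchup=True,  # backfill one run per day from start_date onward
) as dag:
    PythonOperator(task_id="ingest_partition", python_callable=ingest_partition)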

Scheduling data replication

In the first chapter of this book, we covered what data replication is and why it is important. We saw how vital this process is for preventing data loss and enabling disaster recovery.

Now, it is time to learn how to create an optimized schedule window to make data replication happen. In this recipe, we will create a diagram to help us decide the best moment to replicate our data.

Getting ready

This exercise does not require any technical preparation. However, to bring it closer to a real scenario, let's imagine we need to decide how best to ensure that a hospital's data is adequately replicated.

We will have two pipelines: one holding patient information and another holding financial information. The first pipeline collects information from a patient database and synthesizes it into readable reports used by the medical team. The second feeds an internal dashboard used by the hospital's executives.

Due to...

Setting up the schedule_interval parameter

One of the most widely used parameters in Airflow DAG scheduler configuration is schedule_interval. Together with start_date, it creates a dynamic and continuous trigger for the pipeline. However, there are some small details we need to pay attention to when setting schedule_interval.

This recipe will cover the different ways to set up the schedule_interval parameter. We will also explore a practical example to see how the scheduling window works in Airflow, making it easier to manage pipeline executions.

Getting ready

While this exercise does not require any technical preparation, it is a good idea to note down when the pipeline is supposed to start and the interval between each trigger.

How to do it…

Here, we will show only the default_args dictionary to avoid code redundancy. However, you can always check out the complete code in the GitHub repository: https://github.com/PacktPublishing/Data-Ingestion...
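As a hedged illustration (the field values are mine, not necessarily the book's), a typical default_args dictionary and the three common forms that schedule_interval accepts look like this:

from datetime import datetime, timedelta

# Arguments shared by every task in the DAG
default_args = {
    "owner": "airflow",
    "start_date": datetime(2023, 5, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# schedule_interval accepts a preset, a cron expression, or a timedelta:
# schedule_interval="@daily"            -> once a day, at midnight
# schedule_interval="0 6 * * *"         -> cron expression (06:00 every day)
# schedule_interval=timedelta(hours=6)  -> every six hours

Keep in mind that Airflow triggers a DAG at the end of its scheduling window: a @daily DAG with a start_date of May 1 has its first run just after midnight on May 2, covering May 1's data.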

Solving scheduling errors

At this point, you may already have experienced scheduled pipelines not being triggered as expected. If not, don't worry; it will happen at some point and is completely normal. With several pipelines running in parallel, in different windows, or attached to different timezones, it is common for their schedules to become entangled with one another.

To avoid this entanglement, in this exercise, we will create a diagram to assist in the debugging process, identify the possible causes of a scheduler not working correctly in Airflow, and see how to solve them.

Getting ready

This recipe does not require any technical preparation. Nevertheless, taking notes and writing down the steps we follow here can be helpful. Writing things down while learning something new helps fix the knowledge in our minds, making it easier to recall later.

Back to our exercise: scheduler errors in Airflow typically leave the DAG with a status of None, as shown here:

Figure 11.15 – DAG in the Airflow UI with an error in the scheduler
...
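The debugging walkthrough is elided here; as a side note of my own (not necessarily the book's approach), the Airflow CLI is a quick way to spot DAGs that failed to import, which is a frequent cause of the None status:

$ airflow dags list-import-errors

If Airflow runs in Docker, prefix the command with docker exec and the scheduler container's name.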