Data Ingestion with Python Cookbook

Implementing data replication

Data replication is a process applied in data environments to create multiple copies of data and store them in different locations, servers, or sites. This technique is commonly implemented to improve availability and to avoid data loss in the event of downtime or even a natural disaster affecting a data center.

Getting ready

Across papers and articles, you will find different types of (and even different names for) data replication strategies. In this recipe, you will learn how to decide which kind of replication best suits your application or software.

How to do it…

Let’s begin by building the fundamental pillars for implementing data replication:

  1. First, we need to decide the scope of our replication: it can cover a portion of the stored data or all of it.
  2. The next step is to consider when replication will take place. It can be done synchronously, as soon as new data arrives in storage, or within a specific timeframe.
  3. The last fundamental pillar is whether the data is replicated incrementally or in bulk.

In the end, we will have a diagram that looks like the following:

Figure 1.21 – A data replication model decision diagram
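
Before examining each question in detail, it can help to capture the three decisions in code. The following is a minimal sketch, with illustrative enum and class names that are not from the book, of how a replication plan could be modeled in Python:

from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    COMPLETE = "complete"    # replicate all stored data
    PARTIAL = "partial"      # replicate only a subset of the data

class Frequency(Enum):
    REAL_TIME = "real_time"  # synchronously, as new data arrives
    SCHEDULED = "scheduled"  # within a specific timeframe

class Mode(Enum):
    INCREMENTAL = "incremental"  # only new or changed records
    BULK = "bulk"                # a full copy of the current data

@dataclass
class ReplicationPlan:
    scope: Scope
    frequency: Frequency
    mode: Mode

# Example: replicate everything, on a schedule, copying only new records
plan = ReplicationPlan(Scope.COMPLETE, Frequency.SCHEDULED, Mode.INCREMENTAL)
print(plan)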

How it works…

Analyzing the preceding figure, we have three main questions to answer, regarding the scope, the frequency, and whether our replication will be incremental or bulk.

For the first question, we decide whether the replication will be complete or partial. In other words, either all the data will be replicated, no matter what type of transaction or change was made, or only a portion of it will be. A real-world example would be keeping track of all store sales versus only the most expensive ones.
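
To make the partial case concrete, the following sketch uses Python's built-in sqlite3 module with hypothetical source and replica databases, replicating only the sales above a price threshold:

import sqlite3

# Hypothetical databases; assumes the source already has a sales table
source = sqlite3.connect("source.db")
replica = sqlite3.connect("replica.db")

replica.execute(
    "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)"
)

# Partial replication: copy only the most expensive sales
rows = source.execute("SELECT id, amount FROM sales WHERE amount > 1000")
replica.executemany(
    "INSERT OR REPLACE INTO sales (id, amount) VALUES (?, ?)", rows
)
replica.commit()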

The second question, related to frequency, is deciding when replication needs to be done. This decision also needs to take the related costs into consideration. Real-time replication is often more expensive, but its synchronicity guarantees almost no data inconsistency.

Lastly, it is relevant to consider how data will be transported to the replication site. In most cases, a scheduler running a script can replicate small data batches and reduce transportation costs. However, bulk replication can be used in the data ingestion process, such as copying all of the current batch’s raw data from a source to cold storage.
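
A minimal sketch of that scheduler-plus-script approach might look like the following; it assumes a source table with a monotonically increasing id column and uses a small state file as a watermark (all names here are illustrative):

import sqlite3
from pathlib import Path

WATERMARK_FILE = Path("last_replicated_id.txt")  # hypothetical state file

def read_watermark() -> int:
    return int(WATERMARK_FILE.read_text()) if WATERMARK_FILE.exists() else 0

def replicate_increment(source_db: str, replica_db: str) -> None:
    """Copy only the rows created since the last run (incremental replication)."""
    last_id = read_watermark()
    with sqlite3.connect(source_db) as src, sqlite3.connect(replica_db) as dst:
        dst.execute(
            "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT)"
        )
        rows = src.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,)
        ).fetchall()
        dst.executemany(
            "INSERT OR REPLACE INTO events (id, payload) VALUES (?, ?)", rows
        )
        if rows:
            WATERMARK_FILE.write_text(str(rows[-1][0]))

# A cron entry or an Airflow task could call this on a schedule, for example:
# replicate_increment("source.db", "replica.db")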

There’s more…

One method of data replication that has seen increased use in the past few years is cold storage, which is used to retain data that is accessed infrequently or is even inactive. The costs related to this type of replication are low, and it guarantees data longevity. You can find cold storage solutions in all the major cloud providers, such as Amazon Glacier, Azure Cool Blob, and Google Cloud Storage Nearline.
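
As an illustration, the following sketch uses boto3 to copy a batch’s raw data file into Amazon S3 under the Glacier storage class; the bucket, key, and file names are placeholders, and configured AWS credentials are assumed:

import boto3

# Hypothetical bucket and object names; assumes AWS credentials are configured
s3 = boto3.client("s3")
s3.upload_file(
    Filename="raw/current_batch.parquet",   # the batch's raw data on disk
    Bucket="my-cold-storage-bucket",
    Key="archive/current_batch.parquet",
    ExtraArgs={"StorageClass": "GLACIER"},  # store the copy in cold storage
)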

Besides replication, regulatory compliance requirements such as the General Data Protection Regulation (GDPR) benefit from this type of storage, since in some scenarios users’ data needs to be kept for several years.

In this chapter, we explored the basic concepts and laid the foundation for the following chapters and recipes in this book. We started with a Python installation, prepared our Docker containers, and covered data governance and replication concepts. You will observe over the upcoming chapters that almost all of these topics interconnect, and you will understand the relevance of grasping them at the beginning of the ETL process.
