Ingesting Analytical Data

Analytical data is a collection of data that serves various areas of a company, university, or any other institution (such as finance, marketing, and sales) to facilitate decision-making, especially on strategic matters. In the context of a data pipeline or a typical Extract, Transform, and Load (ETL) process, analytical data corresponds to the final step, where the data has already been ingested, cleaned, aggregated, and transformed according to business rules.

There are plenty of scenarios where data engineers must retrieve data from a data warehouse or other storage containing analytical data. The objective of this chapter is to learn how to read analytical data in its standard formats and to cover practical use cases related to the reverse ETL concept.

In this chapter, we will learn about the following topics:

  • Ingesting Parquet files
  • Ingesting Avro files
  • Applying schemas to analytical data
  • Filtering data and handling common issues
  • Ingesting partitioned data
  • Applying reverse ETL
  • Selecting analytical data for reverse ETL

Technical requirements

As in Chapter 6, some recipes in this chapter will need an initialized SparkSession, and you can use the same session for all of them. You can use the following code to create your session:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("chapter7_analytical_data") \
      .config("spark.executor.memory", '3g') \
      .config("spark.executor.cores", '2') \
      .config("spark.cores.max", '2') \
      .getOrCreate()

Note

A WARN message in the output is expected in some cases, especially if you are using WSL on Windows, so you don’t need to worry if you receive one.

You can also find the code from this chapter in its GitHub repository...

Ingesting Parquet files

Apache Parquet is an open source columnar storage format designed to support fast processing. It is available to any project in the Hadoop ecosystem and can be read in different programming languages.

Due to its compression and speed, it is one of the most used formats when analyzing data at a large volume. The objective of this recipe is to understand how to read a collection of Parquet files using PySpark in a real-world scenario.

Getting ready

For this recipe, we will need SparkSession to be initialized. You can use the code provided at the beginning of this chapter to do so.

The dataset for this recipe will be Yellow Taxi Trip Records from New York. You can download it by accessing the NYC Government website and selecting 2022 | January | Yellow Taxi Trip Records or using this link:

https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet

Feel free to execute the code with a Jupyter notebook...
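The How to do it… steps are not included in this excerpt, but a minimal sketch of reading the downloaded file with the SparkSession created at the beginning of the chapter could look like the following. The local filename is an assumption based on the download link above:

df = spark.read.parquet("yellow_tripdata_2022-01.parquet")

# Inspect the inferred schema and a few rows to confirm the ingestion worked
df.printSchema()
df.show(5)
print(f"Number of rows ingested: {df.count()}")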

Ingesting Avro files

Like Parquet, Apache Avro is a widely used format for storing analytical data. Avro is a leading serialization method for recording data and relies on schemas. It also provides Remote Procedure Calls (RPCs), which make transmitting data easier and help resolve problems such as missing fields, extra fields, and renamed fields.

In this recipe, we will learn how to read an Avro file properly and, later, understand how it works.

Getting ready

This recipe requires a SparkSession with some configurations that differ from the previous Ingesting Parquet files recipe. If you are already running a SparkSession, stop it using the following command:

spark.stop()

We will create another session in the How to do it… section.

The dataset used here can be found at this link: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook/tree/main/Chapter_7/ingesting_avro_files.

Feel free to execute the code in a Jupyter notebook or your PySpark...
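The How to do it… section creates a new session because reading Avro relies on the external spark-avro package, which is not bundled with Spark. The following is only a sketch of what that could look like; the package version and the local path are assumptions and must match your Spark/Scala build and the files downloaded from the repository:

from pyspark.sql import SparkSession

# The spark-avro package version below is an assumption; adjust it to your
# Spark and Scala versions
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("chapter7_ingesting_avro") \
      .config("spark.jars.packages",
              "org.apache.spark:spark-avro_2.12:3.3.0") \
      .getOrCreate()

# Read the Avro files downloaded from the chapter repository; the path is a
# placeholder for wherever you saved them
df = spark.read.format("avro").load("ingesting_avro_files/")
df.printSchema()
df.show(5)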

Applying schemas to analytical data

In the previous chapter, we saw how to apply schemas to structured and unstructured data, but the application of a schema is not limited to raw files.

Even when working with already processed data, there will be cases where we need to cast the values of a column or change column names so the data can be used by another department. In this recipe, we will learn how to apply a schema to Parquet files and how it works.

Getting ready

We will need SparkSession for this recipe. Ensure you have a session that is up and running. We will use the same dataset as in the Ingesting Parquet files recipe.

Feel free to execute the code using a Jupyter notebook or your PySpark shell session.

How to do it…

Here are the steps to perform this recipe:

  1. Looking at our columns: As seen in the Ingesting Parquet files recipe, we can list the columns and their inferred data types. You can see the list as follows:
     VendorID: long
     tpep_pickup_datetime...
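The remaining steps are not shown here, but as a hedged illustration of applying a schema, the following sketch defines an explicit schema for a few of the taxi columns and passes it to the reader. The column names and types are assumptions based on the inferred listing above:

from pyspark.sql.types import (StructType, StructField, LongType,
                               TimestampType, DoubleType)

# Explicit schema for a subset of columns; names and types are assumptions
# based on the inferred listing shown in step 1
schema = StructType([
    StructField("VendorID", LongType(), True),
    StructField("tpep_pickup_datetime", TimestampType(), True),
    StructField("tpep_dropoff_datetime", TimestampType(), True),
    StructField("trip_distance", DoubleType(), True),
])

# Apply the schema while reading the same Parquet file used earlier
df = spark.read.schema(schema).parquet("yellow_tripdata_2022-01.parquet")
df.printSchema()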

Filtering data and handling common issues

Filtering data is the process of excluding or selecting only the information that needs to be used or stored. Even analytical data may need to be re-filtered to meet a specific need. An excellent example is data marts (we will cover them later in this recipe).

This recipe aims to understand how to create and apply filters to our data using a real-world example.

Getting ready

This recipe requires SparkSession, so ensure yours is up and running. You can use the code provided at the beginning of the chapter or create your own.

The dataset used here will be the same as in the Ingesting Parquet files recipe.

To make this exercise more practical, let’s imagine we want to analyze two scenarios: how many trips each vendor made and at what hour of the day there are the most pickups. We will create some aggregations and filter our dataset to carry out those analyses.

How to do it…

Here are the steps to perform this recipe:

...
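Although the steps are cut off in this excerpt, a minimal sketch of the two analyses described in Getting ready could look like this, assuming the taxi data from the Ingesting Parquet files recipe is already loaded as df:

from pyspark.sql import functions as F

# Analysis 1: how many trips each vendor made
trips_per_vendor = df.groupBy("VendorID").count()
trips_per_vendor.show()

# Analysis 2: which hour of the day has the most pickups
pickups_per_hour = (
    df.withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
      .groupBy("pickup_hour")
      .count()
      .orderBy(F.col("count").desc())
)
pickups_per_hour.show(5)

# Example filter: keep only January 2022 trips to discard stray records
january_only = df.filter(
    (F.col("tpep_pickup_datetime") >= "2022-01-01") &
    (F.col("tpep_pickup_datetime") < "2022-02-01")
)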

Ingesting partitioned data

The practice of partitioning data is not recent; it was implemented in databases to distribute data across multiple disks or tables. Data warehouses, for example, can partition data according to the purpose and use of the data they hold. You can read more here: https://www.tutorialspoint.com/dwh/dwh_partitioning_strategy.htm.

In our case, partitioning data is related to how our data will be split into small chunks and processed.

In this recipe, we will learn how to ingest data that is already partitioned and how it can affect the performance of our code.

Getting ready

This recipe requires an initialized SparkSession. You can create your own or use the code provided at the beginning of this chapter.

The data required to complete the steps can be found here: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook/tree/main/Chapter_7/ingesting_partitioned_data.

You can use a Jupyter notebook or a PySpark shell session to execute the...
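The steps themselves are not included in this excerpt; the sketch below illustrates the general idea of reading partitioned data, with a hypothetical directory layout and partition column names assumed for illustration:

# Assume the data is laid out in partition folders such as:
#   partitioned_data/year=2022/month=1/part-*.parquet
#   partitioned_data/year=2022/month=2/part-*.parquet
# Reading the top-level directory lets Spark discover the partition columns
df = spark.read.parquet("partitioned_data/")
df.printSchema()   # year and month appear as columns inferred from the paths

# Filtering on a partition column lets Spark prune partitions and read only
# the matching folders, which is where the performance gain comes from
january = df.filter("year = 2022 AND month = 1")
print(january.count())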

Applying reverse ETL

As the name suggests, reverse ETL takes data from a data warehouse and inserts it into a business application such as HubSpot or Salesforce. The reason behind this is to make data more operational and to use business tools to derive more insights from data that is already in an analysis-ready (analytical) format.

This recipe will teach us how to architect a reverse ETL pipeline and about the commonly used tools.

Getting ready

There are no technical requirements for this recipe. However, you are encouraged to use a whiteboard or a notepad to take notes.

Here, we will work with a scenario where we are ingesting data from an e-learning platform. Imagine we received a request from the marketing department to better understand user actions on the platform using the Salesforce system.

The objective here will be to create a diagram showing the data flow process from a source of data to the Salesforce platform.

How to do it…

To make this...

Selecting analytical data for reverse ETL

Now that we know what reverse ETL is, the next step is to understand which types of analytical data are a good fit to load into a Salesforce application, for example.

This recipe continues from the previous one, Applying reverse ETL, and aims to illustrate a real scenario of deciding what data will be transferred into a Salesforce application.

Getting ready

This recipe has no technical requirements, but you can use a whiteboard or a notepad for annotations.

Still using the example of a scenario where the marketing department requested data to be loaded into their Salesforce account, we will now go a little deeper to see what information is relevant for their analysis.

We received a request from the marketing team to understand the user journey in the e-learning platform. They want to understand which courses are watched most and whether some need improvement. Currently, they don’t know what information we have in...
