Loading Transformed Data

After data has undergone processing and transformation within an ETL pipeline, the final step involves transferring it to its designated final location. The type of output location that’s used is determined by both the data’s specific utility and the tools available within your environment.

Output data is most commonly stored in structured, relational databases, which offer an easily accessible format conducive to both analytical exploration and future modification.

In this chapter, we’ll get more acquainted with the “L” of ETL pipelines and the best practices for designing efficient load operations. We will discuss how to optimize load activities to fit the output data using either incremental or full data loading activities. Finally, we’ll provide a comprehensive walkthrough for creating a relational database management system (RDBMS) on your local device that you can utilize...

Technical requirements

To effectively utilize the resources and code examples provided in this chapter, ensure that your system meets the following technical requirements:

  • Software requirements:
    • Integrated development environment (IDE): We recommend using PyCharm as the preferred IDE for working with Python, and we might make specific references to PyCharm throughout this chapter. However, you are free to use any Python-compatible IDE of your choice.
    • Jupyter Notebooks should be installed.
    • Python version 3.6 or higher should be installed.
    • Pipenv should be installed to manage dependencies.
  • GitHub repository: The associated code and resources for this chapter can be found in this book’s GitHub repository at https://github.com/PacktPublishing/Building-ETL-Pipelines-with-Python. Fork and clone the repository to your local machine.

Introduction to data loading

Data loading is the final step of the ETL process, and arguably the entire purpose of the data pipeline. The loading phase of your ETL process requires careful preparation to ensure the transformed data is seamlessly transitioned to its destination.

Choosing the load destination

The choice of destination greatly impacts data accessibility, storage, querying capabilities, and overall system performance. Depending on the nature of your project, you might be loading data into relational databases, cloud-based data warehouses, NoSQL stores, or other repositories. Understanding the target system’s requirements and capabilities is a foundational step to designing an efficient loading strategy. Consider factors such as data types, indexing, partitioning, and data distribution.

Python’s adaptability ensures that, regardless of your destination choice, you have the tools at hand to integrate, manipulate, and optimize data loading processes...
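
For instance, here is a minimal sketch of that flexibility (not taken from the book’s code). It assumes pandas, SQLAlchemy, and a PostgreSQL driver such as psycopg2 are installed, and the connection strings, database, and table names are hypothetical placeholders; the same transformed dataset is written to two different destinations simply by swapping the connection string:

import pandas as pd
from sqlalchemy import create_engine

# A small, hypothetical "transformed" dataset standing in for real pipeline output
transformed_df = pd.DataFrame(
    {"customer_id": [1, 2, 3], "total_spend": [120.50, 75.00, 310.25]}
)

# Destination 1: a local PostgreSQL database (assumed credentials and database name)
pg_engine = create_engine("postgresql://postgres:your_password@localhost:5432/etl_demo")
transformed_df.to_sql("customer_spend", pg_engine, if_exists="append", index=False)

# Destination 2: a file-based SQLite database, handy for local testing
sqlite_engine = create_engine("sqlite:///etl_demo.db")
transformed_df.to_sql("customer_spend", sqlite_engine, if_exists="append", index=False)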

Best practices for data loading

There isn’t one universally definitive approach to creating data pipeline loading activities, but some methods are more effective than others. Proper preparation and adherence to best practices empower you to navigate the data loading phase with confidence, optimizing efficiency, accuracy, and reliability in your ETL workflow.

The process of designing a data loading activity reflects the level of understanding you have of the full environmental conditions of your system. You can use the following three principles to design a data loading workflow that is both scalable and reusable:

  • Utilizing techniques such as bulk loading, parallel processing, and optimized SQL queries can significantly enhance loading performance for large datasets. By adopting scalable strategies, you ensure that your data loading solution remains efficient and responsive even as data volume increases (see the bulk-loading sketch after this list).
  • Automation streamlines the loading process, reducing the...
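
As a quick illustration of the bulk-loading principle, the following minimal sketch (not from the book; the products table and its data are hypothetical) uses the sqlite3 module, which also powers this chapter’s demo, to load 10,000 rows in a single call rather than issuing one INSERT statement per row:

import sqlite3

# Hypothetical transformed rows waiting to be loaded
rows = [(i, f"product_{i}", round(i * 1.5, 2)) for i in range(10_000)]

conn = sqlite3.connect("bulk_demo.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
)

# A single bulk call runs inside one transaction and avoids 10,000 separate round trips
conn.executemany("INSERT INTO products (id, name, price) VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()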

Optimizing data loading activities by controlling the data import method

From the aforementioned list, this chapter specifically focuses on utilizing different loading strategies to design efficient loading activities. Later in this book, we will dive deeper into Python automation modules and cloud resources (Chapters 8 and 9), as well as monitoring ETL pipelines (Chapter 14). But for now, let’s focus on the two most common methods for data loads: full and incremental.

Creating demo data

We will utilize Python’s built-in sqlite3 module to walk through a demo of how full and incremental data loads can be performed. From the chapter_06/ directory in your PyCharm environment, open the Loading_Transformed_Data.ipynb file by launching Jupyter from the PyCharm terminal with the following command:

(venv) (base) usr@usr-MBP chapter_06  % jupyter notebook

Verify that the following code is in your Jupyter notebook:

# import modules
import sqlite3

# demo...
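
The notebook’s demo code continues beyond what is shown here. As a rough, hypothetical stand-in rather than the book’s actual notebook, the following sketch creates a small demo table with sqlite3 and contrasts the two load styles: a full load that replaces the table’s contents, and an incremental load that appends only records that are not already present:

import sqlite3

conn = sqlite3.connect("load_demo.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS sales (sale_id INTEGER PRIMARY KEY, amount REAL)"
)

# Full load: wipe the target table and reload the complete dataset
full_batch = [(1, 100.0), (2, 250.0), (3, 80.0)]
conn.execute("DELETE FROM sales")
conn.executemany("INSERT INTO sales (sale_id, amount) VALUES (?, ?)", full_batch)
conn.commit()

# Incremental load: append only records whose keys are not already loaded
new_batch = [(3, 80.0), (4, 410.0)]  # sale_id 3 already exists in the table
conn.executemany("INSERT OR IGNORE INTO sales (sale_id, amount) VALUES (?, ?)", new_batch)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # 4 rows in total
conn.close()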

Precautions to consider

Back in Chapter 2, we noted that data pipelines serve a wide range of purposes, from daily updates for business analytics dashboards to cyclical long-term storage. Since many organizations make decisions based on the resulting output data, not only is the accuracy of data transformations crucial, but the resulting format and quality of the data loaded need to remain cohesive with the data that already exists within the target location.

In a clean, reproducible, and scalable data ecosystem, the target data output location maintains its own, arguably authoritative, structure that serves as the ground truth for business data within your company. This requires you to rigorously manage the ongoing ETL processes that keep the storage environment up to date. When discussing the differences between full and incremental data loads in the previous section, it became clear that there is a need to distinguish between new, freshly curated data and...
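
One common safeguard, sketched below with a hypothetical events table and column names (this is an illustrative pattern, not the book’s code), is to track a high-water mark in the target location and load only records newer than it, so that re-running the pipeline cannot silently duplicate data that has already been loaded:

import sqlite3

conn = sqlite3.connect("events_demo.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events (event_id INTEGER PRIMARY KEY, loaded_at TEXT)"
)

# Hypothetical incoming batch from the transform step: (event_id, loaded_at) pairs
incoming = [(1, "2023-09-01"), (2, "2023-09-02"), (3, "2023-09-03")]

# High-water mark: the most recent timestamp already present in the target table
last_loaded = conn.execute("SELECT MAX(loaded_at) FROM events").fetchone()[0]

# Keep only records strictly newer than what the target already holds
fresh = [row for row in incoming if last_loaded is None or row[1] > last_loaded]

conn.executemany("INSERT INTO events (event_id, loaded_at) VALUES (?, ?)", fresh)
conn.commit()
conn.close()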

Tutorial – preparing your local environment for data loading activities

In this section, we will use a local PostgreSQL database as the data load destination; we will do the same for many of the examples throughout this book. PostgreSQL is a free and open source RDBMS with strong SQL compliance. You can learn more about PostgreSQL at https://en.wikipedia.org/wiki/PostgreSQL.

Downloading and installing PostgreSQL

Depending on your device, select the respective download link in your browser (https://www.enterprisedb.com/downloads/postgres-postgresql-downloads) and follow these steps:

  1. Download the PostgreSQL installer for your operating system.
  2. Open the downloaded .dmg file and double-click the installer package to begin the installation.
  3. Once the files are downloaded, the EDB PostgreSQL Setup Wizard will begin. Make sure you choose the following criteria during your installation:

    • Installation Directory: /Library/PostgreSQL/15
    • Port: 5432
    • Select Components:
      • PostgreSQL Server
      • pgAdmin 4
      • Stack Builder
      • Command Line Tools...
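
Once the setup wizard finishes, you can sanity-check the local server from Python. The snippet below is a hypothetical verification step rather than part of the book’s tutorial; it assumes the psycopg2 (or psycopg2-binary) package is installed in your environment and that you substitute the superuser password you chose during installation:

import psycopg2

# Connect to the default postgres database on the port selected during installation
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="your_password_here",  # replace with the password set in the wizard
)

with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # prints the running PostgreSQL server version

conn.close()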

Summary

In this chapter, we introduced the premise of ETL pipeline data loading activities, as well as some of the contingencies of designing these activities correctly. We walked through the essential steps of setting up data storage destinations and structuring schemas in anticipation of the resultant data from our pipeline. We also introduced a blend of Python capabilities to fully or incrementally load data using SQLite. Lastly, we set up our local environment with PostgreSQL, which we will use as our data loading output location for the remainder of this book. In the next chapter, we will guide you through the entire process of creating a fully operational data pipeline.