
You're reading from Building ETL Pipelines with Python

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781804615256
Edition: 1st Edition
Authors (2):

Brij Kishore Pandey

Brij Kishore Pandey stands as a testament to dedication, innovation, and mastery in the vast domains of software engineering, data engineering, machine learning, and architectural design. His illustrious career, spanning over 14 years, has seen him wear multiple hats, transitioning seamlessly between roles and consistently pushing the boundaries of technological advancement. He has a degree in electrical and electronics engineering. His work history includes the likes of JP Morgan Chase, American Express, 3M Company, Alaska Airlines, and Cigna Healthcare. He is currently working as a principal software engineer at Automatic Data Processing Inc. (ADP). Originally from India, he resides in Parsippany, New Jersey, with his wife and daughter.

Emily Ro Schoof

Emily Ro Schoof is a dedicated data specialist with a global perspective, showcasing her expertise as a data scientist and data engineer on both national and international platforms. Drawing from a background rooted in healthcare and experimental design, she brings a unique perspective of expertise to her data analytic roles. Emily's multifaceted career ranges from working with UNICEF to design automated forecasting algorithms to identify conflict anomalies using near real-time media monitoring to serving as a subject matter expert for General Assembly's Data Engineering course content and design. Her mission is to empower individuals to leverage data for positive impact. Emily holds the strong belief that providing easy access to resources that merge theory and real-world applications is the essential first step in this process.


Powerful ETL Libraries and Tools in Python

Up to this point in the book, we have covered the fundamentals of building data pipelines. We’ve introduced some of Python’s most common modules, which can be used to build rudimentary data pipelines. While this is a great place to start, these approaches are far from production-ready, and there is ample room for improvement. There are several powerful, ETL-specific Python libraries and pipeline management platforms that we can use to our advantage to build more durable, scalable, and resilient data pipelines suitable for real-world deployment scenarios.

We will divide this chapter into two parts. We start by introducing six of Python’s most popular ETL pipeline libraries. We will use the same “seed” ETL activities with each library, walking through how each of the following resources can be used to create an organized, reusable ETL pipeline:

  • Part 1 – ETL...

Technical requirements

To effectively utilize the resources and code examples provided in this chapter, ensure that your system meets the following technical requirements:

  • Software requirements:
    • Integrated Development Environment (IDE): We recommend using PyCharm as the preferred IDE for working with Python, and we might make specific references to PyCharm throughout this chapter. However, you are free to use any Python-compatible IDE of your choice.
    • Jupyter Notebooks should be installed.
    • Python version 3.6 or higher should be installed.
    • Pipenv should be installed to manage dependencies.
  • GitHub repository:

    The associated code and resources for this chapter can be found in the GitHub repository at https://github.com/PacktPublishing/Building-ETL-Pipelines-with-Python. Fork and clone the repository to your local machine.

Architecture of Python files

In the chapter_08 directory in the GitHub repository, make sure the following files are within your directory:

├── Powerful_ETL_Tools_In_Python.ipynb
├── data
│   ├── traffic_crash_people.csv
│   ├── traffic_crash_vehicle.csv
│   └── traffic_crashes.csv
├── etl
│   ├── __init__.py
│   ├── extract.py
│   ├── transform.py
│   └── load.py
├── tools
│   ├── __init__.py
│   ├── 01_bonobo_pipeline.py
│   ├── 02_odo_pipeline.py
│   ...

Configuring your local environment

There are many ways to set up configuration files in your ETL project repositories. We will utilize two different forms, a config.ini file and a config.yaml file. Both work equally well, but we will use the config.yaml version more frequently. This is more of a “dealer’s choice” situation than anything else.

config.ini

Open the config.ini file and replace username and password with the credentials for your local PostgreSQL environment:

[postgresql]
host = localhost
port = 5432
database = chicago_dmv
user = postgres
password = password

To import the config.ini file in the chapter_08/ directory to your p0#_<module>_pipeline.py file, we will use the following syntax:

# Read the configuration file
import configparser
config = configparser.ConfigParser()
config.read('config.ini')
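Once `config.read()` has parsed the file, sections behave like dictionaries. Here is a self-contained sketch of reading the same `[postgresql]` section; `read_string` stands in for `read('config.ini')` so the snippet runs even without the file present:

```python
import configparser

config = configparser.ConfigParser()
# The chapter scripts call config.read('config.ini'); read_string is used
# here only so the sketch is self-contained
config.read_string("""
[postgresql]
host = localhost
port = 5432
database = chicago_dmv
user = postgres
password = password
""")

# Sections behave like dictionaries once parsed
host = config['postgresql']['host']
# getint() converts the stored string to an integer for you
port = config.getint('postgresql', 'port')
```

These values can then be passed straight into a database connector such as `psycopg2.connect()`.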

config.yaml

Open the config.yaml file and perform the same task to replace username and password with the credentials for...
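For the YAML variant, PyYAML’s `safe_load` returns nested dictionaries. A minimal sketch follows, assuming `config.yaml` mirrors the `config.ini` structure (the nested keys shown are an assumption, not necessarily the repository’s exact file):

```python
import yaml  # PyYAML: pipenv install pyyaml

# Hypothetical config.yaml contents mirroring the config.ini section
yaml_text = """
postgresql:
  host: localhost
  port: 5432
  database: chicago_dmv
  user: postgres
  password: password
"""

config = yaml.safe_load(yaml_text)
# In the chapter scripts you would load the file instead:
#   with open('config.yaml') as f:
#       config = yaml.safe_load(f)
host = config['postgresql']['host']
```

Note that unlike `configparser`, YAML preserves types, so `port` comes back as an `int` without any conversion.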

Part 1 – ETL tools in Python

In your local environment, open the Powerful_ETL_Tools_In_Python.ipynb file by running jupyter notebook from your PyCharm terminal.

Bonobo

Bonobo (https://www.bonobo-project.org/) is a Python-based Extract, Transform, Load (ETL) framework that takes a simple and rather elegant approach to pipeline construction. Bonobo treats any Python callable (i.e., function) or iterable object as a node, which it then organizes into graphs and executes. Bonobo makes it incredibly easy to build, test, and deploy pipelines, letting you focus on the business logic of your pipeline rather than the underlying infrastructure.
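To make the node model concrete, here is a plain-Python sketch of the extract → transform → load flow, using hypothetical row data. With Bonobo itself, you would wire the same callables together with `graph = bonobo.Graph(extract, transform, load)` and execute them via `bonobo.run(graph)`:

```python
# Each node is just a callable: sources yield rows, mid-graph nodes receive
# a row and yield results downstream, and sinks perform the final side effect
results = []

def extract():
    # Source node: yields raw rows (hypothetical crash records)
    yield from ({'crash_id': i} for i in range(3))

def transform(row):
    # Transformation node: enriches each incoming row
    yield {**row, 'processed': True}

def load(row):
    # Load node: here we collect rows instead of writing to a database
    results.append(row)

# Plain-Python stand-in for what bonobo.run(graph) does for you
for row in extract():
    for out in transform(row):
        load(out)
```

The point of the framework is that it handles this plumbing (plus parallel execution and monitoring) so that you only write the node functions.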

Figure 8.2: Bonobo is the Swiss Army knife for everyday data

Installing and using Bonobo in your environment

In your PyCharm terminal, install Bonobo using your pipenv environment with the following command:

pipenv install bonobo

Head...

Part 2 – pipeline workflow management platforms in Python

All the Python modules we’ve introduced so far are valuable tools for improving the efficiency and speed of Python data pipelines, but they are not a one-size-fits-all solution. As your data requirements expand, you will inevitably face the challenge of accommodating increasing capacity.

Pipeline workflow management platforms streamline and automate data pipeline deployments, and are particularly useful in scenarios where multiple tasks need to be executed in a specific order or in parallel, and where data needs to be transformed and passed between asynchronous stages of a given pipeline. There are a number of pipeline workflow management platforms available for Python. Here are some of the most popular ones:

  • Apache Airflow: A platform to programmatically author, schedule, and monitor workflows
  • Apache Nifi: An easy-to-use...
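The core idea these platforms share, running tasks in dependency order, can be sketched with the standard library’s `graphlib` (Python 3.9+). The task names and bodies here are hypothetical; Airflow expresses the same structure declaratively with DAG and operator objects:

```python
from graphlib import TopologicalSorter

def extract():
    return 'raw rows'

def transform():
    return 'clean rows'

def load():
    return 'loaded'

tasks = {'extract': extract, 'transform': transform, 'load': load}
# Each entry reads "task: set of upstream tasks it depends on"
dependencies = {'transform': {'extract'}, 'load': {'transform'}}

# static_order() yields a valid execution order for the dependency graph
order = list(TopologicalSorter(dependencies).static_order())
run_log = [tasks[name]() for name in order]
```

On top of this ordering, the platforms add scheduling, retries, parallelism, and monitoring, which is precisely what plain Python scripts lack.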

Summary

In this chapter, we went through a series of examples to demonstrate how to build robust ETL pipelines in Python using various frameworks and libraries.

By understanding the different frameworks and libraries available for building ETL pipelines in Python, data engineers and analysts can make informed decisions about how to optimize their workflows for efficiency, reliability, and maintainability. With the right tools and practices, ETL can be a powerful and streamlined process that enables organizations to leverage the full potential of their data assets.

In the next chapter, we will continue to dig deeper into creating robust data pipelines using external resources. More specifically, we will introduce the AWS ecosystem and demonstrate how you can leverage AWS to create exceptional, scalable, and cloud-based ETL pipelines.
