
You're reading from Building ETL Pipelines with Python

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781804615256
Edition: 1st Edition
Authors (2):

Brij Kishore Pandey

Brij Kishore Pandey stands as a testament to dedication, innovation, and mastery in the vast domains of software engineering, data engineering, machine learning, and architectural design. His illustrious career, spanning over 14 years, has seen him wear multiple hats, transitioning seamlessly between roles and consistently pushing the boundaries of technological advancement. He has a degree in electrical and electronics engineering. His work history includes the likes of JP Morgan Chase, American Express, 3M Company, Alaska Airlines, and Cigna Healthcare. He is currently working as a principal software engineer at Automatic Data Processing Inc. (ADP). Originally from India, he resides in Parsippany, New Jersey, with his wife and daughter.

Emily Ro Schoof

Emily Ro Schoof is a dedicated data specialist with a global perspective, showcasing her expertise as a data scientist and data engineer on both national and international platforms. Drawing from a background rooted in healthcare and experimental design, she brings a unique perspective to her data analytics roles. Emily's multifaceted career ranges from working with UNICEF to design automated forecasting algorithms that identify conflict anomalies using near-real-time media monitoring, to serving as a subject matter expert for the content and design of General Assembly's Data Engineering course. Her mission is to empower individuals to leverage data for positive impact. Emily holds the strong belief that providing easy access to resources that merge theory and real-world applications is the essential first step in this process.


Tutorial – Creating an ETL Pipeline in AWS

In today’s cloud-based landscape, Amazon Web Services (AWS) offers a suite of tools that allows data engineers to build robust, scalable, and efficient ETL pipelines. In the previous chapter, we introduced some of the most common resources within the AWS platform and set up your local environment for development with AWS tools. This chapter will guide you through the process of leveraging those tools, illustrating how to architect and implement an effective ETL pipeline in the AWS environment. We will walk you through the creation of a deployable ETL pipeline using Python, AWS Lambda functions, and AWS Step Functions. Finally, we’ll create a scalable pipeline using Bonobo, EC2, and RDS. These tools will help all of your data pipelines harness the power of the cloud.

The chapter will cover the following topics:

  • Creating a Python pipeline with AWS Lambda and Step Functions:
    • Setting up the AWS CLI in...

Technical requirements

To effectively utilize the resources and code examples provided in this chapter, ensure that your system meets the following technical requirements:

  • Software requirements:
    • Integrated development environment (IDE): We recommend using PyCharm as the preferred IDE for working with Python, and we might make specific references to PyCharm throughout this chapter. However, you are free to use any Python-compatible IDE of your choice.
    • Jupyter Notebooks should be installed.
    • Python version 3.6 or higher should be installed.
    • Pipenv should be installed for managing dependencies.
  • GitHub repository: The associated code and resources for this chapter can be found in this book’s GitHub repository at https://github.com/PacktPublishing/Building-ETL-Pipelines-with-Python. Fork and clone the repository to your local machine.

Creating a Python pipeline with Amazon S3, Lambda, and Step Functions

In this section, we will create a simple ETL pipeline using AWS Lambda and Step Functions. AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers, while Step Functions provides a way to orchestrate serverless Lambda functions and other AWS services into workflows.
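To make the division of labor concrete, here is a minimal sketch of what an extract step implemented as a Lambda handler might look like. The bucket name, object key, and payload shape are illustrative placeholders rather than the chapter’s actual code; Step Functions would pass the returned dictionary on to the next state in the workflow:

import boto3

# A minimal sketch of an "extract" Lambda handler; the bucket and key are
# hypothetical values supplied by the Step Functions state machine input.
s3 = boto3.client("s3")  # uses the IAM role attached to the Lambda function


def lambda_handler(event, context):
    """Read a raw object from S3 and hand its details to the next state."""
    bucket = event["bucket"]  # e.g., "my-etl-raw-bucket" (placeholder)
    key = event["key"]        # e.g., "sales/2023-09-01.csv" (placeholder)

    response = s3.get_object(Bucket=bucket, Key=key)
    raw_bytes = response["Body"].read()

    # Step Functions passes this dictionary to the next state in the workflow
    return {
        "statusCode": 200,
        "bucket": bucket,
        "key": key,
        "size_bytes": len(raw_bytes),
    }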

Setting the stage with the AWS CLI

Open the chapter_10 directory of this book’s GitHub repository in your local PyCharm environment. Within the PyCharm terminal, run the following command to configure the AWS CLI:

(Project) usr@project % aws configure

You will then be prompted to enter your access key ID, secret access key, default region name, and default output format. Log in to the AWS Management Console in your web browser to retrieve the following credentials:

AWS Access Key ID [None]: <YOUR ACCESS KEY ID HERE>
AWS Secret Access Key [None]: <YOUR SECRET...
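Once you have supplied all four values, the AWS CLI stores them in your local credentials file. As an optional sanity check (a sketch, not part of the chapter’s code), you can confirm that boto3 picks up the same credentials by asking AWS STS which identity they resolve to:

import boto3

# STS reuses the credentials written by `aws configure`
sts = boto3.client("sts")
identity = sts.get_caller_identity()

print(identity["Account"])  # your 12-digit AWS account ID
print(identity["Arn"])      # the IAM user or role your CLI credentials map to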

An introduction to a scalable ETL pipeline using Bonobo, EC2, and RDS

Extract, Transform, and Load (ETL) pipelines play a crucial role in data processing, enabling organizations to move data from multiple sources, process it, and load it into a data warehouse or other target system for analysis. However, as data volumes grow, so does the need for scalable ETL pipelines that can handle large amounts of data efficiently.

Amazon EC2 is a cloud service that provides virtual computing resources on demand, offering a scalable and reliable platform for running various types of applications, including web servers, databases, and machine learning models. Amazon RDS is a fully managed relational database service in the cloud, providing a scalable and reliable platform for running large database workloads.

When combined with an ETL-specific Python module such as Bonobo, Amazon EC2 and RDS can be leveraged to create an easily scalable data pipeline. This approach enables...
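To give a feel for how these pieces fit together, the following is a minimal sketch of a Bonobo graph whose load step writes into a PostgreSQL database hosted on RDS. The connection string, table name, and records are hypothetical placeholders rather than values from this book’s repository, and in practice a script like this would run on an EC2 instance:

import bonobo
import sqlalchemy

# Hypothetical RDS connection string; in practice, read this from configuration
ENGINE = sqlalchemy.create_engine(
    "postgresql+psycopg2://etl_user:etl_password@my-db.abc123.us-east-1.rds.amazonaws.com:5432/etl_db"
)


def extract():
    """Yield raw records; a real pipeline might read these from S3 or an API."""
    yield {"order_id": 1, "amount": 125.00}
    yield {"order_id": 2, "amount": 80.50}


def transform(row):
    """Apply a simple transformation to each record."""
    row["amount_with_tax"] = round(row["amount"] * 1.07, 2)
    yield row


def load(row):
    """Insert each transformed record into the (hypothetical) RDS table."""
    with ENGINE.begin() as conn:
        conn.execute(
            sqlalchemy.text(
                "INSERT INTO orders (order_id, amount, amount_with_tax) "
                "VALUES (:order_id, :amount, :amount_with_tax)"
            ),
            row,
        )


graph = bonobo.Graph(extract, transform, load)

if __name__ == "__main__":
    bonobo.run(graph)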

Summary

In this chapter, we’ve taken a comprehensive journey through creating both simple and scalable ETL pipelines in AWS, using the AWS CLI for local development and the AWS Management Console for a GUI-based workflow. At first pass, the AWS environment can be a bit overwhelming, but with just a little practice, you’ll start to feel a familiar flow between Amazon resources.

In Chapter 11, we will introduce the use of CI/CD pipelines, specifically tailored for ETL processes. We’ll discuss the significance of CI/CD pipelines in automating code deployment and enhancing efficiency, reliability, and speed in your ETL process. You’ll learn about AWS CodePipeline, AWS CodeDeploy, and AWS CodeCommit and how these services work together to create a robust and automated CI/CD pipeline for your ETL jobs.
