You're reading from  Building ETL Pipelines with Python

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781804615256
Edition: 1st Edition
Authors (2):
Brij Kishore Pandey

Brij Kishore Pandey stands as a testament to dedication, innovation, and mastery in the vast domains of software engineering, data engineering, machine learning, and architectural design. His illustrious career, spanning over 14 years, has seen him wear multiple hats, transitioning seamlessly between roles and consistently pushing the boundaries of technological advancement. He has a degree in electrical and electronics engineering. His work history includes the likes of JP Morgan Chase, American Express, 3M Company, Alaska Airlines, and Cigna Healthcare. He is currently working as a principal software engineer at Automatic Data Processing Inc. (ADP). Originally from India, he resides in Parsippany, New Jersey, with his wife and daughter.

Emily Ro Schoof

Emily Ro Schoof is a dedicated data specialist with a global perspective, showcasing her expertise as a data scientist and data engineer on both national and international platforms. Drawing from a background rooted in healthcare and experimental design, she brings a unique perspective of expertise to her data analytic roles. Emily's multifaceted career ranges from working with UNICEF to design automated forecasting algorithms to identify conflict anomalies using near real-time media monitoring to serving as a subject matter expert for General Assembly's Data Engineering course content and design. Her mission is to empower individuals to leverage data for positive impact. Emily holds the strong belief that providing easy access to resources that merge theory and real-world applications is the essential first step in this process.

Data Cleansing and Transformation

The success of a data pipeline is measured by how well it transforms the input data into the attributes required of the output data. The quality of the transformation stage is what separates a toy pipeline from a robust, impactful enterprise pipeline. Accurate and efficient transformations come from constructing each task methodically.

In this chapter, we will explore various data transformation techniques in Python, and how these techniques can be used to massage data into the desired format. You will walk away from this chapter with a firm basis in the following areas of data manipulation:

  • Data cleansing and transformation
  • The importance of accuracy and consistency
  • Data cleansing with Python
  • Workflow for data transformation
  • Creating a data transformation activity in Python

As this book is geared toward creating data pipelines, we will be covering...

Technical requirements

To effectively utilize the resources and code examples provided in this chapter, ensure that your system meets the following technical requirements:

  • Software requirements:
    • Integrated development environment (IDE): We recommend using PyCharm as the preferred IDE for working with Python, and we might make specific references to PyCharm throughout this chapter. However, you are free to use any Python-compatible IDE of your choice.
    • Jupyter Notebooks should be installed.
    • Python version 3.6 or higher should be installed.
    • Pipenv should be installed for managing dependencies.
  • GitHub repository:

    The associated code and resources for this chapter can be found in the following GitHub repository: https://github.com/PacktPublishing/Building-ETL-Pipelines-with-Python. We recommend that you fork and clone the repository to your local machine.

Exploring data cleansing and transformation

The extraction process is needed to select data that is significant in supporting...

Strategies for data cleansing and transformation in Python

Python’s rich ecosystem of data-centric libraries, such as pandas and NumPy, makes it straightforward to detect and correct inconsistencies, errors, and missing values, which improves data integrity and reliability. During transformation, data is reshaped, normalized, or aggregated to suit specific needs. Python’s flexibility supports complex operations such as merging datasets, grouping data, and creating pivot tables, which are often necessary for advanced analytics or machine learning models.
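To make these operations concrete, here is a minimal sketch using pandas. The sample records, the column names (region, product, units, manager), and the cleanse() helper are hypothetical and exist only for illustration. Filling missing unit counts with 0 is an assumption as well; the right rule depends on your data.

```python
import pandas as pd

# Hypothetical raw records with typical quality problems: inconsistent
# casing and whitespace, a duplicated row, and missing values.
raw = pd.DataFrame(
    {
        "region": ["north", "North ", "south", "south", None],
        "product": ["a", "a", "b", "b", "a"],
        "units": [10, 10, None, 5, 3],
    }
)


def cleanse(df):
    """Return a cleaned copy of df (illustrative rules, not a standard)."""
    out = df.assign(
        # Normalize text so "North " and "north" compare as equal.
        region=df["region"].str.strip().str.lower(),
        # Assumption: a missing unit count is treated as 0.
        units=df["units"].fillna(0).astype(int),
    )
    # Drop rows that cannot be attributed to a region, then exact duplicates.
    return out.dropna(subset=["region"]).drop_duplicates().reset_index(drop=True)


clean = cleanse(raw)

# Merging: enrich each row from a (hypothetical) lookup table.
regions = pd.DataFrame({"region": ["north", "south"], "manager": ["Ana", "Bo"]})
merged = clean.merge(regions, on="region", how="left")

# Grouping: total units per region.
totals = clean.groupby("region")["units"].sum()

# Pivot table: regions as rows, products as columns, summed units as values.
pivot = clean.pivot_table(
    index="region", columns="product", values="units",
    aggfunc="sum", fill_value=0,
)
```

Keeping the cleansing rules in one function that returns a new DataFrame leaves the raw input untouched, which makes each rule easier to test and rerun.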

Preliminary tasks – the importance of staging data

The extracted data is sent to a temporary storage area called the data staging area prior to the transformation and cleansing process. This is done to avoid the need to extract data again, should any problem occur (reference: A Five-Layered Business Intelligence Architecture by In Ong et al.).
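The staging idea can be sketched with the standard library alone. In this hypothetical example, a temporary directory stands in for the data staging area, and the function names (stage_records, load_staged, transform) and the JSON Lines file format are assumptions chosen for illustration, not the book's implementation.

```python
import json
import tempfile
from pathlib import Path


def stage_records(records, staging_dir, name):
    """Write extracted records to a JSON Lines file in the staging area."""
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    path = staging_dir / f"{name}.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return path


def load_staged(path):
    """Read staged records back without touching the original source."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def transform(records):
    """Hypothetical transformation: cast the amount field to float."""
    return [{**r, "amount": float(r["amount"])} for r in records]


# Pretend these rows came from a costly extraction step.
extracted = [{"id": 1, "amount": "19.90"}, {"id": 2, "amount": "5"}]

with tempfile.TemporaryDirectory() as tmp:
    staged_path = stage_records(extracted, tmp, "orders")
    # If a transformation run fails, it can be repeated from the staged
    # file rather than by extracting from the source again.
    first_attempt = transform(load_staged(staged_path))
    second_attempt = transform(load_staged(staged_path))
```

Both attempts read from the same staged file, so a failed or revised transformation never requires a second extraction.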

Step 1 – data discovery and interpretation...

Summary

The data cleansing and transformation steps within a data pipeline are fundamental processes that are central to preparing high-quality output datasets. Creating a systematic approach to identifying and rectifying inconsistencies, inaccuracies, and missing values enhances data integrity and reliability while refining and tailoring the data to match the specific needs of your end user. Your output data can then be confidently used for any data-driven decision-making, analysis, and machine learning.

As data continues to grow in size and complexity, mastering data cleansing and transformation techniques becomes increasingly crucial, enabling data-driven organizations to uncover hidden insights and streamline operations. It is a ubiquitous and valuable skill in today’s data-dependent world.

In the next chapter, we will discuss how to load transformed data into tables.
