You're reading from  Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st Edition
Authors (3):
Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.
Read more about Navnit Shukla

Sankar M

Sankar Sundaram has been working in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, helping them build secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.
Read more about Sankar M

Sampat Palani

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup co-founder, and IT leader. He holds a master's degree in Business Administration with a dual specialization in Information Technology. His professional career spans five countries across the financial services, management consulting, and technology industries. He is currently a Senior Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.
Read more about Sampat Palani


Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas

In the previous chapters, we learned about the data-wrangling process and how to utilize different services for data-wrangling activities within the AWS ecosystem:

  • We explored AWS Glue DataBrew, which helps users create a data-wrangling pipeline through a GUI-based approach suited to every type of user.
  • We also went through SageMaker Data Wrangler, which likewise offers a GUI-based approach to building a data-wrangling pipeline, but is more closely aligned with machine learning workloads and integrates tightly with the SageMaker service.
  • We also explored AWS SDK for Pandas, aka awswrangler, which offers a hands-on coding approach to data wrangling by integrating the pandas library with the AWS ecosystem. It suits users who are comfortable with Python programming and the pandas library and its capabilities.
  • We also went through different AWS services such as Amazon S3...

A solution walkthrough for sportstickets.com

We will walk through a fictional example, sportstickets.com, a sports-ticketing franchise. This company manages different sporting events and sells tickets for them at a discounted rate. The business analysts at sportstickets.com want to set up an end-to-end data-wrangling pipeline for performing ticket sales analysis on the data.

We will explore the different phases of the data-wrangling pipeline and explain how the Pandas library will help in performing those operations in an effective and performant manner.

Figure 9.1: Different phases of the data-wrangling pipeline

Prerequisites for data ingestion

To perform data-wrangling activities for the preceding use case, we first need to ingest data into a data lake. To ingest data from on-premises databases into a cloud environment, we have the following options:

  1. Extract data programmatically using SQL queries from...
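As a minimal sketch of the first option, the snippet below extracts data programmatically with a SQL query. It uses an in-memory SQLite database with a hypothetical ticket_sales table as a stand-in for the on-premises source; the commented awswrangler call shows where the extracted frame could land in the data lake:

```python
import sqlite3
import pandas as pd

# Stand-in for an on-premises database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticket_sales (sale_id INTEGER, event TEXT, price REAL);
    INSERT INTO ticket_sales VALUES (1, 'MLB', 45.0), (2, 'NFL', 120.0);
""")

# Extract programmatically with a SQL query into a pandas DataFrame.
df = pd.read_sql_query("SELECT * FROM ticket_sales", conn)

# In a real pipeline, you would land this in the data lake, for example:
#   import awswrangler as wr
#   wr.s3.to_parquet(df, path="s3://my-data-lake/raw/ticket_sales/", dataset=True)
print(df.shape)  # (2, 3)
```

The same pattern applies to any source the database driver can reach; only the connection object changes.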

Data discovery

Data discovery is an important phase in the wrangling pipeline, as it helps users understand the data and guides how the next steps should be performed. For example, if a user looks at the data and determines that certain columns have missing values, the data cleansing step should fix those values, and any missing columns can be added by joining the data with other data sources or deriving them from the raw data. Essentially, this step gives users an idea of the completeness, usefulness, and relevance of the dataset.

There are multiple ways to perform data discovery, including downloading small files to a local machine and exploring them in Excel. Here, we will look at ways to explore raw data stored in a data lake. Some of the common steps performed during the data discovery phase are as follows:

  • Identifying the source data structure/format and its associated properties
  • Visualizing the data distribution on the dataset
  • Validating...
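A few of these discovery steps can be sketched with plain pandas; the ticket sample below is hypothetical, not part of the book's dataset:

```python
import pandas as pd

# Hypothetical sample of raw ticket data for discovery.
df = pd.DataFrame({
    "event": ["MLB", "NFL", "MLB", None],
    "price": [45.0, 120.0, 50.0, 45.0],
})

# Identify structure/format: column names and inferred dtypes.
print(df.dtypes)

# Completeness: count missing values per column.
missing = df.isna().sum()
print(missing)  # the event column has 1 missing value

# Distribution: summary statistics and category frequencies.
print(df["price"].describe())
print(df["event"].value_counts())
```

On a real data lake, the same calls apply once the data is loaded, for example with awswrangler's S3 readers.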

Data structuring

Now we come to the part where we need to restructure the data from its raw format into a usable one. In our use case, we extracted data in JSON format, and that raw format served us well for exploratory analysis. As we move further into the data-wrangling pipeline, however, different file formats and structures become more efficient.

Different file formats and when to use them

There are different file formats that are commonly used in data pipelines:

  • Readable file formats: CSV, JSON, and Extensible Markup Language (XML) are some file formats that are readable by human users:
    • CSV files are used mostly in the data extraction phase, when the data needs to be shared with analysts for reading and further action. The advantage is that you don’t need any programming language to read these files; they can be opened in most commonly available text editors. These file formats were widely popular in the data analytics community earlier...
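As a small illustration of these readable formats, the sketch below round-trips a hypothetical two-row frame through CSV and JSON with pandas; both stay human-readable and lossless for a flat schema like this:

```python
import io
import pandas as pd

df = pd.DataFrame({"event": ["MLB", "NFL"], "price": [45.0, 120.0]})

# CSV: human-readable, opens in any text editor or spreadsheet.
csv_text = df.to_csv(index=False)
print(csv_text)

# JSON: human-readable and can preserve nesting; common for raw extracts.
json_text = df.to_json(orient="records")
print(json_text)

# Both round-trip back into pandas without loss for this flat schema.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))
```

For large analytical workloads, columnar formats such as Parquet usually replace these readable formats further down the pipeline.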

Data cleaning

Data cleaning is an important step in the process of data wrangling. A good amount of time is spent on identifying the right data source and cleaning the data. Pandas provides a lot of functionality for cleaning your data.

The exact activities required during this phase differ for each dataset. Some data sources contain data that requires only minimal cleaning, while others might require extensive cleaning before the dataset can be used in your project. You can also use the output of the data exploration activities to gauge the level of cleaning to be performed on the data.

Data cleansing with Pandas

To demonstrate the data cleaning steps, we will use the seat_type table from our database. This table holds only a small volume of data, so we will insert some rows before we proceed with data cleansing.

The data in seat_type looks like the screenshot here. It has three columns for the...
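To sketch the kind of cleansing applied here, the snippet below builds a hypothetical seat_type frame (the real table's columns may differ) with a duplicate row and a missing value, then removes both with pandas:

```python
import pandas as pd

# Hypothetical seat_type rows; the actual table's schema may differ.
seat_type = pd.DataFrame({
    "seat_type_id": [1, 2, 2, 3],
    "name": ["luxury", "standard", "standard", None],
    "description": ["Box seats", "Regular", "Regular", "Obstructed view"],
})

# Drop exact duplicate rows.
cleaned = seat_type.drop_duplicates()

# Fill missing names with a placeholder so downstream joins do not drop rows.
cleaned = cleaned.fillna({"name": "unknown"})

print(cleaned)
```

Whether to fill, drop, or flag missing values depends on how the column is used downstream; a placeholder is just one option.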

Data enrichment

We have learned how to extract data from various sources, put it into a common or desired structure, and cleanse data issues. However, the data is still siloed by individual source, and it would be valuable to combine data from multiple sources to enrich it with additional information.

Let us consider a use case where we will publish a schedule of football games for a specific team. Before we do that, we must understand our dataset in order to join the relevant tables. We will first print the list of tables available in our source database.

Figure 9.49: Printing all tables from the source database for a sporting ticket dataset

We have a list of tables, and the following diagram will illustrate the relationship across different tables in our dataset:
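The kind of enrichment described here can be sketched as a pandas join; the games and teams frames below are hypothetical stand-ins for tables in the sporting ticket schema:

```python
import pandas as pd

# Hypothetical tables standing in for the sporting ticket schema.
games = pd.DataFrame({
    "game_id": [101, 102],
    "home_team_id": [1, 2],
    "game_date": ["2023-09-01", "2023-09-08"],
})
teams = pd.DataFrame({
    "team_id": [1, 2],
    "team_name": ["Sharks", "Bears"],
})

# Enrich the games with team names by joining on the team identifier.
schedule = games.merge(
    teams, left_on="home_team_id", right_on="team_id", how="left"
)
print(schedule[["game_date", "team_name"]])
```

A left join keeps every game even when a team row is missing, which makes unmatched keys easy to spot during validation.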

Data quality validation

Data quality validation is an important phase in data pipelines, as it ensures the correctness of the data used in analyses. Without correct data, even good analytical tools will produce incorrect insights. So, customers and developers need to focus on the data quality phase to create accurate datasets for further analysis.

What is the difference between data quality validation and data cleansing? Some of us might confuse the two. In reality, there is some overlap between the phases, and some activities are used interchangeably:

  • Data cleansing is the phase where we clean and deduplicate data and identify generic data issues, such as splitting data for more meaningful analysis, cleansing data errors, and so on. Without cleansing, the data might not be useful for analysis efforts. For example, in a student database and results table, the score column can have non-numeric values or missing...
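The student-results example above can be sketched as a couple of pandas quality checks; the table and its rules (scores present and within 0-100) are hypothetical:

```python
import pandas as pd

# Hypothetical student results table with quality problems.
results = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "score": ["85", "n/a", None, "150"],
})

# Coerce to numeric: non-numeric strings become NaN.
scores = pd.to_numeric(results["score"], errors="coerce")

# Quality rules: scores must be present and within 0-100.
checks = {
    "missing_or_non_numeric": int(scores.isna().sum()),
    "out_of_range": int((scores > 100).sum()),
}
print(checks)  # {'missing_or_non_numeric': 2, 'out_of_range': 1}
```

In a production pipeline, such rules would typically be codified in a validation framework rather than ad hoc checks, but the logic is the same.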

Data visualization

Data visualization is the phase where you create visuals and charts to better communicate the findings of your analysis to business users. A picture is worth a thousand words, and an idea or message is often better communicated with charts and dashboards than with tables or raw text.

Visualization with Python libraries

In this section, we will explore creating dashboards using Python libraries such as matplotlib (https://matplotlib.org/) and Seaborn (https://seaborn.pydata.org/index.html). There are other Python libraries that can help in visualizing data, but we will not compare all of them here.

We will use the same sports dataset for this section as well. There are other datasets, such as the NY taxi trips dataset, that we could use to cover different visualization aspects, but we will stay with the sports data for continuity in this chapter. Let us consider a use case where we want to visualize the following requirements:

  • The number of tickets sold on...
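As a minimal sketch of the first requirement, the snippet below aggregates hypothetical sales into tickets sold per day with pandas; the commented line shows how matplotlib could render it as a bar chart:

```python
import pandas as pd

# Hypothetical ticket sales with a sale date per ticket.
sales = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4, 5],
    "sale_date": ["2023-09-01", "2023-09-01", "2023-09-02",
                  "2023-09-02", "2023-09-02"],
})

# Aggregate: number of tickets sold per day.
per_day = sales.groupby("sale_date")["ticket_id"].count()
print(per_day)

# With matplotlib installed, one line renders the aggregate as a bar chart:
#   per_day.plot(kind="bar", title="Tickets sold per day")
```

Doing the aggregation in pandas first keeps the plotting step trivial, whichever charting library you choose.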

Summary

In this chapter, we learned how to perform pandas operations on datasets for data-wrangling purposes. We explored the various stages of the data-wrangling life cycle, from discovery through structuring, cleansing, enriching, data quality validation, and visualization, along with the pandas operations that perform those activities seamlessly. Users can use AWS SDK for pandas, aka awswrangler, with pandas DataFrames to perform data-wrangling activities against AWS cloud services.

In the next chapter, we are going to learn about SageMaker Data Wrangler, which helps in performing data-wrangling activities as a part of ML pipelines.



Table name: mlb_data
Description: Contains stats...