You're reading from  Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st Edition
Authors (3):
Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.
Read more about Navnit Shukla

Sankar M

Sankar Sundaram has been working in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, helping them build secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.
Read more about Sankar M

Sampat Palani

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup co-founder, and IT leader. He holds a master's degree in Business Administration with a dual specialization in Information Technology. His professional career spans five countries across the financial services, management consulting, and technology industries. He is currently a Senior Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.
Read more about Sampat Palani


Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas

In the previous chapters, we learned about the data-wrangling process and how to utilize different services for data-wrangling activities within the AWS ecosystem:

  • We explored AWS Glue DataBrew, which helps users create a data-wrangling pipeline through a GUI-based approach suited to every type of user.
  • We also went through SageMaker Data Wrangler, which likewise offers a GUI-based approach to building a data-wrangling pipeline, but is more closely aligned with machine learning workloads and integrates tightly with the SageMaker service.
  • We also explored AWS SDK for Pandas, aka awswrangler, which offers a hands-on coding approach to data wrangling by integrating the pandas library with the AWS ecosystem. It suits users who are comfortable with Python programming and the pandas library and its capabilities.
  • We also went through different AWS services such as Amazon S3...

A solution walkthrough for sportstickets.com

We will walk through a fictional example, sportstickets.com, a sports-ticketing franchise. This company manages different sporting events and sells tickets for them at a discounted rate. The business analysts at sportstickets.com want to set up an end-to-end data-wrangling pipeline for performing ticket sales analysis on the data.

We will explore the different phases of the data-wrangling pipeline and explain how the Pandas library will help in performing those operations in an effective and performant manner.

Figure 9.1: Different phases of the data-wrangling pipeline

Prerequisites for data ingestion

To perform data-wrangling activities for the preceding use case, we first need to ingest data into a data lake. To ingest data from on-premises databases into a cloud environment, we have the following options:

  1. Extract data programmatically using SQL queries from...
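As a minimal sketch of the first option, the snippet below extracts data programmatically with a SQL query. It uses an in-memory SQLite database with a hypothetical ticket_sales table as a stand-in for the on-premises source; the commented awswrangler call shows where the extracted frame could land in the data lake:

```python
import sqlite3
import pandas as pd

# Stand-in for an on-premises database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticket_sales (sale_id INTEGER, event TEXT, price REAL);
    INSERT INTO ticket_sales VALUES (1, 'MLB', 45.0), (2, 'NFL', 120.0);
""")

# Extract programmatically with a SQL query into a pandas DataFrame.
df = pd.read_sql_query("SELECT * FROM ticket_sales", conn)

# In a real pipeline, you would land this in the data lake, for example:
#   import awswrangler as wr
#   wr.s3.to_parquet(df, path="s3://my-data-lake/raw/ticket_sales/", dataset=True)
print(df.shape)  # (2, 3)
```

The same pattern applies to any source the database driver can reach; only the connection object changes.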

Data discovery

Data discovery is an important phase in the wrangling pipeline, as it helps users understand the data and guides how the next steps should be performed. For example, if a user looks at the data and determines that certain columns have missing values, the data cleansing step should fix those values, and any missing columns can be added by joining the data with other data sources or deriving them from the raw data. Essentially, this step gives users an idea of the completeness, usefulness, and relevance of the dataset.

There are multiple ways to perform data discovery, including downloading small files to a local machine and exploring them in Excel. Here, we will look at ways to explore raw data stored in a data lake. Some of the common steps performed during the data discovery phase are as follows:

  • Identifying the source data structure/format and its associated properties
  • Visualizing the data distribution on the dataset
  • Validating...
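A few of these discovery steps can be sketched with plain pandas; the ticket sample below is hypothetical, not part of the book's dataset:

```python
import pandas as pd

# Hypothetical sample of raw ticket data for discovery.
df = pd.DataFrame({
    "event": ["MLB", "NFL", "MLB", None],
    "price": [45.0, 120.0, 50.0, 45.0],
})

# Identify structure/format: column names and inferred dtypes.
print(df.dtypes)

# Completeness: count missing values per column.
missing = df.isna().sum()
print(missing)  # the event column has 1 missing value

# Distribution: summary statistics and category frequencies.
print(df["price"].describe())
print(df["event"].value_counts())
```

On a real data lake, the same calls apply once the data is loaded, for example with awswrangler's S3 readers.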

Data structuring

Now we come to the part where we need to restructure the data from its raw format into a usable one. In our use case, we extracted data in JSON format, and that raw format served us well for exploratory analysis. As we move further into the data-wrangling pipeline, however, different file formats and structures become more efficient.

Different file formats and when to use them

There are different file formats that are commonly used in data pipelines:

  • Readable file formats: CSV, JSON, and Extensible Markup Language (XML) are some file formats that are readable by human users:
    • CSV files are used mostly in the data extraction phase, when the data needs to be shared with analysts for reading and further action. The advantage is that you don’t need any programming language to read these files; they can be opened in most commonly available text editors. These file formats were widely popular in the data analytics community earlier...
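As a small illustration of these readable formats, the sketch below round-trips a hypothetical two-row frame through CSV and JSON with pandas; both stay human-readable and lossless for a flat schema like this:

```python
import io
import pandas as pd

df = pd.DataFrame({"event": ["MLB", "NFL"], "price": [45.0, 120.0]})

# CSV: human-readable, opens in any text editor or spreadsheet.
csv_text = df.to_csv(index=False)
print(csv_text)

# JSON: human-readable and can preserve nesting; common for raw extracts.
json_text = df.to_json(orient="records")
print(json_text)

# Both round-trip back into pandas without loss for this flat schema.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))
```

For large analytical workloads, columnar formats such as Parquet usually replace these readable formats further down the pipeline.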

Data cleaning

Data cleaning is an important step in the process of data wrangling. A good amount of time is spent on identifying the right data source and cleaning the data. Pandas provides a lot of functionality for cleaning your data.

The exact activities required during this phase differ for each dataset. Some data sources contain data that requires only minimal cleaning, while others might require extensive cleaning before the dataset can be used in your project. You can also use the output of the data exploration activities to gauge the level of cleaning to be performed on the data.

Data cleansing with Pandas

To demonstrate the data cleaning steps, we will use the seat_type table from our database. This table holds only a small volume of data, so we will insert some rows before we proceed with data cleansing.

The data in seat_type looks like the screenshot here. It has three columns for the...
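To sketch the kind of cleansing applied here, the snippet below builds a hypothetical seat_type frame (the real table's columns may differ) with a duplicate row and a missing value, then removes both with pandas:

```python
import pandas as pd

# Hypothetical seat_type rows; the actual table's schema may differ.
seat_type = pd.DataFrame({
    "seat_type_id": [1, 2, 2, 3],
    "name": ["luxury", "standard", "standard", None],
    "description": ["Box seats", "Regular", "Regular", "Obstructed view"],
})

# Drop exact duplicate rows.
cleaned = seat_type.drop_duplicates()

# Fill missing names with a placeholder so downstream joins do not drop rows.
cleaned = cleaned.fillna({"name": "unknown"})

print(cleaned)
```

Whether to fill, drop, or flag missing values depends on how the column is used downstream; a placeholder is just one option.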

Data enrichment

We have learned how to extract data from various sources, put it into a common or desired structure, and cleanse data issues. However, the data is still siloed by individual source, and it would be valuable to combine data from multiple sources to enrich it with additional information.

Let us consider a use case where we will publish a schedule of football games for a specific team. Before we do that, we must understand our dataset in order to join the relevant tables. We will first print the list of tables available in our source database.

Figure 9.49: Printing all tables from the source database for a sporting ticket dataset

We have a list of tables, and the following diagram will illustrate the relationship across different tables in our dataset:
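The kind of enrichment described here can be sketched as a pandas join; the games and teams frames below are hypothetical stand-ins for tables in the sporting ticket schema:

```python
import pandas as pd

# Hypothetical tables standing in for the sporting ticket schema.
games = pd.DataFrame({
    "game_id": [101, 102],
    "home_team_id": [1, 2],
    "game_date": ["2023-09-01", "2023-09-08"],
})
teams = pd.DataFrame({
    "team_id": [1, 2],
    "team_name": ["Sharks", "Bears"],
})

# Enrich the games with team names by joining on the team identifier.
schedule = games.merge(
    teams, left_on="home_team_id", right_on="team_id", how="left"
)
print(schedule[["game_date", "team_name"]])
```

A left join keeps every game even when a team row is missing, which makes unmatched keys easy to spot during validation.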

Data quality validation

Data quality validation is an important phase in data pipelines, as it ensures the correctness of the data used in analyses. Without correct data, even good analytical tools will produce incorrect insights. So, customers and developers need to focus on the data quality phase to create accurate datasets for further analysis.

What is the difference between data quality validation and data cleansing? Some of us might confuse the two. In reality, there is some overlap between the phases, and some activities are used interchangeably:

  • Data cleansing is the phase where we clean and deduplicate data and identify generic data issues, such as splitting data for more meaningful analysis, cleansing data errors, and so on. Without cleansing, the data might not be useful for analysis efforts. For example, in a student database and results table, the score column can have non-numeric values or missing...
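The student-results example above can be sketched as a couple of pandas quality checks; the table and its rules (scores present and within 0-100) are hypothetical:

```python
import pandas as pd

# Hypothetical student results table with quality problems.
results = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "score": ["85", "n/a", None, "150"],
})

# Coerce to numeric: non-numeric strings become NaN.
scores = pd.to_numeric(results["score"], errors="coerce")

# Quality rules: scores must be present and within 0-100.
checks = {
    "missing_or_non_numeric": int(scores.isna().sum()),
    "out_of_range": int((scores > 100).sum()),
}
print(checks)  # {'missing_or_non_numeric': 2, 'out_of_range': 1}
```

In a production pipeline, such rules would typically be codified in a validation framework rather than ad hoc checks, but the logic is the same.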

Data visualization

Data visualization is the phase where you create visuals and charts to better communicate the findings of your analysis to business users. A picture is worth a thousand words, and an idea or message is often better communicated with charts and dashboards than with tables or raw text.

Visualization with Python libraries

In this section, we will explore creating dashboards using Python libraries such as matplotlib (https://matplotlib.org/) and Seaborn (https://seaborn.pydata.org/index.html). There are other Python libraries that can help in visualizing data, but we will not compare all of them here.

We will use the same sports dataset for this section as well. There are other datasets, such as the NY taxi trips dataset, that we could use to cover different visualization aspects, but we will stay with the sports data for continuity in this chapter. Let us consider a use case where we want to visualize the following requirements:

  • The number of tickets sold on...
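As a minimal sketch of the first requirement, the snippet below aggregates hypothetical sales into tickets sold per day with pandas; the commented line shows how matplotlib could render it as a bar chart:

```python
import pandas as pd

# Hypothetical ticket sales with a sale date per ticket.
sales = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4, 5],
    "sale_date": ["2023-09-01", "2023-09-01", "2023-09-02",
                  "2023-09-02", "2023-09-02"],
})

# Aggregate: number of tickets sold per day.
per_day = sales.groupby("sale_date")["ticket_id"].count()
print(per_day)

# With matplotlib installed, one line renders the aggregate as a bar chart:
#   per_day.plot(kind="bar", title="Tickets sold per day")
```

Doing the aggregation in pandas first keeps the plotting step trivial, whichever charting library you choose.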

Summary

In this chapter, we learned how to perform pandas operations on datasets for data-wrangling purposes. We explored the various stages of the data-wrangling life cycle, from discovery through structuring, cleansing, enriching, data quality validation, and visualization, along with the pandas operations that perform those activities seamlessly. Users can use AWS SDK for pandas, aka awswrangler, with pandas DataFrames to perform data-wrangling activities against AWS cloud services.

In the next chapter, we are going to learn about SageMaker Data Wrangler, which helps in performing data-wrangling activities as a part of ML pipelines.



Table name: mlb_data
Description: Contains stats...