
You're reading from Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st Edition
Authors (3):
Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has been working in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, delivering secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.

Sampat Palani

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup co-founder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans five countries and the financial services, management consulting, and technology industries. He is currently a Senior Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.


Data Processing for Machine Learning with SageMaker Data Wrangler

In Chapter 4, we introduced you to SageMaker Data Wrangler, a purpose-built tool for processing data for machine learning. We discussed why data processing is such a critical component of the overall machine learning pipeline, along with the risks of working with unclean or raw data. We also covered the core capabilities of SageMaker Data Wrangler and how it helps solve some of the key challenges involved in data processing for machine learning.

In this chapter, we will take things further by building a practical, step-by-step data flow to preprocess an example dataset for machine learning. We will start with an example dataset that comes preloaded with SageMaker Data Wrangler and then do some basic exploratory data analysis using Data Wrangler's built-in analyses. We will also add a couple of custom checks for imbalance and bias in the dataset. Feature engineering is a key step in the machine learning...

Technical requirements

If you wish to follow along, which I highly recommend, you will need an Amazon Web Services (AWS) account. If you do not have an existing account, you can create an AWS account under the Free Tier. The AWS Free Tier provides customers with the ability to explore and try out AWS services free of charge up to specified limits for each service. If your application usage exceeds the Free Tier limits, you simply pay standard, pay-as-you-go service rates. In this chapter, we will get started by looking at how to access and get familiar with the SageMaker Data Wrangler user interface. As you follow along, you will use AWS compute resources and also end up creating resources in your AWS account. This especially applies to the Training a machine learning model section of the chapter, which is both compute-intensive and creates an endpoint that you will have to delete. Please remember to clean up by deleting any unused resources. We will remind you again at the end of the chapter...

Step 1 – logging in to SageMaker Studio

In this section, we will cover the steps to log in and navigate inside the AWS console and SageMaker. If you are already familiar with using SageMaker, you can skip this section and move on directly to the next one.

After you have created your account and set up a SageMaker Studio domain and created a user, as covered in Chapter 4, you can log in to the AWS console and choose SageMaker. You can either navigate to SageMaker in the All Services section under Machine Learning or start typing SageMaker in the search box at the top of the AWS console.

Figure 10.1: AWS console – SageMaker

Once you are on the SageMaker screen, you should see the domain you created in the prerequisite section in Chapter 4. Make sure that the status of the domain is InService before proceeding. If you do not see a domain at all, verify that you are in the same region where you created your domain. Check and switch...

Step 2 – importing data

Before we can start importing data into SageMaker Data Wrangler, we need to create a connection with our data source. SageMaker Data Wrangler provides out-of-the-box native connectors to Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, Amazon EMR, and Databricks. Besides that, you can also set up new data sources with over 40 SaaS and web applications using Amazon AppFlow, a fully managed integration service that helps you securely transfer data between software as a service (SaaS) applications. The Create connection screen shows the connectors in Data Wrangler, along with additional data sources you can set up using Amazon AppFlow.

Figure 10.5: Data Wrangler data sources

In this chapter, we will use a publicly available example, the Titanic dataset. The Titanic dataset is considered the “Hello World” of machine learning datasets due to the number of commonly used data processing and machine learning techniques...
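To make the import step concrete, the sketch below loads a few rows in the Titanic dataset's well-known schema with pandas. The rows shown are a small illustrative sample; in Data Wrangler the full file is read from an S3 location rather than an in-memory string.

```python
import io

import pandas as pd

# A few rows in the Titanic dataset's standard schema; in practice the
# full CSV would be read from S3 (for example, pd.read_csv("s3://...")).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S\n'
)

df = pd.read_csv(sample_csv)
print(df.shape)  # (3, 12)
```

The 12 columns mix numeric features (Age, Fare), categoricals (Sex, Embarked), free text (Name, Ticket), and a sparse Cabin column, which is exactly why the dataset exercises so many preprocessing techniques.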

Exploratory data analysis

Before we do any data transformation or manipulations, we need to get a good understanding of our data. Exploratory data analysis (EDA) is a crucial step in data science because it allows us to understand the structure and characteristics of the data we’re working with. EDA involves the use of various techniques and tools to summarize and visualize data in order to identify patterns, trends, and relationships. It is also important that we perform this step before we do any data transformations or modeling because EDA can help us understand which features are relevant and which are most important for the machine learning problem we are trying to solve. EDA can help you understand the distribution of data and identify any relationships that exist between the features in your dataset. When working with real-world data, you will inevitably encounter data quality issues such as missing data, imbalance in various classes, errors in data collection, and outliers...
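The custom-code analyses mentioned above are typically a few lines of pandas. As a minimal sketch (the column names and values below are illustrative, not the full Titanic data), here is how missing values and class imbalance can be checked:

```python
import pandas as pd

# Toy frame mimicking Titanic-style columns (values are illustrative).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0],
    "Sex": ["male", "female", "female", "male", "male", None],
    "Age": [22.0, 38.0, None, 35.0, None, 54.0],
})

# Missing values per column - flags fields that will need imputation.
missing = df.isna().sum()

# Class balance of the target - a skewed ratio signals imbalance.
balance = df["Survived"].value_counts(normalize=True)

print(missing["Age"])  # 2
```

A heavily skewed `balance` (here two-thirds of rows are class 0) is the kind of finding that motivates the imbalance and bias checks added later in the flow.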

Step 4 – adding transformations

As part of your data analysis, you might have noticed elements of your dataset that you want to change or transform. The goal of data transformation is to make data more suitable for modeling, to improve the performance of machine learning algorithms, or to handle missing or corrupted values. Data transformations for machine learning can include things such as normalization, standardization, data encoding, and binning. Not all datasets are alike, and not all transformations apply to all datasets. The goal of data analysis is to identify specific transformations for your dataset. While we typically apply data transformation as an early step in the machine learning pipeline, before data is used to train a model, in real-world machine learning, we continually monitor our model performance and apply transformations as necessary. After you have imported and inspected your dataset in Data Wrangler, you can start adding transformations to your data flow...
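The three transformation families named above (imputation, encoding, binning) each map to a one-liner in pandas. This is a hedged sketch on a toy frame, not the exact code Data Wrangler generates:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Imputation: fill the missing Age with the column median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Encoding: one-hot encode the categorical Sex column.
df = pd.get_dummies(df, columns=["Sex"])

# Binning: bucket Fare into three quantile-based ranges.
df["FareBand"] = pd.qcut(df["Fare"], q=3, labels=["low", "mid", "high"])

print(sorted(df.columns))
```

Which of these apply, and in what order, is exactly what the preceding analysis step is meant to tell you for your particular dataset.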

Step 5 – exporting data

So far, we have performed several analyses on our dataset. We have also defined several feature engineering data transformations. However, it is important to remember that we have made no changes to the actual data itself yet. We have defined the data flow, which contains a series of analysis and transformation steps that can be executed before we build machine learning models. If you check the data flow, it will look something similar to the following:

Figure 10.30: Completed data flow

Data Wrangler provides you with several options to export your data flow:

  • Exporting to S3: Data Wrangler gives you the ability to export your data to a location within an Amazon S3 bucket. You can do this by clicking the + button next to a data transform step and choosing Export To, and then Export to S3. Data Wrangler will create a Jupyter notebook that contains the code to do all the transformations as defined in your data flow and...
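The final cell of the generated notebook essentially writes the transformed frame out. The sketch below shows that step offline, writing to a temporary local file; in the real notebook the destination would be an S3 URI under your own bucket (the bucket name would be yours, not anything shown here):

```python
import os
import tempfile

import pandas as pd

# Stand-in for the transformed data produced by the flow (illustrative).
df = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]})

# The generated notebook targets an S3 path; writing to a local temp
# directory demonstrates the same export step without an AWS account.
out_path = os.path.join(tempfile.mkdtemp(), "titanic_processed.csv")
df.to_csv(out_path, index=False)

print(os.path.exists(out_path))  # True
```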

Training a machine learning model

We have now used Data Wrangler to do data analysis and processing, which involved several steps, such as data cleaning, preprocessing, feature engineering, and exploratory data analysis. These steps are crucial before doing machine learning, as they ensure that data is in the correct format, we selected the relevant features, and we dealt with outliers and missing values using data transformation. Data Wrangler provides a unified experience, enabling you to prepare data and seamlessly train a machine learning model, all from within the tool.

SageMaker Autopilot is a tool that automates the key tasks of an automatic machine learning (AutoML) process. This includes exploring your data, selecting algorithms relevant to your problem type, and preparing the data to facilitate model training and tuning. With just a few clicks, you can automatically build, train, and tune ML models using Autopilot, XGBoost, or your own algorithm, directly from the Data...
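Before letting Autopilot spend compute exploring algorithms, it helps to know the number any model must beat. A majority-class baseline is the simplest such benchmark; this offline sketch (toy labels, pure pandas, not Autopilot itself) shows the idea:

```python
import pandas as pd

# Tiny labeled sets standing in for the processed Titanic target column.
train = pd.DataFrame({"Survived": [0, 0, 1, 0, 1, 0]})
test = pd.DataFrame({"Survived": [0, 1, 0]})

# Always predict the most common training label; Autopilot's candidate
# models (for example, XGBoost) should clear this accuracy comfortably.
majority = train["Survived"].mode()[0]
accuracy = (test["Survived"] == majority).mean()

print(majority, round(accuracy, 2))  # 0 0.67
```

On an imbalanced target like Titanic survival, raw accuracy near the majority-class rate is a warning sign, which is one reason the earlier imbalance analysis matters.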

Summary

SageMaker Data Wrangler is a purpose-built tool specifically for analyzing and processing data for machine learning. It is also one of the foundational platforms for machine learning on AWS. This has been a long chapter, and although we covered several key features of Data Wrangler, there are still a few features that we left out of this book. We started by looking at how to log in to SageMaker Studio and access Data Wrangler. For the sample dataset, we used the built-in Titanic dataset that is available via a public S3 bucket. We imported this dataset into Data Wrangler via the default sampling method. We then performed EDA, first by using the built-in insights report in Data Wrangler and then by adding additional analysis, including using our custom code. Next, we defined several data transformation steps for our Data Wrangler flow to do feature engineering. For this, we used several built-in data transformations in Data Wrangler. We also looked at applying a custom data...

