
You're reading from Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st Edition
Authors (3):
Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has been working in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, delivering secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.

Sampat Palani

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup co-founder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans five countries and the financial services, management consulting, and technology industries. He is currently a Senior Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.


Data Processing for Machine Learning with SageMaker Data Wrangler

In Chapter 4, we introduced you to SageMaker Data Wrangler, a purpose-built tool for processing data for machine learning. We discussed why data processing is such a critical component of the overall machine learning pipeline, along with the risks of working with unclean or raw data. We also covered the core capabilities of SageMaker Data Wrangler and how it helps solve some of the key challenges involved in data processing for machine learning.

In this chapter, we will take things further by building a practical, step-by-step data flow to preprocess an example dataset for machine learning. We will start with an example dataset that comes preloaded with SageMaker Data Wrangler and then do some basic exploratory data analysis using Data Wrangler's built-in analyses. We will also add a couple of custom checks for imbalance and bias in the dataset. Feature engineering is a key step in the machine learning...

Technical requirements

If you wish to follow along, which I highly recommend, you will need an Amazon Web Services (AWS) account. If you do not have an existing account, you can create an AWS account under the Free Tier. The AWS Free Tier provides customers with the ability to explore and try out AWS services free of charge up to specified limits for each service. If your application usage exceeds the Free Tier limits, you simply pay standard, pay-as-you-go service rates. In this chapter, we will get started by looking at how to access and get familiar with the SageMaker Data Wrangler user interface. As you follow along, you will use AWS compute resources and also end up creating resources in your AWS account. This especially applies to the Training a machine learning model section of the chapter, which is both compute-intensive and creates an endpoint that you will have to delete. Please remember to clean up by deleting any unused resources. We will remind you again at the end of the chapter...

Step 1 – logging in to SageMaker Studio

In this section, we will cover the steps to log in and navigate inside the AWS console and SageMaker. If you are already familiar with using SageMaker, you can skip this section and move on directly to the next one.

After you have created your account and set up a SageMaker Studio domain and created a user, as covered in Chapter 4, you can log in to the AWS console and choose SageMaker. You can either navigate to SageMaker in the All Services section under Machine Learning or start typing SageMaker in the search box at the top of the AWS console.

Figure 10.1: AWS console – SageMaker

Once you are on the SageMaker screen, you should see the domain you created in the prerequisite section in Chapter 4. Make sure that the status of the domain is InService before proceeding. If you do not see a domain at all, verify that you are in the same region where you created your domain. Check and switch...

Step 2 – importing data

Before we can start importing data into SageMaker Data Wrangler, we need to create a connection with our data source. SageMaker Data Wrangler provides out-of-the-box native connectors to Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, Amazon EMR, and Databricks. Besides that, you can also set up new data sources with over 40 SaaS and web applications using Amazon AppFlow, a fully managed integration service that helps you securely transfer data between software as a service (SaaS) applications. The Create connection screen shows the connectors in Data Wrangler, along with additional data sources you can set up using Amazon AppFlow.

Figure 10.5: Data Wrangler data sources

In this chapter, we will use a publicly available example, the Titanic dataset. The Titanic dataset is considered the “Hello World” of machine learning datasets due to the number of commonly used data processing and machine learning techniques...
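To make the import step concrete, the sketch below loads a few rows in the Titanic dataset's well-known schema with pandas. The rows shown are a small illustrative sample; in Data Wrangler the full file is read from an S3 location rather than an in-memory string.

```python
import io

import pandas as pd

# A few rows in the Titanic dataset's standard schema; in practice the
# full CSV would be read from S3 (for example, pd.read_csv("s3://...")).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S\n'
)

df = pd.read_csv(sample_csv)
print(df.shape)  # (3, 12)
```

The 12 columns mix numeric features (Age, Fare), categoricals (Sex, Embarked), free text (Name, Ticket), and a sparse Cabin column, which is exactly why the dataset exercises so many preprocessing techniques.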

Exploratory data analysis

Before we do any data transformation or manipulations, we need to get a good understanding of our data. Exploratory data analysis (EDA) is a crucial step in data science because it allows us to understand the structure and characteristics of the data we’re working with. EDA involves the use of various techniques and tools to summarize and visualize data in order to identify patterns, trends, and relationships. It is also important that we perform this step before we do any data transformations or modeling because EDA can help us understand which features are relevant and which are most important for the machine learning problem we are trying to solve. EDA can help you understand the distribution of data and identify any relationships that exist between the features in your dataset. When working with real-world data, you will inevitably encounter data quality issues such as missing data, imbalance in various classes, errors in data collection, and outliers...
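The custom-code analyses mentioned above are typically a few lines of pandas. As a minimal sketch (the column names and values below are illustrative, not the full Titanic data), here is how missing values and class imbalance can be checked:

```python
import pandas as pd

# Toy frame mimicking Titanic-style columns (values are illustrative).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0],
    "Sex": ["male", "female", "female", "male", "male", None],
    "Age": [22.0, 38.0, None, 35.0, None, 54.0],
})

# Missing values per column - flags fields that will need imputation.
missing = df.isna().sum()

# Class balance of the target - a skewed ratio signals imbalance.
balance = df["Survived"].value_counts(normalize=True)

print(missing["Age"])  # 2
```

A heavily skewed `balance` (here two-thirds of rows are class 0) is the kind of finding that motivates the imbalance and bias checks added later in the flow.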

Step 4 – adding transformations

As part of your data analysis, you might have noticed elements of your dataset that you want to change or transform. The goal of data transformation is to make data more suitable for modeling, to improve the performance of machine learning algorithms, or to handle missing or corrupted values. Data transformations for machine learning can include things such as normalization, standardization, data encoding, and binning. Not all datasets are alike, and not all transformations apply to all datasets. The goal of data analysis is to identify specific transformations for your dataset. While we typically apply data transformation as an early step in the machine learning pipeline, before data is used to train a model, in real-world machine learning, we continually monitor our model performance and apply transformations as necessary. After you have imported and inspected your dataset in Data Wrangler, you can start adding transformations to your data flow...
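The three transformation families named above (imputation, encoding, binning) each map to a one-liner in pandas. This is a hedged sketch on a toy frame, not the exact code Data Wrangler generates:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Imputation: fill the missing Age with the column median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Encoding: one-hot encode the categorical Sex column.
df = pd.get_dummies(df, columns=["Sex"])

# Binning: bucket Fare into three quantile-based ranges.
df["FareBand"] = pd.qcut(df["Fare"], q=3, labels=["low", "mid", "high"])

print(sorted(df.columns))
```

Which of these apply, and in what order, is exactly what the preceding analysis step is meant to tell you for your particular dataset.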

Step 5 – exporting data

So far, we have performed several analyses on our dataset. We have also defined several feature engineering data transformations. However, it is important to remember that we have made no changes to the actual data itself yet. We have defined the data flow, which contains a series of analysis and transformation steps that can be executed before we build machine learning models. If you check the data flow, it will look something similar to the following:

Figure 10.30: Completed data flow

Data Wrangler provides you with several options to export your data flow:

  • Exporting to S3: Data Wrangler gives you the ability to export your data to a location within an Amazon S3 bucket. You can do this by clicking the + button next to a data transform step and choosing Export To, and then Export to S3. Data Wrangler will create a Jupyter notebook that contains the code to do all the transformations as defined in your data flow and...
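The final cell of the generated notebook essentially writes the transformed frame out. The sketch below shows that step offline, writing to a temporary local file; in the real notebook the destination would be an S3 URI under your own bucket (the bucket name would be yours, not anything shown here):

```python
import os
import tempfile

import pandas as pd

# Stand-in for the transformed data produced by the flow (illustrative).
df = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]})

# The generated notebook targets an S3 path; writing to a local temp
# directory demonstrates the same export step without an AWS account.
out_path = os.path.join(tempfile.mkdtemp(), "titanic_processed.csv")
df.to_csv(out_path, index=False)

print(os.path.exists(out_path))  # True
```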

Training a machine learning model

We have now used Data Wrangler to do data analysis and processing, which involved several steps, such as data cleaning, preprocessing, feature engineering, and exploratory data analysis. These steps are crucial before doing machine learning, as they ensure that data is in the correct format, we selected the relevant features, and we dealt with outliers and missing values using data transformation. Data Wrangler provides a unified experience, enabling you to prepare data and seamlessly train a machine learning model, all from within the tool.

SageMaker Autopilot is a tool that automates the key tasks of an automatic machine learning (AutoML) process. This includes exploring your data, selecting algorithms relevant to your problem type, and preparing the data to facilitate model training and tuning. With just a few clicks, you can automatically build, train, and tune ML models using Autopilot, XGBoost, or your own algorithm, directly from the Data...
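Before letting Autopilot spend compute exploring algorithms, it helps to know the number any model must beat. A majority-class baseline is the simplest such benchmark; this offline sketch (toy labels, pure pandas, not Autopilot itself) shows the idea:

```python
import pandas as pd

# Tiny labeled sets standing in for the processed Titanic target column.
train = pd.DataFrame({"Survived": [0, 0, 1, 0, 1, 0]})
test = pd.DataFrame({"Survived": [0, 1, 0]})

# Always predict the most common training label; Autopilot's candidate
# models (for example, XGBoost) should clear this accuracy comfortably.
majority = train["Survived"].mode()[0]
accuracy = (test["Survived"] == majority).mean()

print(majority, round(accuracy, 2))  # 0 0.67
```

On an imbalanced target like Titanic survival, raw accuracy near the majority-class rate is a warning sign, which is one reason the earlier imbalance analysis matters.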

Summary

SageMaker Data Wrangler is a purpose-built tool specifically for analyzing and processing data for machine learning. It is also one of the foundational platforms for machine learning on AWS. This has been a long chapter, and although we covered several key features of Data Wrangler, there are still a few features that we left out of this book. We started by looking at how to log in to SageMaker Studio and access Data Wrangler. For the sample dataset, we used the built-in Titanic dataset that is available via a public S3 bucket. We imported this dataset into Data Wrangler via the default sampling method. We then performed EDA, first by using the built-in insights report in Data Wrangler and then by adding additional analysis, including using our custom code. Next, we defined several data transformation steps for our Data Wrangler flow to do feature engineering. For this, we used several built-in data transformations in Data Wrangler. We also looked at applying a custom data...

