You're reading from Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st
Authors (3):
Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has been working in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, delivering secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.

Sampat Palani

Sam Palani has more than 18 years of experience as a developer, data engineer, data scientist, startup cofounder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans five countries across the financial services, management consulting, and technology industries. He is currently a Senior Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.


Introduction to SageMaker Data Wrangler

Data processing is an integral part of machine learning (ML). In fact, it is no stretch to say that ML models are only as good as the data used to train them. According to a 2016 Forbes survey, 80% of the time on an ML engineering project is spent on data preparation. That is an astonishingly high percentage. Why is that the case? Because of the inherent characteristics of real-world data, data preparation is both tedious and resource intensive. This real-world data is often referred to as dirty, unclean, noisy, or raw data in ML. In almost all cases, this is the type of data you begin your ML process with. Even in the rare scenario where you think you have good data, you still need to ensure that it is in the right format and scaled appropriately to be useful. Applying ML algorithms to this raw data would not give quality results, as they would fail to identify patterns, detect anomalies correctly, or generalize well enough outside their...
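As a minimal illustration of the formatting and scaling problems mentioned above, the following sketch (plain pandas, with hypothetical column names) imputes missing values and min-max scales numeric features before they would be handed to a model:

```python
import pandas as pd

# Hypothetical raw dataset: missing values and wildly different scales
df = pd.DataFrame({
    "age": [25, None, 47, 31],
    "income": [42_000, 58_000, None, 61_000],
})

# Impute missing values with each column's median
df = df.fillna(df.median(numeric_only=True))

# Min-max scale each column into [0, 1] so no feature dominates
scaled = (df - df.min()) / (df.max() - df.min())
```

This is only a toy version of what data preparation tooling automates; real pipelines must also handle categorical data, outliers, and train/test leakage.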

Data import

Before you start processing your data with SageMaker Data Wrangler, you first need to import it into Data Wrangler. Using Data Wrangler, you can connect to and import data from a variety of data stores. When you start Data Wrangler for the first time, the first screen asks whether you want to import data or use a sample dataset:

Figure 4.1 – Data Wrangler import

Amazon S3 is an object-based data store that has quickly become the de facto storage of the internet. Thanks to its low cost per GB and high reliability, you can store and retrieve any amount of data, at any time, from anywhere on the web. You can upload and access data either through the console or programmatically using APIs, the latter being the most common way to work with data in Amazon S3. Amazon S3 implements a bucket-and-object architecture: you can think of a bucket as a folder and objects as files logically stored inside it. SageMaker...

Data orchestration

Data orchestration can be defined as the process of combining data from various sources, including the steps to import, transform, and load it into the destination data store. Its fundamental principle is the ability to automate all the steps involved in data preparation in a repeatable and reusable form, which can then be integrated with the overall ML pipelines. While data orchestration can be used in a wider context that also includes resource provisioning, scaling, and monitoring, its core is creating and automating data workflows, and this is where we will focus for the remainder of the book. The other heavy-lifting tasks of provisioning, scaling, and monitoring are taken care of by AWS. SageMaker Data Wrangler uses a data flow to connect datasets and perform transformation and analysis steps. This data flow can be used to define your data pipeline and consists of all the steps involved in data preparation...
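Conceptually, a data flow is an ordered list of steps, each taking a data frame and returning a new one. A minimal pure-pandas sketch of that idea (the step names and columns are illustrative, not Data Wrangler's API):

```python
import pandas as pd

def drop_nulls(df):
    """Step 1: remove rows with missing values."""
    return df.dropna()

def add_total(df):
    """Step 2: derive a new feature from existing columns."""
    return df.assign(total=df["price"] * df["quantity"])

# The flow is just an ordered, reusable pipeline of steps;
# each step yields a fresh data frame for the next one
flow = [drop_nulls, add_total]

df = pd.DataFrame({"price": [9.5, None, 3.0], "quantity": [2, 1, 4]})
for step in flow:
    df = step(df)
```

Because the pipeline is data-independent, the same `flow` list can be rerun on new data, which is the repeatable-and-reusable property the text describes.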

Data transformation

Data processing for ML primarily involves data transformation. At its core, SageMaker Data Wrangler includes over 300 built-in transformations commonly used for cleaning, transforming, and featurizing your data, specifically for data science and ML. Using these built-in transformations, you can transform columns within your dataset without having to write any code. In addition, you can add custom transformations using PySpark, Python, pandas, and PySpark SQL. Some of these transformations operate in place, while others create a new output column in your dataset. Whenever you incorporate a transform into your data flow, it introduces a new step in the process. Each added transform modifies your dataset and generates a fresh data frame; any subsequent transforms are then performed on this updated data frame. In the real world, datasets are often imbalanced. This imbalance can be in the form...
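To give a flavor of what a pandas-based custom transform might contain, the sketch below normalizes a messy categorical column and one-hot encodes it into new feature columns (the `state` column and its values are hypothetical, and the snippet is generic pandas rather than Data Wrangler's exact custom-transform template):

```python
import pandas as pd

df = pd.DataFrame({"state": [" ca", "NY ", "ca"], "sales": [10, 20, 30]})

# Clean the categorical column in place: trim whitespace, unify case
df["state"] = df["state"].str.strip().str.upper()

# Featurize: one-hot encode, which adds new output columns
df = pd.get_dummies(df, columns=["state"], prefix="state")
```

Note how the first step operates in place on an existing column while the second creates new output columns, mirroring the two kinds of transforms described above.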

Insights and data quality

Generating insight reports on data and quality is part of data profiling. Data profiling is a broader term covering all the processes involved in reviewing the data source, understanding the structure and composition of the data, generating useful summaries and statistics from it, and studying any inherent relationships between features in the dataset. Before you even begin to process your data, you must first understand its state. As such, data profiling is most often one of the first steps performed after the data is imported into Data Wrangler. Data Wrangler provides a built-in feature to generate an insight report on data and quality. This report works either on the entire dataset that you import into Data Wrangler or on a sample, if you sampled the data after import. The report gives you a quick insight into common data issues such as imbalances in the data, target leakage, and...
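The kinds of checks such a report automates can be approximated by hand. A tiny pandas sketch that flags missing values and a class imbalance (the columns and the 0.7 threshold are arbitrary choices for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "feature": [1.2, None, 3.4, 5.6, 2.1],
    "target": [0, 0, 0, 0, 1],
})

# Missing-value counts per column
missing = df.isna().sum()

# Share of the majority class in the target column
majority_share = df["target"].value_counts(normalize=True).max()
imbalanced = majority_share > 0.7  # arbitrary threshold for this sketch
```

The built-in report goes much further (for example, target leakage detection), but the underlying idea is the same: compute summary statistics and compare them against quality thresholds.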

Data analysis

Closely related to generating insights on data and quality is the ability to quickly analyze the imported data. Data analysis is part of the data profiling phase and lets you build a better understanding of the data before you move on to processing it for ML. SageMaker Data Wrangler includes built-in analyses that help you generate visualizations and data analyses in a few clicks. In addition, you can create your own custom analysis using custom code. Data visualizations give you a quick overview of your entire dataset and provide an accessible way to see and understand trends, outliers, and patterns in the data. Data Wrangler provides out-of-the-box analysis tools, including histograms and scatter plots. You can create these visualizations with a few clicks and customize them with your own code. In addition to visualizations, under analysis, you can also create table summaries. Table summaries enable you to quickly summarize your...
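A table summary is conceptually close to what pandas produces with `describe()`: per-column counts, means, spread, and quartiles. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"price": [3.0, 9.5, 7.25, 4.0], "units": [12, 3, 7, 9]})

# Roughly what a table summary shows: count, mean, std,
# min/max, and quartiles for every numeric column
summary = df.describe()
```

For visual analyses, the same data frame could be passed to a plotting call such as `df.plot.scatter(x="price", y="units")`, which is the histogram/scatter-plot idea from the text in code form.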

Data export

So far, we’ve looked at Data Wrangler capabilities that let you import data into Data Wrangler and perform analysis and transformations. SageMaker Data Wrangler enables you to export all or part of these transformations as a data flow. In most cases, data processing consists of a series of transformations, each of which can be referred to as a step in Data Wrangler. A Data Wrangler flow is made up of a series of nodes that represent the import of your data and the transformations you’ve performed. As we covered earlier, one of the first steps in Data Wrangler is to import data from a supported data source, so the data source is the first node in your data flow. The next node in the data flow is the Data Types node, which signifies that Data Wrangler has executed a transformation to convert the dataset into a format suitable for further analysis and processing. Each transformation that...
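The end of such a flow is typically an export node writing the processed data to a destination such as S3. As a local stand-in for that step, the sketch below writes a processed frame to CSV and reads it back to confirm a lossless round trip (paths and columns are illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "score": [0.4, 0.9]})

# Locally, an export step is just a CSV write; against S3 the same
# frame would be written to a bucket/key instead of a temp directory
out_path = os.path.join(tempfile.mkdtemp(), "processed.csv")
df.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
```

Writing without the index (`index=False`) keeps the exported file consumable by downstream tools that expect only the data columns.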

SageMaker Studio setup prerequisites

SageMaker Data Wrangler is available as a service within Amazon SageMaker Studio. While you can still use some of the SageMaker Data Wrangler features via APIs, for the purposes of this book, we will be using Data Wrangler from within SageMaker Studio. In this section, we will cover a brief overview of SageMaker Studio and how to set up a SageMaker Studio domain and users in your AWS account.

Prerequisites

Before we can start setting up SageMaker Studio, there are a few prerequisites, as follows:

  • An AWS account.
  • An Identity and Access Management (IAM) role with the appropriate policy and permissions attached. There is an AmazonSageMakerFullAccess AWS managed policy that you can use as is or as a starting point to create your custom policy.
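Beyond the permissions policy, the IAM role also needs a trust policy that allows the SageMaker service to assume it. A sketch that builds such a trust-policy document (you would pass this as the role's assume-role policy when creating it, and attach AmazonSageMakerFullAccess separately):

```python
import json

# Trust policy allowing the SageMaker service to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Serialized form, as expected by IAM role-creation APIs
policy_json = json.dumps(trust_policy)
```

With boto3, this document would be supplied as the `AssumeRolePolicyDocument` argument to the IAM `create_role` call.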

Studio domain

You will start by creating and onboarding a SageMaker domain using the AWS console. A SageMaker domain includes an Amazon Elastic File System (Amazon EFS) volume, a list...

Summary

In this chapter, we covered an introduction to data processing on AWS, focusing specifically on ML and data science. We looked at how data processing for ML is unique and why it is such a critical and significant component of the overall ML workflow. We went through some of the challenges of dealing with large, distributed datasets and data sources, and how to work with them at scale. We discussed the importance of having a reliable and repeatable data processing workflow for ML. We then covered some of the key capabilities needed in the tooling and frameworks used for data processing for ML: the ability to detect bias present in real-world data, to detect and fix data imbalances, to perform quick and error-free transformations, to run preprocessing reports and visualizations at scale, and to ingest data at scale.

As enterprises move from experimentation and research to production, the focus switches...

