You're reading from Machine Learning Engineering on AWS

Product type: Book
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781803247595
Edition: 1st Edition
Author: Joshua Arvin Lat

Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO of three Australian-owned companies, as well as director of software development and engineering for multiple e-commerce start-ups. Years ago, he and his team won first place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and has shared his knowledge at several international conferences, discussing practical strategies on machine learning, engineering, security, and management.

Pragmatic Data Processing and Analysis

Data needs to be analyzed, transformed, and processed before it can be used to train machine learning (ML) models. In the past, data scientists and ML practitioners had to write custom code from scratch using a variety of libraries, frameworks, and tools (such as pandas and PySpark) to perform the needed analysis and processing work. The custom code prepared by these professionals often needed tweaking, since different variations of the steps programmed into the data processing scripts had to be tested on the data before it could be used for model training. This takes up a significant portion of an ML practitioner’s time, and since it is a manual process, it is usually error-prone as well.
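As a hypothetical illustration of the kind of custom cleaning code described above, a minimal pandas script might look like the following. The dataset and column names here are invented for the example; real scripts would load actual booking records:

```python
import pandas as pd

# Tiny, invented "dirty" dataset standing in for real booking records.
df = pd.DataFrame({
    "adults": [2, 1, -1, 3],           # -1 is an invalid value
    "children": [0.0, 1.0, None, 2.0]  # missing value to fill
})

# Typical manual cleanup steps: drop rows with invalid values,
# then fill missing values and fix the column type.
df = df[df["adults"] >= 0].copy()
df["children"] = df["children"].fillna(0).astype(int)
print(df)
```

Every one of these decisions (which rows count as invalid, what to fill missing values with) typically has to be revisited and re-run several times, which is exactly the manual iteration the chapter describes.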

One of the more practical ways to process and analyze data involves using no-code or low-code tools to load, clean, analyze, and transform raw data from different data sources. Using these types of tools will significantly speed...

Technical requirements

Before we start, it is important that we have the following ready:

  • A web browser (preferably Chrome or Firefox)
  • Access to the AWS account used in the first four chapters of the book

The Jupyter notebooks, source code, and other files used for each chapter are available in this repository: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS.

Important Note

Make sure to sign out and NOT use the IAM user created in Chapter 4, Serverless Data Management on AWS. In this chapter, you should use the root account or a new IAM user with a set of permissions to create and manage the AWS Glue DataBrew, Amazon S3, AWS CloudShell, and Amazon SageMaker resources. It is recommended to use an IAM user with limited permissions instead of the root account when running the examples in this book. We will discuss this along with other security best practices in further detail in Chapter 9, Security, Governance, and Compliance Strategies...

Getting started with data processing and analysis

In the previous chapter, we utilized a data warehouse and a data lake to store, manage, and query our data. Data stored in these data sources generally must undergo a series of data processing and data transformation steps similar to those shown in Figure 5.1 before it can be used as a training dataset for ML experiments:

Figure 5.1 – Data processing and analysis

In Figure 5.1, we can see that these data processing steps may involve merging different datasets, along with cleaning, converting, analyzing, and transforming the data using a variety of options and techniques. In practice, data scientists and ML engineers generally spend many hours cleaning data and getting it ready for use in ML experiments. Some professionals may be used to writing and running custom Python or R scripts to perform this work. However, it may be more practical to use no-code or low-code solutions such as AWS Glue DataBrew...
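The "merge different datasets" step in Figure 5.1, for instance, often amounts to a join on a shared key. A short pandas sketch (the tables and column names are invented for illustration):

```python
import pandas as pd

# Two invented tables standing in for separate data sources.
bookings = pd.DataFrame({"booking_id": [1, 2, 3], "hotel_id": [10, 10, 20]})
hotels = pd.DataFrame({"hotel_id": [10, 20], "country": ["PH", "AU"]})

# Merge on the shared key before further cleaning and analysis.
merged = bookings.merge(hotels, on="hotel_id", how="left")
print(merged)
```

A left join keeps every booking even when the lookup table has no match, which is usually the safer default when enriching a primary dataset.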

Preparing the essential prerequisites

In this section, we will ensure that the following prerequisites are ready before proceeding with the hands-on solutions of this chapter:

  • The Parquet file to be analyzed and processed
  • The S3 bucket where the Parquet file will be uploaded

Downloading the Parquet file

In this chapter, we will work with a bookings dataset similar to the one used in previous chapters. This time, however, the source data is stored in a Parquet file, and we have modified some of the rows so that the dataset contains dirty data. That said, let’s download the synthetic.bookings.dirty.parquet file onto our local machine.

You can find it here: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS/raw/main/chapter05/synthetic.bookings.dirty.parquet.
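If you prefer to fetch the file from a script rather than the browser, a sketch using the standard library and pandas might look like this (reading Parquet requires a Parquet engine such as pyarrow or fastparquet to be installed):

```python
import urllib.request

import pandas as pd

# Download the chapter's Parquet file from the book's repository
# (the same URL given above) and take a first look at it.
url = (
    "https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS"
    "/raw/main/chapter05/synthetic.bookings.dirty.parquet"
)
urllib.request.urlretrieve(url, "synthetic.bookings.dirty.parquet")

df = pd.read_parquet("synthetic.bookings.dirty.parquet")
print(df.shape)
```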

Note

Note that storing data using the Parquet format is preferable to storing data using the CSV format. Once you need to work with much larger datasets, the difference...

Automating data preparation and analysis with AWS Glue DataBrew

AWS Glue DataBrew is a no-code data preparation service built to help data scientists and ML engineers clean, prepare, and transform data. Similar to the services we used in Chapter 4, Serverless Data Management on AWS, Glue DataBrew is serverless as well. This means that we won’t need to worry about infrastructure management when using this service to perform data preparation, transformation, and analysis.

Figure 5.2 – The core concepts in AWS Glue DataBrew

In Figure 5.2, we can see that there are different concepts and resources involved when using AWS Glue DataBrew. We need to have a good idea of what these are before using the service. Here is a quick overview of the concepts and terms used:

  • Dataset – Data stored in an existing data source (for example, Amazon S3, Amazon Redshift, or Amazon RDS) or uploaded from the local machine to an S3 bucket.
  • Recipe –...
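To make the recipe concept more concrete, the following is an illustrative sketch of the general shape of DataBrew recipe steps as JSON: an ordered list of actions, each with an operation name and parameters. The specific operation names and column names below are assumptions for illustration, not an authoritative list of DataBrew operations:

```python
import json

# Illustrative shape of DataBrew recipe steps. The operation and
# column names here are assumptions for illustration only.
recipe_steps = [
    {
        "Action": {
            "Operation": "REMOVE_VALUES",
            "Parameters": {"sourceColumn": "children"},
        }
    },
    {
        "Action": {
            "Operation": "DELETE_DUPLICATE_ROWS",
            "Parameters": {},
        }
    },
]
print(json.dumps(recipe_steps, indent=2))
```

The point is that a recipe is declarative: an ordered list of transformation steps that can be versioned, reviewed, and re-applied to new data, instead of ad hoc script code.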

Preparing ML data with Amazon SageMaker Data Wrangler

Amazon SageMaker has many capabilities and features to assist data scientists and ML engineers with different ML requirements. One SageMaker capability focused on accelerating data preparation and data analysis is SageMaker Data Wrangler:

Figure 5.18 – The primary functionalities available in SageMaker Data Wrangler

In Figure 5.18, we can see what we can do with our data when using SageMaker Data Wrangler:

  1. First, we can import data from a variety of data sources such as Amazon S3, Amazon Athena, and Amazon Redshift.
  2. Next, we can create a data flow and transform the data using a variety of data formatting and data transformation options. We can also analyze and visualize the data using both inbuilt and custom options in just a few clicks.
  3. Finally, we can automate the data preparation workflows by exporting one or more of the transformations configured in the...
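Conceptually, an exported Data Wrangler flow boils down to a reusable transformation function that can be run repeatedly, for example inside a processing job. A hedged pandas sketch of that idea (the column names and the tiny in-memory dataset are invented; a real flow would import from S3, Athena, or Redshift):

```python
import pandas as pd

# A reusable transformation function standing in for an exported flow.
def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with invalid values, then derive a new column.
    df = df[df["adults"] >= 0].copy()
    df["total_guests"] = df["adults"] + df["babies"]
    return df

# Invented raw data standing in for an imported dataset.
raw = pd.DataFrame({"adults": [1, -1, 2], "babies": [0, 0, 1]})
prepared = transform(raw)
print(prepared)
```

Packaging the steps as a function is what makes the workflow automatable: the same transformations can be applied to tomorrow's data without anyone re-clicking through a UI.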

Summary

Data needs to be cleaned, analyzed, and prepared before it is used to train ML models. Since it takes time and effort to work on these types of requirements, it is recommended to use no-code or low-code solutions such as AWS Glue DataBrew and Amazon SageMaker Data Wrangler when analyzing and processing our data. In this chapter, we were able to use these two services to analyze and process our sample dataset. Starting with a sample “dirty” dataset, we performed a variety of transformations and operations, which included (1) profiling and analyzing the data, (2) filtering out rows containing invalid data, (3) creating a new column from an existing one, (4) exporting the results into an output location, and (5) verifying whether the transformations have been applied to the output file.
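The five operations listed above can be sketched end to end in plain pandas as well, which is useful for seeing what the no-code tools are doing under the hood. The tiny dataset and column names here are invented for the example:

```python
import pandas as pd

# Invented "dirty" data standing in for the chapter's dataset.
df = pd.DataFrame({"adults": [2, -1, 1], "children": [0, 5, 1]})

# (1) Profile and analyze the data.
print(df.describe())

# (2) Filter out rows containing invalid data.
clean = df[df["adults"] >= 0].copy()

# (3) Create a new column from an existing one.
clean["has_children"] = (clean["children"] > 0).astype(int)

# (4) Export the results into an output location.
clean.to_csv("output.csv", index=False)

# (5) Verify that the transformations were applied to the output file.
out = pd.read_csv("output.csv")
assert (out["adults"] >= 0).all()
```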

In the next chapter, we will take a closer look at Amazon SageMaker and we will dive deeper into how we can use this managed service when performing machine learning experiments...

Further reading

For more information on the topics covered in this chapter, feel free to check out the following resources:
