You're reading from Machine Learning Engineering on AWS

Product type: Book
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781803247595
Edition: 1st Edition
Author: Joshua Arvin Lat

Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO of three Australian-owned companies, as well as director of software development and engineering for multiple e-commerce start-ups. Years ago, he and his team won first place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and has shared his knowledge at several international conferences, discussing practical strategies on machine learning, engineering, security, and management.

Pragmatic Data Processing and Analysis

Data needs to be analyzed, transformed, and processed before it can be used to train machine learning (ML) models. In the past, data scientists and ML practitioners had to write custom code from scratch using a variety of libraries, frameworks, and tools (such as pandas and PySpark) to perform the needed analysis and processing work. The custom code prepared by these professionals often needed tweaking, since different variations of the steps programmed into the data processing scripts had to be tested on the data before it could be used for model training. This takes up a significant portion of an ML practitioner’s time, and since it is a manual process, it is usually error-prone as well.
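As a hypothetical illustration of the kind of custom cleaning code described above, a minimal pandas script might look like the following. The dataset and column names here are invented for the example; real scripts would load actual booking records:

```python
import pandas as pd

# Tiny, invented "dirty" dataset standing in for real booking records.
df = pd.DataFrame({
    "adults": [2, 1, -1, 3],           # -1 is an invalid value
    "children": [0.0, 1.0, None, 2.0]  # missing value to fill
})

# Typical manual cleanup steps: drop rows with invalid values,
# then fill missing values and fix the column type.
df = df[df["adults"] >= 0].copy()
df["children"] = df["children"].fillna(0).astype(int)
print(df)
```

Every one of these decisions (which rows count as invalid, what to fill missing values with) typically has to be revisited and re-run several times, which is exactly the manual iteration the chapter describes.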

One of the more practical ways to process and analyze data involves using no-code or low-code tools to load, clean, analyze, and transform raw data from different data sources. Using these types of tools will significantly speed...

Technical requirements

Before we start, it is important that we have the following ready:

  • A web browser (preferably Chrome or Firefox)
  • Access to the AWS account used in the first four chapters of the book

The Jupyter notebooks, source code, and other files used for each chapter are available in this repository: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS.

Important Note

Make sure to sign out and NOT use the IAM user created in Chapter 4, Serverless Data Management on AWS. In this chapter, you should use the root account or a new IAM user with a set of permissions to create and manage the AWS Glue DataBrew, Amazon S3, AWS CloudShell, and Amazon SageMaker resources. It is recommended to use an IAM user with limited permissions instead of the root account when running the examples in this book. We will discuss this along with other security best practices in further detail in Chapter 9, Security, Governance, and Compliance Strategies...

Getting started with data processing and analysis

In the previous chapter, we utilized a data warehouse and a data lake to store, manage, and query our data. Data stored in these data sources generally must undergo a series of data processing and data transformation steps similar to those shown in Figure 5.1 before it can be used as a training dataset for ML experiments:

Figure 5.1 – Data processing and analysis

In Figure 5.1, we can see that these data processing steps may involve merging different datasets, along with cleaning, converting, analyzing, and transforming the data using a variety of options and techniques. In practice, data scientists and ML engineers generally spend many hours cleaning data and getting it ready for use in ML experiments. Some professionals may be used to writing and running custom Python or R scripts to perform this work. However, it may be more practical to use no-code or low-code solutions such as AWS Glue DataBrew...
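The "merge different datasets" step in Figure 5.1, for instance, often amounts to a join on a shared key. A short pandas sketch (the tables and column names are invented for illustration):

```python
import pandas as pd

# Two invented tables standing in for separate data sources.
bookings = pd.DataFrame({"booking_id": [1, 2, 3], "hotel_id": [10, 10, 20]})
hotels = pd.DataFrame({"hotel_id": [10, 20], "country": ["PH", "AU"]})

# Merge on the shared key before further cleaning and analysis.
merged = bookings.merge(hotels, on="hotel_id", how="left")
print(merged)
```

A left join keeps every booking even when the lookup table has no match, which is usually the safer default when enriching a primary dataset.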

Preparing the essential prerequisites

In this section, we will ensure that the following prerequisites are ready before proceeding with the hands-on solutions of this chapter:

  • The Parquet file to be analyzed and processed
  • The S3 bucket where the Parquet file will be uploaded

Downloading the Parquet file

In this chapter, we will work with a bookings dataset similar to the one used in previous chapters. This time, however, the source data is stored in a Parquet file, and we have modified some of the rows so that the dataset contains dirty data. That said, let’s download the synthetic.bookings.dirty.parquet file onto our local machine.

You can find it here: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS/raw/main/chapter05/synthetic.bookings.dirty.parquet.
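If you prefer to fetch the file from a script rather than the browser, a sketch using the standard library and pandas might look like this (reading Parquet requires a Parquet engine such as pyarrow or fastparquet to be installed):

```python
import urllib.request

import pandas as pd

# Download the chapter's Parquet file from the book's repository
# (the same URL given above) and take a first look at it.
url = (
    "https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS"
    "/raw/main/chapter05/synthetic.bookings.dirty.parquet"
)
urllib.request.urlretrieve(url, "synthetic.bookings.dirty.parquet")

df = pd.read_parquet("synthetic.bookings.dirty.parquet")
print(df.shape)
```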

Note

Note that storing data using the Parquet format is preferable to storing data using the CSV format. Once you need to work with much larger datasets, the difference...

Automating data preparation and analysis with AWS Glue DataBrew

AWS Glue DataBrew is a no-code data preparation service built to help data scientists and ML engineers clean, prepare, and transform data. Similar to the services we used in Chapter 4, Serverless Data Management on AWS, Glue DataBrew is serverless as well. This means that we won’t need to worry about infrastructure management when using this service to perform data preparation, transformation, and analysis.

Figure 5.2 – The core concepts in AWS Glue DataBrew

In Figure 5.2, we can see that there are different concepts and resources involved when using AWS Glue DataBrew. We need to have a good idea of what these are before using the service. Here is a quick overview of the concepts and terms used:

  • Dataset – Data stored in an existing data source (for example, Amazon S3, Amazon Redshift, or Amazon RDS) or uploaded from the local machine to an S3 bucket.
  • Recipe –...
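To make the recipe concept more concrete, the following is an illustrative sketch of the general shape of DataBrew recipe steps as JSON: an ordered list of actions, each with an operation name and parameters. The specific operation names and column names below are assumptions for illustration, not an authoritative list of DataBrew operations:

```python
import json

# Illustrative shape of DataBrew recipe steps. The operation and
# column names here are assumptions for illustration only.
recipe_steps = [
    {
        "Action": {
            "Operation": "REMOVE_VALUES",
            "Parameters": {"sourceColumn": "children"},
        }
    },
    {
        "Action": {
            "Operation": "DELETE_DUPLICATE_ROWS",
            "Parameters": {},
        }
    },
]
print(json.dumps(recipe_steps, indent=2))
```

The point is that a recipe is declarative: an ordered list of transformation steps that can be versioned, reviewed, and re-applied to new data, instead of ad hoc script code.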

Preparing ML data with Amazon SageMaker Data Wrangler

Amazon SageMaker has many capabilities and features to assist data scientists and ML engineers with different ML requirements. One SageMaker capability focused on accelerating data preparation and data analysis is SageMaker Data Wrangler:

Figure 5.18 – The primary functionalities available in SageMaker Data Wrangler

In Figure 5.18, we can see what we can do with our data when using SageMaker Data Wrangler:

  1. First, we can import data from a variety of data sources such as Amazon S3, Amazon Athena, and Amazon Redshift.
  2. Next, we can create a data flow and transform the data using a variety of data formatting and data transformation options. We can also analyze and visualize the data using both inbuilt and custom options in just a few clicks.
  3. Finally, we can automate the data preparation workflows by exporting one or more of the transformations configured in the...
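Conceptually, an exported Data Wrangler flow boils down to a reusable transformation function that can be run repeatedly, for example inside a processing job. A hedged pandas sketch of that idea (the column names and the tiny in-memory dataset are invented; a real flow would import from S3, Athena, or Redshift):

```python
import pandas as pd

# A reusable transformation function standing in for an exported flow.
def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with invalid values, then derive a new column.
    df = df[df["adults"] >= 0].copy()
    df["total_guests"] = df["adults"] + df["babies"]
    return df

# Invented raw data standing in for an imported dataset.
raw = pd.DataFrame({"adults": [1, -1, 2], "babies": [0, 0, 1]})
prepared = transform(raw)
print(prepared)
```

Packaging the steps as a function is what makes the workflow automatable: the same transformations can be applied to tomorrow's data without anyone re-clicking through a UI.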

Summary

Data needs to be cleaned, analyzed, and prepared before it is used to train ML models. Since it takes time and effort to work on these types of requirements, it is recommended to use no-code or low-code solutions such as AWS Glue DataBrew and Amazon SageMaker Data Wrangler when analyzing and processing our data. In this chapter, we were able to use these two services to analyze and process our sample dataset. Starting with a sample “dirty” dataset, we performed a variety of transformations and operations, which included (1) profiling and analyzing the data, (2) filtering out rows containing invalid data, (3) creating a new column from an existing one, (4) exporting the results into an output location, and (5) verifying whether the transformations have been applied to the output file.
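The five operations listed above can be sketched end to end in plain pandas as well, which is useful for seeing what the no-code tools are doing under the hood. The tiny dataset and column names here are invented for the example:

```python
import pandas as pd

# Invented "dirty" data standing in for the chapter's dataset.
df = pd.DataFrame({"adults": [2, -1, 1], "children": [0, 5, 1]})

# (1) Profile and analyze the data.
print(df.describe())

# (2) Filter out rows containing invalid data.
clean = df[df["adults"] >= 0].copy()

# (3) Create a new column from an existing one.
clean["has_children"] = (clean["children"] > 0).astype(int)

# (4) Export the results into an output location.
clean.to_csv("output.csv", index=False)

# (5) Verify that the transformations were applied to the output file.
out = pd.read_csv("output.csv")
assert (out["adults"] >= 0).all()
```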

In the next chapter, we will take a closer look at Amazon SageMaker and we will dive deeper into how we can use this managed service when performing machine learning experiments...

Further reading

For more information on the topics covered in this chapter, feel free to check out the following resources:
