You're reading from Machine Learning Engineering on AWS

Product type: Book
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781803247595
Edition: 1st Edition

Author: Joshua Arvin Lat

Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO of three Australian-owned companies and as director of software development and engineering for multiple e-commerce start-ups. Years ago, he and his team won first place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and has shared his knowledge at several international conferences, discussing practical strategies for machine learning, engineering, security, and management.

SageMaker Training and Debugging Solutions

In Chapter 2, Deep Learning AMIs, and Chapter 3, Deep Learning Containers, we performed our initial ML training experiments inside EC2 instances. We took note of the cost per hour of running these EC2 instances as there are some cases where we would need to use the more expensive instance types (such as the p2.8xlarge instance at approximately $7.20 per hour) to run our ML training jobs and workloads. To manage and reduce the overall cost of running ML workloads using these EC2 instances, we discussed a few cost optimization strategies, including manually turning off these instances after the training job has finished.
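To put these rates in perspective, a quick back-of-the-envelope calculation (using the approximate $7.20/hour figure quoted previously; actual pricing varies by region and changes over time) shows how quickly costs accumulate when an instance is left running:

```python
# Estimate the cost of keeping an on-demand EC2 instance running.
# The $7.20/hour figure for p2.8xlarge is the approximation used in this
# chapter; check the AWS pricing page for your region's current rate.

def training_cost(hourly_rate: float, hours: float) -> float:
    """Return the cost (in USD) of running an instance for `hours`."""
    return round(hourly_rate * hours, 2)

# A 6-hour training job on a p2.8xlarge:
print(training_cost(7.20, 6))   # 43.2
# Forgetting to stop the instance for a full day instead:
print(training_cost(7.20, 24))  # 172.8
```

The gap between those two numbers is exactly why automating instance shutdown matters.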

At this point, you might be wondering if it is possible to automate the following processes:

  • Launching the EC2 instances that will run the ML training jobs
  • Uploading the model artifacts of the trained ML model to a storage location (such as an S3 bucket) after model training
  • Deleting the EC2 instances once...

Technical requirements

Before we start, we must have the following ready:

  • A web browser (preferably Chrome or Firefox)
  • Access to the AWS account that was used in the first few chapters of this book

The Jupyter notebooks, source code, and other files used for each chapter are available in this book’s GitHub repository: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS.

Important Note

It is recommended to use an IAM user with limited permissions instead of the root account when running the examples in this book. We will discuss this, along with other security best practices, in detail in Chapter 9, Security, Governance, and Compliance Strategies. If you are just starting to use AWS, you may proceed with using the root account in the meantime.

Getting started with the SageMaker Python SDK

The SageMaker Python SDK is a library that allows ML practitioners to train and deploy ML models using the different features and capabilities of SageMaker. It provides several high-level abstractions such as Estimators, Models, Predictors, Sessions, Transformers, and Processors, all of which encapsulate and map to specific ML processes and entities. These abstractions allow data scientists and ML engineers to manage ML experiments and deployments with just a few lines of code. At the same time, infrastructure management is handled by SageMaker already, so all we need to do is configure these high-level abstractions with the correct set of parameters.

Note that it is also possible to use the different capabilities and features of SageMaker using the boto3 library. Compared to using the SageMaker Python SDK, we would be working with significantly more lines of code with boto3 since we would have to take care of the little details when...
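To make the comparison concrete, here is a minimal sketch of the SDK's Estimator abstraction. The role ARN, S3 paths, and hyperparameter values are placeholders for illustration, not the chapter's exact configuration:

```python
# Sketch: the SageMaker Python SDK's Estimator abstraction.
# Role ARN, S3 paths, and hyperparameter values are placeholders.

# Illustrative hyperparameters for the built-in Image Classification Algorithm.
HYPERPARAMETERS = {
    "num_classes": 10,
    "num_training_samples": 60000,  # e.g., the MNIST training set
    "epochs": 3,
}

def build_estimator(image_uri: str, role_arn: str, output_path: str):
    # Imported lazily so this module loads even without the SDK installed.
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri=image_uri,          # container image of the training algorithm
        role=role_arn,                # IAM role assumed by the training job
        instance_count=1,
        instance_type="ml.p2.xlarge",
        output_path=output_path,      # where model.tar.gz is uploaded after training
    )
    estimator.set_hyperparameters(**HYPERPARAMETERS)
    return estimator

# Usage (placeholder values):
#   estimator = build_estimator(image_uri, "arn:aws:iam::...:role/...",
#                               "s3://<bucket>/output/")
#   estimator.fit({"train": "s3://<bucket>/train/"})
```

Behind that handful of lines, SageMaker provisions the instance, runs the container, uploads the model artifacts, and terminates the instance, which is exactly the automation we were wishing for earlier.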

Preparing the essential prerequisites

In this section, we will ensure that the following prerequisites are ready before proceeding with the hands-on solutions of this chapter:

  • We have a service limit increase to run SageMaker training jobs using the ml.p2.xlarge instance (SageMaker Training)
  • We have a service limit increase to run SageMaker training jobs using the ml.p2.xlarge instance (SageMaker Managed Spot Training)

If you are wondering why we are using ml.p2.xlarge instances in this chapter, that’s because we are required to use one of the supported instance types for the Image Classification Algorithm, as shown in the following screenshot:

Figure 6.2 – EC2 Instance Recommendation for the image classification algorithm

As we can see, we can use ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, and ml.p3.16xlarge (at the time of writing) when running training jobs using the Image Classification Algorithm...
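Since a job launched with an unsupported instance type fails only after the request has been submitted, it can help to validate the type up front. Here is a minimal sketch based on the list above:

```python
# GPU instance types listed as supported for the built-in Image Classification
# Algorithm at the time of writing (see Figure 6.2).
SUPPORTED_TRAINING_INSTANCES = (
    "ml.p2.xlarge", "ml.p2.8xlarge", "ml.p2.16xlarge",
    "ml.p3.2xlarge", "ml.p3.8xlarge", "ml.p3.16xlarge",
)

def check_instance_type(instance_type: str) -> str:
    """Fail fast before launching a training job with an unsupported type."""
    if instance_type not in SUPPORTED_TRAINING_INSTANCES:
        raise ValueError(
            f"{instance_type} is not supported by the Image Classification Algorithm"
        )
    return instance_type

print(check_instance_type("ml.p2.xlarge"))  # ml.p2.xlarge
```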

Training an image classification model with the SageMaker Python SDK

As mentioned in the Getting started with the SageMaker Python SDK section, we can use built-in algorithms or custom algorithms (using scripts and custom Docker container images) when performing training experiments in SageMaker.

Data scientists and ML practitioners can quickly get started with training and deploying models in SageMaker using one or more of the built-in algorithms prepared by the AWS team. There is a variety of built-in algorithms to choose from, each designed to help ML practitioners solve specific business and ML problems. Here are some of the available built-in algorithms, along with the use cases and problems they can solve:

  • DeepAR Forecasting: Time-series forecasting
  • Principal Component Analysis: Dimensionality reduction
  • IP Insights: IP anomaly detection
  • Latent Dirichlet Allocation (LDA): Topic modeling
  • Sequence-to-Sequence...
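Each built-in algorithm is consumed the same way: resolve its container image and hand the URI to an Estimator. The registry identifiers below are the ones I believe `sagemaker.image_uris.retrieve` accepts for these algorithms; verify them against the SageMaker Python SDK documentation:

```python
# Registry identifiers accepted (to the best of my knowledge) by
# sagemaker.image_uris.retrieve() for some built-in algorithms.
BUILT_IN_ALGORITHMS = {
    "DeepAR Forecasting": "forecasting-deepar",
    "Principal Component Analysis": "pca",
    "IP Insights": "ipinsights",
    "Latent Dirichlet Allocation": "lda",
    "Sequence-to-Sequence": "seq2seq",
    "Image Classification": "image-classification",
}

def algorithm_image(name: str, region: str) -> str:
    """Resolve the container image URI for a built-in algorithm by friendly name."""
    from sagemaker import image_uris  # lazy import; requires the SageMaker SDK
    return image_uris.retrieve(framework=BUILT_IN_ALGORITHMS[name], region=region)

# Usage (requires the SDK and a valid region):
#   uri = algorithm_image("Image Classification", "us-east-1")
```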

Using the Debugger Insights Dashboard

When working on ML requirements, ML practitioners may encounter a variety of issues before coming up with a high-performing ML model. Like software development and programming, building ML models requires a bit of trial and error. Developers generally make use of a variety of debugging tools to help them troubleshoot issues and implementation errors when writing software applications. Similarly, ML practitioners need a way to monitor and debug training jobs when building ML models. Luckily for us, Amazon SageMaker has a capability called SageMaker Debugger that allows us to troubleshoot different issues and bottlenecks when training ML models:

Figure 6.24 – SageMaker Debugger features

The preceding diagram shows the features that are available when we use SageMaker Debugger to monitor, debug, and troubleshoot a variety of issues that affect an ML model’s performance. This includes the data capture capability...
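As a sketch of how these features are switched on from the SDK: the rule and profiler settings below are examples of what Debugger offers, not the chapter's exact configuration:

```python
# Sketch: attaching a SageMaker Debugger rule and system profiling to a
# training job. Illustrative configuration, not the chapter's exact setup.

SYSTEM_MONITOR_INTERVAL_MS = 500  # sample CPU/GPU utilization every 500 ms

def debugger_settings():
    # Lazy imports so this module loads without the SageMaker SDK installed.
    from sagemaker.debugger import ProfilerConfig, Rule, rule_configs

    rules = [
        # Flags a training job whose loss stops decreasing.
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ]
    profiler = ProfilerConfig(
        system_monitor_interval_millis=SYSTEM_MONITOR_INTERVAL_MS,
    )
    return rules, profiler

# These would be passed to an Estimator:
#   Estimator(..., rules=rules, profiler_config=profiler)
# The collected metrics then surface in Studio's Debugger Insights Dashboard.
```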

Utilizing Managed Spot Training and Checkpoints

Now that we have a better understanding of how to use the SageMaker Python SDK to train and deploy ML models, let’s proceed with using a few additional options that allow us to reduce costs significantly when running training jobs. In this section, we will utilize the following SageMaker features and capabilities when training a second Image Classification model:

  • Managed Spot Training
  • Checkpointing
  • Incremental Training

In Chapter 2, Deep Learning AMIs, we mentioned that spot instances can be used to reduce the cost of running training jobs. Using spot instances instead of on-demand instances can reduce the overall cost by roughly 70% to 90%. So, why are spot instances cheaper? The downside is that spot instances can be interrupted, which restarts the training job from the beginning. If we were to train our models outside of SageMaker, we would have to prepare our own set of custom...
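A sketch of how Managed Spot Training and checkpointing are enabled through the Estimator follows; the bucket name and timeout values are placeholders:

```python
# Sketch: enabling Managed Spot Training with checkpointing on an Estimator.
# Bucket name and timeout values are placeholders.

MAX_RUN_SECONDS = 3600    # hard limit on actual training time
MAX_WAIT_SECONDS = 7200   # total time, including waiting for spot capacity

def build_spot_estimator(image_uri: str, role_arn: str, bucket: str):
    from sagemaker.estimator import Estimator  # lazy import

    return Estimator(
        image_uri=image_uri,
        role=role_arn,
        instance_count=1,
        instance_type="ml.p2.xlarge",
        use_spot_instances=True,       # request cheaper spot capacity
        max_run=MAX_RUN_SECONDS,
        max_wait=MAX_WAIT_SECONDS,     # must be >= max_run
        # If the spot instance is interrupted, training resumes from the
        # latest checkpoint here instead of starting over.
        checkpoint_s3_uri=f"s3://{bucket}/checkpoints/",
    )

def spot_savings(on_demand_seconds: float, billable_seconds: float) -> float:
    """Rough savings percentage, in the spirit of SageMaker's job summary."""
    return round(100 * (1 - billable_seconds / on_demand_seconds), 1)

# e.g., 1000 seconds of training billed as only 300 spot seconds:
print(spot_savings(1000, 300))  # 70.0
```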

Cleaning up

Follow these steps to locate and turn off any remaining running instances in SageMaker Studio:

  1. Click the Running Instances and Kernels icon in the sidebar of Amazon SageMaker Studio, as highlighted in the following screenshot:

Figure 6.34 – Turning off any remaining running instances

Clicking the Running Instances and Kernels icon should open and show the running instances, apps, and terminals in SageMaker Studio.

  2. Turn off any remaining running instances under RUNNING INSTANCES by clicking the Shutdown button for each of the instances, as highlighted in the preceding screenshot. Clicking the Shutdown button will open a pop-up window verifying the instance shutdown operation. Click the Shut down all button to proceed.

Note that this cleanup operation needs to be performed after using SageMaker Studio. These resources are not turned off automatically by SageMaker, even during periods of inactivity. Turning off unused...
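If you prefer to double-check from code, the boto3 SageMaker client can list any apps still running in your Studio domain. The domain ID below is a placeholder, and actually calling AWS requires valid credentials:

```python
# Sketch: listing SageMaker Studio apps that are still running, using boto3.
# Useful as a double check after shutting instances down in the Studio UI.

def still_running(apps: list) -> list:
    """Keep only apps that are not already deleted (and so may incur charges)."""
    return [app for app in apps if app["Status"] != "Deleted"]

def list_running_apps(domain_id: str) -> list:
    import boto3  # lazy import; calling this requires AWS credentials

    client = boto3.client("sagemaker")
    return still_running(client.list_apps(DomainIdEquals=domain_id)["Apps"])

# Usage (placeholder domain ID):
#   running = list_running_apps("d-xxxxxxxxxxxx")
```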

Summary

In this chapter, we trained and deployed ML models using the SageMaker Python SDK. We started by using the MNIST dataset (training dataset) and SageMaker’s built-in Image Classification Algorithm to train an image classifier model. After that, we took a closer look at the resources used during the training step by using the Debugger Insights Dashboard available in SageMaker Studio. Finally, we performed a second training experiment that made use of several features and options available in SageMaker, such as managed spot training, checkpointing, and incremental training.

In the next chapter, we will dive deeper into the different deployment options and strategies when performing model deployments using SageMaker. We will be deploying a pre-trained model into a variety of inference endpoint types, including the real-time, serverless, and asynchronous inference endpoints.

Further reading

For more information on the topics that were covered in this chapter, feel free to check out the following resources:

