SageMaker Deployment Solutions

After training our machine learning (ML) model, we can proceed with deploying it to a web API. This API can then be invoked by other applications (for example, a mobile application) to perform a “prediction” or inference. For example, the ML model we trained in Chapter 1, Introduction to ML Engineering on AWS, can be deployed to a web API and used to predict the likelihood of customers canceling their reservations, given a set of inputs. Deploying the ML model to a web API makes it accessible to different applications and systems.

A few years ago, ML practitioners had to spend time building a custom backend API to host and deploy a model from scratch. If you were given this requirement, you might have used a Python framework such as Flask, Pyramid, or Django to deploy the ML model. Building a custom API to serve as an inference endpoint can take about a week or so since most of the application logic needs...

Technical requirements

Before we start, it is important to have the following ready:

  • A web browser (preferably Chrome or Firefox)
  • Access to the AWS account and SageMaker Studio domain used in the first chapter of the book

The Jupyter notebooks, source code, and other files used for each chapter are available in this repository: https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS.

Important Note

It is recommended to use an IAM user with limited permissions instead of the root account when running the examples in this book. We will discuss this, along with other security best practices, in detail in Chapter 9, Security, Governance, and Compliance Strategies. If you are just getting started with AWS, you may use the root account in the meantime.

Getting started with model deployments in SageMaker

In Chapter 6, SageMaker Training and Debugging Solutions, we trained and deployed an image classification model using the SageMaker Python SDK. We made use of a built-in algorithm while working on the hands-on solutions in that chapter. When using a built-in algorithm, we just need to prepare the training dataset and specify a few configuration parameters, and we are good to go! If we want to train a custom model using our favorite ML framework (such as TensorFlow or PyTorch), we can prepare our own scripts and make them work in SageMaker using script mode. This gives us more flexibility since a custom script lets us control how SageMaker interfaces with our model and use different libraries and frameworks when training it. If we want the highest level of flexibility over the environment where the training scripts will run, then we can opt to use our own custom container image...
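
To illustrate the script mode approach, the following is a minimal sketch of how a custom training script could be handed to a SageMaker PyTorch estimator. The script name, instance type, S3 path, and framework versions here are placeholder assumptions and not the exact values used in this book:

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

# Minimal script mode sketch (illustrative values only)
estimator = PyTorch(
    entry_point="train.py",        # hypothetical custom training script
    source_dir="scripts",          # directory containing the script and its dependencies
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.10",
    py_version="py38",
)

estimator.fit({"train": "s3://<INSERT BUCKET NAME>/path/to/training-data"})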

Preparing the pre-trained model artifacts

In Chapter 6, SageMaker Training and Debugging Solutions, we created a new folder named CH06 along with a new Notebook (using the Data Science image) inside it. In this section, we will create a new folder named CH07 along with a new Notebook inside it. Instead of the Data Science image, we will use the PyTorch 1.10 Python 3.8 CPU Optimized image for this Notebook since we will download the model artifacts of a pre-trained PyTorch model using the Hugging Face transformers library. Once the Notebook is ready, we will use the transformers library to download a pre-trained model that can be used for sentiment analysis. Finally, we will zip the model artifacts into a model.tar.gz file and upload it to an S3 bucket. A rough sketch of these steps appears after the following note.

Note

Make sure that you have completed the hands-on solutions in the Getting started with SageMaker and SageMaker Studio section of Chapter 1, Introduction to ML Engineering...
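
The following is a minimal sketch of the preparation steps described above. The model name (distilbert-base-uncased-finetuned-sst-2-english), local paths, and S3 key prefix are assumptions made for illustration and may differ from the exact values used in the notebook:

import tarfile
import sagemaker
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Download a pre-trained sentiment analysis model (assumed model name)
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Save the model and tokenizer artifacts into a local directory
tokenizer.save_pretrained("model")
model.save_pretrained("model")

# Package the artifacts into model.tar.gz
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model", arcname=".")

# Upload the archive to the default S3 bucket of the SageMaker session
session = sagemaker.Session()
model_data = session.upload_data("model.tar.gz", key_prefix="chapter07/model")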

Preparing the SageMaker script mode prerequisites

In this chapter, we will be preparing a custom script to use a pre-trained model for predictions. Before we can proceed with using the SageMaker Python SDK to deploy our pre-trained model to an inference endpoint, we’ll need to ensure that all the script mode prerequisites are ready.

Figure 7.4 – The desired file and folder structure

In Figure 7.4, we can see that there are three prerequisites we’ll need to prepare:

  • inference.py
  • requirements.txt
  • setup.py

We will store these prerequisites inside the scripts directory. We’ll discuss these prerequisites in detail in the succeeding pages of this chapter. Without further ado, let’s start by preparing the inference.py script file!

Preparing the inference.py file

In this section, we will prepare a custom Python script that will be used by SageMaker when processing inference requests. Here, we can influence...
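
As a rough illustration, an inference.py script for the SageMaker PyTorch serving container typically defines handler functions such as model_fn, input_fn, predict_fn, and output_fn. The sketch below assumes the pre-trained Hugging Face sentiment analysis model prepared earlier and is not necessarily identical to the script in the book's repository:

import json
from transformers import pipeline

def model_fn(model_dir):
    # Load the extracted model artifacts into a sentiment analysis pipeline
    return pipeline("sentiment-analysis", model=model_dir, tokenizer=model_dir)

def input_fn(request_body, content_type):
    # Accept JSON payloads such as {"text": "..."}
    if content_type == "application/json":
        return json.loads(request_body)["text"]
    raise ValueError(f"Unsupported content type: {content_type}")

def predict_fn(input_data, model):
    # Run the pipeline and return the predicted label ("POSITIVE" or "NEGATIVE")
    return model(input_data)[0]["label"]

def output_fn(prediction, accept):
    # Serialize the prediction before it is returned as the response
    return json.dumps({"label": prediction})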

Deploying a pre-trained model to a real-time inference endpoint

In this section, we will use the SageMaker Python SDK to deploy a pre-trained model to a real-time inference endpoint. As the name suggests, a real-time inference endpoint can process input payloads and perform predictions in real time. If you have built an API endpoint before (one that can process GET and POST requests, for example), then you can think of an inference endpoint as an API endpoint that accepts an input request and returns a prediction as part of a response. How are predictions made? The inference endpoint simply loads the model into memory and uses it to process the input payload. This yields an output that is returned as a response. For example, if we have a pre-trained sentiment analysis ML model deployed in a real-time inference endpoint, then it would return a response of either "POSITIVE" or "NEGATIVE" depending on the input string payload provided in the request...
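
As a minimal sketch (assuming the model.tar.gz artifacts and the scripts directory prepared earlier in this chapter), deploying to a real-time inference endpoint with the SageMaker Python SDK could look like the following; the instance type and framework versions are illustrative:

from sagemaker import get_execution_role
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

model = PyTorchModel(
    model_data=model_data,          # S3 path of the model.tar.gz uploaded earlier
    role=get_execution_role(),
    source_dir="scripts",
    entry_point="inference.py",
    framework_version="1.10",
    py_version="py38",
)

# Provision a real-time inference endpoint backed by a dedicated instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

predictor.predict({"text": "I love reading the book MLE on AWS!"})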

Deploying a pre-trained model to a serverless inference endpoint

In the initial chapters of this book, we’ve worked with several serverless services that allow us to manage and reduce costs. If you are wondering whether there’s a serverless option when deploying ML models in SageMaker, then the answer to that would be a sweet yes. When you are dealing with intermittent and unpredictable traffic, using serverless inference endpoints to host your ML model can be a more cost-effective option. Let’s say that we can tolerate cold starts (where a request takes longer to process after periods of inactivity) and we only expect a few requests per day – then, we can make use of a serverless inference endpoint instead of the real-time option. Real-time inference endpoints are best used when we can maximize the utilization of the inference endpoint. If you’re expecting your endpoint to be utilized most of the time, then the real-time option may do the trick.
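
A minimal sketch of the serverless option, assuming the same PyTorchModel object from the previous section (the memory size and concurrency values are illustrative):

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,   # memory allocated to each invocation
    max_concurrency=5,        # maximum number of concurrent invocations
)

# No instance type or count is specified; SageMaker provisions compute on demand
predictor = model.deploy(serverless_inference_config=serverless_config)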

...

Deploying a pre-trained model to an asynchronous inference endpoint

In addition to real-time and serverless inference endpoints, SageMaker also offers a third option when deploying models – asynchronous inference endpoints. Why is it called asynchronous? For one thing, instead of expecting the results to be available immediately, requests are queued, and results are made available asynchronously. This works for ML requirements that involve one or more of the following:

  • Large input payloads (up to 1 GB)
  • A long prediction processing duration (up to 15 minutes)

A good use case for asynchronous inference endpoints would be for ML models that are used to detect objects in large video files (which may take more than 60 seconds to complete). In this case, an inference may take a few minutes instead of a few seconds.

How do we use asynchronous inference endpoints? To invoke an asynchronous inference endpoint, we do the following:

  1. The request payload is...
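
As an illustration, the sketch below deploys the same model behind an asynchronous inference endpoint and submits a request; the S3 output location, instance type, and payload are assumptions:

from sagemaker.async_inference import AsyncInferenceConfig
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

async_config = AsyncInferenceConfig(
    output_path="s3://<INSERT BUCKET NAME>/async-output"   # where results are written once ready
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=async_config,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# The call returns immediately; the prediction is written to the S3 output path
response = predictor.predict_async(data={"text": "I love reading the book MLE on AWS!"})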

Cleaning up

Now that we have completed working on the hands-on solutions of this chapter, it is time for us to clean up and turn off any resources we will no longer use. In the next set of steps, we will locate and turn off any remaining running instances in SageMaker Studio:

  1. Click the Running Instances and Kernels icon in the sidebar, as highlighted in Figure 7.21:

Figure 7.21 – Turning off the running instance

Clicking the Running Instances and Kernels icon should open and show the running instances, apps, and terminals in SageMaker Studio.

  2. Turn off all running instances under RUNNING INSTANCES by clicking the Shut down button for each of the instances, as highlighted in Figure 7.21. Clicking the Shut down button will open a pop-up window verifying the instance shutdown operation. Click the Shut down all button to proceed.
  3. Make sure to check for and delete all the running inference endpoints under SageMaker resources as well...
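
If the Predictor objects from earlier in this chapter are no longer available in the notebook, a remaining endpoint can also be deleted programmatically; here is a minimal sketch where the endpoint name is a placeholder:

from sagemaker.predictor import Predictor

# Replace with the name of an inference endpoint that is still running
predictor = Predictor(endpoint_name="<INSERT NAME OF EXISTING ENDPOINT>")
predictor.delete_endpoint()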

Deployment strategies and best practices

In this section, we will discuss the relevant deployment strategies and best practices when using the SageMaker hosting services. Let’s start by talking about the different ways we can invoke an existing SageMaker inference endpoint. The approach we’ve been using so far involves using the SageMaker Python SDK to invoke an existing endpoint:

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
endpoint_name = "<INSERT NAME OF EXISTING ENDPOINT>"
predictor = Predictor(endpoint_name=endpoint_name)
predictor.serializer = JSONSerializer() 
predictor.deserializer = JSONDeserializer()
payload = {
    "text": "I love reading the book MLE on AWS!"
}
predictor.predict(payload)

Here, we initialize a Predictor object and point it to an existing inference endpoint during the initialization step...
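
Another common way to invoke an existing endpoint (for example, from an application or AWS Lambda function that does not have the SageMaker Python SDK installed) is through the sagemaker-runtime client of boto3. A minimal sketch, assuming the same JSON payload format used above:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="<INSERT NAME OF EXISTING ENDPOINT>",
    ContentType="application/json",
    Body=json.dumps({"text": "I love reading the book MLE on AWS!"}),
)

print(response["Body"].read().decode("utf-8"))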

Summary

In this chapter, we focused on several deployment options and solutions using SageMaker. We deployed a pre-trained model to three different types of inference endpoints – (1) a real-time inference endpoint, (2) a serverless inference endpoint, and (3) an asynchronous inference endpoint. We also discussed the differences between these approaches, along with when each option is best used when deploying ML models. Toward the end of this chapter, we talked about some of the deployment strategies, along with the best practices when using SageMaker for model deployments.

In the next chapter, we will dive deeper into SageMaker Model Registry and SageMaker Model Monitor, which are capabilities of SageMaker that can help us manage and monitor our models in production.

Further reading

For more information on the topics covered in this chapter, feel free to check out the following resources:
