You're reading from  Machine Learning Engineering on AWS

Product typeBook
Published inOct 2022
PublisherPackt
ISBN-139781803247595
Edition1st Edition
Author (1)
Joshua Arvin Lat

Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO of three Australian-owned companies and as director of software development and engineering for multiple e-commerce start-ups. Years ago, he and his team won first place in a global cybersecurity competition with their published research paper. He is also an AWS Machine Learning Hero and has shared his knowledge at several international conferences, discussing practical strategies on machine learning, engineering, security, and management.

Machine Learning Pipelines with SageMaker Pipelines

In Chapter 10, Machine Learning Pipelines with Kubeflow on Amazon EKS, we used Kubeflow, Kubernetes, and Amazon EKS to build and run an end-to-end machine learning (ML) pipeline. There, we automated several steps of the ML process inside a running Kubernetes cluster. If you are wondering whether we can also build ML pipelines using the different features and capabilities of SageMaker, the quick answer is YES!

In this chapter, we will use SageMaker Pipelines to build and run automated ML workflows. In addition to this, we will demonstrate how we can utilize AWS Lambda functions to deploy trained models to new (or existing) ML inference endpoints during pipeline execution.

That said, in this chapter, we will cover the following topics:

  • Diving deeper into SageMaker Pipelines
  • Preparing the essential prerequisites
  • Running our first pipeline with SageMaker Pipelines
  • Creating Lambda...

Technical requirements

Before we start, it is important that we have the following ready:

  • A web browser (preferably Chrome or Firefox)
  • Access to the AWS account and the SageMaker Studio domain used in the previous chapters of this book
  • A text editor (for example, VS Code) on your local machine, which we will use to store and copy string values for later use in this chapter

The Jupyter notebooks, source code, and other files used for each chapter are available in the repository at https://github.com/PacktPublishing/Machine-Learning-Engineering-on-AWS.

Important Note

It is recommended that you use an IAM user with limited permissions instead of the root account when running the examples in this book. If you are just starting out with AWS, you can proceed with using the root account in the meantime.

Diving deeper into SageMaker Pipelines

Often, data science teams start by performing ML experiments and deployments manually. Once they need to standardize the workflow and enable automated model retraining to refresh the deployed models regularly, these teams start considering the use of ML pipelines to automate a portion of their work. In Chapter 6, SageMaker Training and Debugging Solutions, we learned how to use the SageMaker Python SDK to train an ML model. Generally, training an ML model with the SageMaker Python SDK involves running a few lines of code similar to what we have in the following block of code:

estimator = Estimator(...) 
estimator.set_hyperparameters(...)
estimator.fit(...)

What if we wanted to prepare an automated ML pipeline and include this as one of the steps? You would be surprised that all we need to do is add a few lines of code to convert this into a step that can be included in a pipeline! To convert this into a step using SageMaker Pipelines...
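To give a feel for what those extra lines look like, here is a minimal sketch of wrapping an estimator in a pipeline step using the SageMaker Python SDK. The step name, pipeline name, image URI, and S3 paths below are placeholders, and the exact arguments may vary across SDK versions; this is illustrative wiring, not the chapter's verbatim code:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# Placeholder values; replace with the role, image, and bucket
# used in your own environment.
role = "arn:aws:iam::<account>:role/<execution-role>"

estimator = Estimator(
    image_uri="<training-image-uri>",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://<bucket>/output/",
)
# estimator.set_hyperparameters(...) as before

# Wrap the estimator in a TrainingStep so it can run inside a pipeline
step_train = TrainingStep(
    name="TrainModel",  # hypothetical step name
    estimator=estimator,
    inputs={"train": TrainingInput(s3_data="s3://<bucket>/train/")},
)

pipeline = Pipeline(name="my-first-pipeline", steps=[step_train])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # trigger a pipeline execution
```

Notice that the estimator itself is unchanged; the pipeline-specific code is only the step and pipeline definitions around it.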

Preparing the essential prerequisites

In this section, we will ensure that the following prerequisites are ready:

  • The SageMaker Studio Domain execution role with the AWSLambda_FullAccess AWS managed permission policy attached to it – This will allow the Lambda functions to run without issues in the Completing the end-to-end ML pipeline section of this chapter.
  • The IAM role (pipeline-lambda-role) – This will be used to run the Lambda functions in the Creating Lambda Functions for Deployment section of this chapter.
  • The processing.py file – This will be used by the SageMaker Processing job to process the input data and split it into training, validation, and test sets.
  • The bookings.all.csv file – This will be used as the input dataset for the ML pipeline.
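The actual processing.py script is available in the book's repository; as a rough sketch of the splitting logic such a script performs, a minimal version (with illustrative 70/20/10 ratios, using plain Python rather than whatever libraries the real script uses) might look like this:

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle rows and split them into train/validation/test lists.

    The fractions and seed here are illustrative; the actual processing.py
    in the book's repository may use different ratios and tooling.
    """
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test
```

In the pipeline, this logic runs inside a SageMaker Processing job, which reads the input CSV from S3 and writes the three splits back to S3 for the training step to consume.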

Important Note

In this chapter, we will create and manage our resources in the Oregon (us-west-2) region. Make sure that you have set the correct region before proceeding with...

Running our first pipeline with SageMaker Pipelines

In Chapter 1, Introduction to ML Engineering on AWS, we installed and used AutoGluon to train multiple ML models (with AutoML) inside an AWS Cloud9 environment. In addition to this, we performed the different steps of the ML process manually using a variety of tools and libraries. In this chapter, we will convert these manually executed steps into an automated pipeline so that all we need to do is provide an input dataset and the ML pipeline will do the rest of the work for us (and store the trained model in a model registry).

Note

Instead of preparing a custom Docker container image to use AutoGluon for training ML models, we will use the built-in AutoGluon-Tabular algorithm. With a built-in algorithm available for use, all we need to worry about are the hyperparameter values and the additional configuration parameters we will use to configure the training job.
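As a rough illustration of what that configuration amounts to, a hyperparameter map for the training job might look like the following. The key names and values here are hypothetical; the actual keys accepted by the built-in AutoGluon-Tabular algorithm are listed in its documentation and in this chapter's notebooks:

```python
# Hypothetical hyperparameter names and values, for illustration only.
hyperparameters = {
    "eval_metric": "roc_auc",       # metric used to rank candidate models
    "presets": "medium_quality",    # speed/quality trade-off preset
}
```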

That said, this section is divided into two parts...

Creating Lambda functions for deployment

Our second (and more complete pipeline) will require a few additional resources to help us deploy our ML model. In this section, we will create the following Lambda functions:

  • check-if-endpoint-exists – This Lambda function accepts the name of the ML inference endpoint as input and returns True if the endpoint already exists
  • deploy-model-to-new-endpoint – This Lambda function accepts the model package ARN as input (along with the role and the endpoint name) and deploys the model to a new inference endpoint
  • deploy-model-to-existing-endpoint – This Lambda function accepts the model package ARN as input (along with the role and the endpoint name) and deploys the model to an existing inference endpoint (by updating the deployed model inside the ML instance)
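To make the first of these concrete, here is a sketch of what a check-if-endpoint-exists handler could look like. The event shape (an "endpoint_name" key) and the injectable `client` parameter are assumptions made for illustration and offline testing; the book's actual function may be structured differently:

```python
def lambda_handler(event, context, client=None):
    """Return whether a SageMaker inference endpoint already exists.

    Sketch only: the event shape and the `client` parameter are
    assumptions, not the book's verbatim implementation.
    """
    if client is None:
        import boto3  # imported lazily so the logic can be tested offline
        client = boto3.client("sagemaker")

    endpoint_name = event["endpoint_name"]
    response = client.list_endpoints(NameContains=endpoint_name)
    names = [e["EndpointName"] for e in response["Endpoints"]]
    return {"endpoint_exists": endpoint_name in names}
```

The pipeline can branch on the returned flag to decide between the deploy-model-to-new-endpoint and deploy-model-to-existing-endpoint functions.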

We will use these functions later in the Completing the end-to-end ML pipeline section to deploy the ML model we...

Testing our ML inference endpoint

Of course, we need to check whether the ML inference endpoint is working! In the next set of steps, we will download and run a Jupyter notebook (named Test Endpoint and then Delete.ipynb) that tests our ML inference endpoint using the test dataset:

  1. Let’s begin by opening the following link in another browser tab: https://bit.ly/3xyVAXz
  2. Right-click on any part of the page to open a context menu, and then choose Save as... from the list of available options. Save the file as Test Endpoint then Delete.ipynb, and then download it to the Downloads folder (or similar) on your local machine.
  3. Navigate back to your SageMaker Studio environment. In the File Tree (located on the left-hand side of the SageMaker Studio environment), make sure that you are in the CH11 folder similar to what we have in Figure 11.15:

Figure 11.15 – Uploading the Test Endpoint then Delete.ipynb file

  4. Click on the...
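The core of what such a test notebook does is send rows from the test dataset to the endpoint and read back the predictions. As a sketch (a hypothetical helper; the chapter's notebook may use the SageMaker Python SDK's Predictor class rather than boto3 directly), it boils down to something like:

```python
def predict(features, endpoint_name, runtime=None):
    """Send one row of features to an inference endpoint as CSV.

    Hypothetical helper for illustration; payload format and response
    parsing depend on how the deployed model was configured.
    """
    if runtime is None:
        import boto3  # imported lazily so the logic can be tested offline
        runtime = boto3.client("sagemaker-runtime")

    payload = ",".join(str(f) for f in features)
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    return response["Body"].read().decode("utf-8")
```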

Completing the end-to-end ML pipeline

In this section, we will build on top of the (partial) pipeline we prepared in the Running our first pipeline with SageMaker Pipelines section of this chapter. In addition to the steps and resources used to build our partial pipeline, we will also utilize the Lambda functions we created (in the Creating Lambda functions for deployment section) to complete our ML pipeline.

Defining and preparing the complete ML pipeline

The second pipeline we will prepare is slightly longer than the first. To help us visualize how our second ML pipeline using SageMaker Pipelines will look, let’s quickly check Figure 11.16:

Figure 11.16 – Our second ML pipeline using SageMaker Pipelines

Here, we can see that our pipeline accepts two input parameters—the input dataset and the endpoint name. When the pipeline runs, the input dataset is first split into training, validation, and test sets. The...

Cleaning up

Now that we have completed the hands-on solutions of this chapter, it is time to clean up and turn off the resources we will no longer use. In the next set of steps, we will locate and turn off any remaining running instances in SageMaker Studio:

  1. Make sure to check and delete all running inference endpoints under SageMaker resources (if any). To check whether there are running inference endpoints, click on the SageMaker resources icon and then select Endpoints from the list of options in the drop-down menu.
  2. Open the File menu and select Shut down from the list of available options. This should turn off all running instances inside SageMaker Studio.

It is important to note that this cleanup operation needs to be performed after using SageMaker Studio. These resources are not turned off automatically by SageMaker even during periods of inactivity. Make sure to review whether all delete operations have succeeded before proceeding to the next section...
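If you prefer to double-check programmatically that no inference endpoints survived the cleanup, a small script along these lines can list and delete any that remain. This is a sketch written for this chapter's assumption that every endpoint in the region is disposable; review the list first before running anything like it in a shared account:

```python
def delete_all_endpoints(client=None):
    """Delete every SageMaker inference endpoint in the current region.

    Cleanup sketch: assumes all endpoints in the region belong to this
    chapter's exercises and are safe to delete.
    """
    if client is None:
        import boto3  # imported lazily so the logic can be tested offline
        client = boto3.client("sagemaker")

    deleted = []
    for endpoint in client.list_endpoints()["Endpoints"]:
        name = endpoint["EndpointName"]
        client.delete_endpoint(EndpointName=name)
        deleted.append(name)
    return deleted
```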

Summary

In this chapter, we used SageMaker Pipelines to build end-to-end automated ML pipelines. We started by preparing a relatively simple pipeline with three steps: data preparation, model training, and model registration. After preparing and defining the pipeline, we triggered a pipeline execution that registered a newly trained model to the SageMaker Model Registry.

Then, we prepared three AWS Lambda functions that would be used for the model deployment steps of the second ML pipeline. After preparing the Lambda functions, we proceeded with completing the end-to-end ML pipeline by adding a few additional steps to deploy the model to a new or existing ML inference endpoint. Finally, we discussed relevant best practices and strategies to secure, scale, and manage ML pipelines using the technology stack we used in this chapter.

You’ve finally reached the end of this...

Further reading

At this point, you might want to dive deeper into the relevant subtopics discussed by checking the references listed in the Further reading section of each of the previous chapters. In addition to these, you can check the following resources, too:

