You're reading from  Serverless Machine Learning with Amazon Redshift ML

Product type: Book
Published in: Aug 2023
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781804619285
Edition: 1st Edition
Authors (4):
Debu Panda

Debu Panda, a Senior Manager, Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt, 2009).

Phil Bates

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and leveraging the power of ML within their data warehouse.

Bhanu Pittampally

Bhanu Pittampally is an Analytics Specialist Solutions Architect at Amazon Web Services. His background is in data and analytics, and he has been in the field for over 16 years. He currently lives in Frisco, TX with his wife Kavitha and daughters Vibha and Medha.

Sumeet Joshi

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 17 years of experience in the data warehousing and analytical space.

Creating a Custom ML Model with XGBoost

So far, all of the supervised learning models we have explored have utilized the Amazon Redshift Auto ML feature, which uses Amazon SageMaker Autopilot behind the scenes. In this chapter, we will explore how to create custom machine learning (ML) models. Training a custom model gives you the flexibility to choose the model type and the hyperparameters to use. This chapter will provide examples of this modeling technique. By the end of this chapter, you will know how to create a custom XGBoost model and how to prepare the data to train your model using Redshift SQL.

In this chapter, we will go through the following main topics:

  • Introducing XGBoost
  • Introducing an XGBoost use case
  • Creating a model using XGBoost with Auto Off

Technical requirements

This chapter requires a web browser and access to the following:

  • An AWS account
  • An Amazon Redshift Serverless endpoint
  • Amazon Redshift Query Editor v2

You can find the code used in this chapter here:

https://github.com/PacktPublishing/Serverless-Machine-Learning-with-Amazon-Redshift/blob/main/CodeFiles/chapter10/chapter10.sql

Introducing XGBoost

XGBoost stands for eXtreme Gradient Boosting; the name reflects that it is built on the gradient boosting framework. Its tree-boosting technique provides a fast method for solving ML problems. As you have seen in previous chapters, you can specify the model type, which can help speed up model training since SageMaker Autopilot does not have to determine which model type to use.

You can learn more about XGBoost here: https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html.
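To make the tree-boosting idea more concrete, here is a minimal sketch in plain Python. It is not XGBoost itself and not what Redshift ML runs: it fits one-split trees (stumps) to the residuals of a squared-error loss, and the function names (fit_stump, boost, predict, mse) and the toy data are assumptions made up for this illustration. XGBoost adds regularization, second-order gradient information, and many engineering optimizations on top of this basic loop.

```python
# Minimal gradient boosting with one-split trees (stumps) and squared error.
# Illustrative sketch only; all names here are hypothetical.

def fit_stump(xs, residuals):
    """Find the split on x that best fits the residuals.

    Assumes xs contains at least two distinct values.
    Returns (threshold, left_value, right_value).
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, rounds=10, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds it to the ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: the label switches from 0 to 1 between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

for rounds in (0, 1, 10):
    print(rounds, mse(boost(xs, ys, rounds=rounds), xs, ys))
```

The training error shrinks as rounds are added, which is the behavior that a hyperparameter such as the number of boosting rounds controls in a real XGBoost model.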

When you create a model with Redshift ML and specify XGBoost as the model type, you can optionally specify AUTO OFF. This turns off SageMaker Autopilot and gives you more control over model tuning; for example, you can specify the hyperparameters you wish to use. You will see an example of this in the Creating a binary classification model using XGBoost section.

You will have to perform preprocessing when you set AUTO to OFF. Carrying out the preprocessing ensures we will get the best possible model and is also necessary...

Introducing an XGBoost use case

In this section, we will be discussing a use case where we want to predict whether credit card transactions are fraudulent. We will be going through the following steps:

  • Defining the business problem
  • Uploading, analyzing, and preparing data for training
  • Splitting data into training and testing datasets
  • Preprocessing the input variables
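As a rough illustration of the splitting step listed above, the following sketch assigns each row to a training or testing set by hashing an identifier, so the same row always lands in the same set. The chapter's own split logic is not shown in this excerpt; the tx_id column name, the 80/20 ratio, and the helper names are assumptions for illustration only.

```python
import hashlib


def assign_split(row_id, test_fraction=0.2):
    """Deterministically map an identifier to 'train' or 'test'."""
    digest = hashlib.md5(str(row_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "test" if bucket < round(test_fraction * 100) else "train"


def split_rows(rows, key, test_fraction=0.2):
    """Split a list of dict rows into (train, test) using the given key column."""
    train, test = [], []
    for row in rows:
        if assign_split(row[key], test_fraction) == "test":
            test.append(row)
        else:
            train.append(row)
    return train, test


# Hypothetical rows; tx_id and tx_amount are made-up column names.
rows = [{"tx_id": i, "tx_amount": 10.0 + i} for i in range(1000)]
train_rows, test_rows = split_rows(rows, key="tx_id")
print(len(train_rows), len(test_rows))
```

A hash-based assignment keeps the split reproducible across runs, which makes later validation results comparable.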

Defining the business problem

In this section, we will use a credit card payment transaction dataset to build a binary classification model using XGBoost in Redshift ML. This dataset contains customer and terminal information along with the date and amount related to the transaction. This dataset also has some derived fields based on recency, frequency, and monetary numeric features, along with a few categorical variables, such as whether a transaction occurred during the weekend or at night. Our goal is to identify whether a transaction is fraudulent or non-fraudulent. This use case is taken...

Creating a model using XGBoost with Auto Off

In this exercise, we are going to create a custom binary classification model using the XGBoost algorithm. You can achieve this by specifying AUTO OFF. Here are the parameters that are available:

  • AUTO OFF
  • MODEL_TYPE
  • OBJECTIVE
  • HYPERPARAMETERS

For the complete list of hyperparameter values that are available and their defaults, please read the documentation found here:

https://docs.aws.amazon.com/redshift/latest/dg/r_create_model_use_cases.html#r_auto_off_create_model
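To show how these parameters fit together, here is a hedged sketch that assembles the text of a CREATE MODEL statement in Python. The clause names and their order follow my reading of the Redshift CREATE MODEL documentation linked above, so verify them there before running anything. The model, table, column, function, and bucket names are hypothetical placeholders, not the ones used in the chapter, and the helper does no escaping of identifiers.

```python
def build_create_model_sql(model_name, source_table, target_column,
                           function_name, s3_bucket,
                           objective="binary:logistic",
                           hyperparameters=None):
    """Return CREATE MODEL text for an XGBoost model with AUTO OFF.

    Illustration only: inputs are inserted as-is, with no escaping.
    """
    # NUM_ROUND is the number of boosting rounds; 100 is an arbitrary example value.
    params = hyperparameters or {"NUM_ROUND": 100}
    param_clause = ", ".join(
        f"{name} '{value}'" for name, value in params.items()
    )
    return "\n".join([
        f"CREATE MODEL {model_name}",
        f"FROM {source_table}",
        f"TARGET {target_column}",
        f"FUNCTION {function_name}",
        "IAM_ROLE default",
        "AUTO OFF",
        "MODEL_TYPE XGBOOST",
        f"OBJECTIVE '{objective}'",
        "PREPROCESSORS 'none'",
        f"HYPERPARAMETERS DEFAULT EXCEPT ({param_clause})",
        f"SETTINGS (S3_BUCKET '{s3_bucket}');",
    ])


# All names below are hypothetical placeholders.
statement = build_create_model_sql(
    model_name="example_fraud_model",
    source_table="example_schema.example_training_table",
    target_column="is_fraud",
    function_name="example_fraud_predict",
    s3_bucket="example-bucket-name",
    hyperparameters={"NUM_ROUND": 100, "MAX_DEPTH": 3},
)
print(statement)
```

With AUTO OFF, the hyperparameters you list replace the defaults and every other hyperparameter keeps its default value, which is what the DEFAULT EXCEPT wording expresses.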

Now that you have a basic understanding of the parameters available with XGBoost, you can create the model.

Creating a binary classification model using XGBoost

Let’s create a model to predict whether a transaction is fraudulent or non-fraudulent. As you learned in the previous chapters, creating a model with Amazon Redshift ML only requires running a SQL command, which also creates a function. As inputs (or features), you will be using...

Summary

In this chapter, you learned what XGBoost is and how to apply it to a business problem. You learned how to specify your own hyperparameters when using the Auto Off option and how to specify the objective for a binary classification problem. Additionally, you learned how to do your own data preprocessing and calculate the F1 score to validate the model performance.
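As a reminder of what the F1 score measures, here is a small sketch that computes it from actual and predicted labels. The label lists are made-up values for illustration, not results from the chapter's model, and the function name is an assumption.

```python
def f1_score(actual, predicted, positive=1):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    if true_pos == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = fraudulent, 0 = non-fraudulent.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
score = f1_score(actual, predicted)
print(score)
```

F1 is a common choice for fraud detection because fraudulent transactions are rare, and plain accuracy can look high even when most fraud is missed.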

In the next chapter, you will learn how to bring your own models from Amazon SageMaker for in-database or remote inference.
