You're reading from  Serverless Machine Learning with Amazon Redshift ML

Product type: Book
Published in: Aug 2023
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781804619285
Edition: 1st Edition
Authors (4):
Debu Panda

Debu Panda, a Senior Manager, Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt, 2009).

Phil Bates

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and leveraging the power of ML within their data warehouse.

Bhanu Pittampally

Bhanu Pittampally is an Analytics Specialist Solutions Architect at Amazon Web Services. His background is in data and analytics, and he has been in the field for over 16 years. He currently lives in Frisco, TX with his wife Kavitha and daughters Vibha and Medha.

Sumeet Joshi

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 17 years of experience in the data warehousing and analytical space.

Building Unsupervised Models with K-Means Clustering

So far, we have learned about building machine learning (ML) models where data is supplied with labels. In this chapter, we will learn about building ML models on a dataset without any labels by using the K-means clustering algorithm. Unlike supervised models, where predictions are made at the observation level, K-means clustering groups observations into clusters where they share a commonality – for example, similar demographics or reading habits.

This chapter will provide detailed examples of business problems that can be solved with these modeling techniques. By the end of this chapter, you will be in a position to identify a business problem that an unsupervised modeling technique can be applied to. You will also learn how to build and train K-means models and evaluate their performance.

In this chapter, we will cover the following main topics:

  • Grouping data through cluster analysis
  • Creating a K-means ML model...

Technical requirements

This chapter requires a web browser and access to the following:

  • An AWS account
  • An Amazon Redshift Serverless endpoint
  • Amazon Redshift Query Editor v2
  • Complete the Getting started with Amazon Redshift Serverless section in Chapter 1

You can find the code used in this chapter here: https://github.com/PacktPublishing/Serverless-Machine-Learning-with-Amazon-Redshift/blob/main/CodeFiles/chapter8/chapter8.sql.

Grouping data through cluster analysis

So far, we have explored datasets that contained input and target variables, and we trained a model with a set of input variables and a target variable. This is called supervised learning. However, how do you address a dataset that does not contain a label to supervise the training? Amazon Redshift ML supports unsupervised learning through cluster analysis, using the K-means algorithm. In cluster analysis, the ML algorithm automatically discovers the grouping of data points. For example, if you have a population of 1,000 people, a clustering algorithm can group them based on height, weight, or age.
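To make the idea of automatically discovered groups concrete, here is a minimal sketch of K-means in plain Python. It is an illustration only, not how Amazon Redshift ML implements the algorithm; the `people` data, the function names, and the random starting centroids are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Basic K-means loop: assign points, then move centroids to cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assign(points, centroids)


# Hypothetical (height_cm, weight_kg) observations; not taken from the book.
people = [(150, 50), (152, 53), (155, 52), (180, 85), (183, 88), (185, 90)]
centroids, labels = kmeans(people, k=2)
print(labels)  # one cluster index per person
```

No labels are supplied anywhere: the two groups emerge only from the distances between observations. Which group is numbered 0 and which is numbered 1 depends on the starting centroids.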

Unlike supervised learning, where an ML model learns to predict an outcome from labeled data, unsupervised models use unlabeled data. One type of unsupervised learning is clustering, where unlabeled data is grouped based on its similarities or differences. From a dataset with demographic information about individuals, you can create clusters...

Determining the optimal number of clusters

One popular method is the Elbow method. The idea of the Elbow method is to run the K-means algorithm with different values of K – for example, from 1 cluster all the way to 10 – and, for each value of K, calculate the sum of squared deviations (SSD), that is, the sum of the squared differences between each data point and the center of its assigned cluster. SSD is a measure of the variance within the clusters. Then, plot a chart of the SSD values against K. If the line chart looks like an arm, then the elbow on the arm is the best of the K values tested. The reasoning behind this approach is that SSD usually tends to decrease as the value of K increases, and the goal of the evaluation is to aim for lower SSD or mean squared deviation (MSD) values. The elbow marks the point at which increasing K starts to yield diminishing returns in SSD.
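The Elbow method described above can be sketched in a few lines of plain Python. This is a hedged illustration rather than the procedure Amazon Redshift ML runs internally: the toy `points`, the helper names, and the deterministic farthest-first choice of starting centroids are assumptions made so that the example is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def kmeans_ssd(points, k, iterations=20):
    """Run a basic K-means and return the sum of squared deviations (SSD)."""
    # Deterministic farthest-first start (an assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return sum(squared_distance(p, centroids[nearest(p, centroids)]) for p in points)


# Hypothetical 2-D data with three visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 9)]

# SSD for each candidate K; plotting these values gives the "arm" chart.
ssd_by_k = {k: kmeans_ssd(points, k) for k in range(1, 6)}
for k, ssd in ssd_by_k.items():
    print(f"K={k}  SSD={ssd:.2f}")
```

On this toy data, the SSD falls sharply up to K = 3 and only marginally afterwards, so K = 3 is where the elbow appears. Real datasets rarely show such a clean bend, which is why the chart is usually inspected visually.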

In the following chart, you can see that the MSD value, when charted over different K...

Creating a K-means ML model

In this section, we will walk through the process with the help of a use case. In this use case, assume you are a data analyst for an e-commerce company specializing in home improvement goods. You have been tasked with grouping regions into economic segments based on income so that you can better target customers using factors such as median home value. We will use this dataset from Kaggle: https://www.kaggle.com/datasets/camnugent/california-housing-prices.

From this dataset, you will use the median_income, latitude, and longitude attributes so that you can create clusters based on location and income.

The syntax to create a K-means model is slightly different from what you have used up to this point, so let’s dive into that.

Creating a model syntax overview for K-means clustering

Here is the basic syntax to create a K-means model:

CREATE MODEL model_name
FROM (select_statement)
FUNCTION function_name...

Evaluating the results of the K-means clustering

Now that you have segmented your clusters with the K-means algorithm, you are ready to perform various analyses using the model you created.

Here is an example query you can run to get the average median house value by cluster:

select avg(median_house_value) as avg_median_house_value,
chapter8_kmeans_clustering.get_housing_segment_k3(
    median_income, latitude, longitude) as cluster
from chapter8_kmeans_clustering.housing_prices
group by 2
order by 1;

The output will look like this:

Figure 8.12 – Average median house values

You can also run a query to see whether the clusters with higher median incomes are the same clusters that have higher home values. Run the following query:

select avg(median_income) as median_income,
chapter8_kmeans_clustering.get_housing_segment_k3(
    median_income, latitude, longitude) as cluster
from chapter8_kmeans_clustering.housing_prices
group by 2
order...

Summary

In this chapter, we discussed how to do unsupervised learning with the K-means algorithm.

You are now able to explain what the K-means algorithm is and what use cases it is appropriate for. Also, you can use Amazon Redshift ML to create a K-means model, determine the appropriate number of clusters, and draw conclusions by analyzing the clusters to help make business decisions.

In the next chapter, we will show you how to use the multi-layer perceptron algorithm to perform deep learning with Amazon Redshift ML.
