You're reading from Serverless Machine Learning with Amazon Redshift ML

Product type: Book

Published in: Aug 2023

Reading Level: Beginner

Publisher: Packt

ISBN-13: 9781804619285

Edition: 1st Edition

Languages

Python

Tools

Amazon Redshift

Concepts

Machine Learning

Authors (4):

Debu Panda

Phil Bates

Bhanu Pittampally

Sumeet Joshi

Building Unsupervised Models with K-Means Clustering

So far, we have learned about building machine learning (ML) models where data is supplied with labels. In this chapter, we will learn about building ML models on a dataset without any labels by using the K-means clustering algorithm. Unlike supervised models, where predictions are made at the observation level, K-means clustering groups observations into clusters where they share a commonality – for example, similar demographics or reading habits.

This chapter will provide detailed examples of business problems that can be solved with these modeling techniques. By the end of this chapter, you will be in a position to identify a business problem that an unsupervised modeling technique can be applied to. You will also learn how to build and train K-means models and evaluate their performance.

In this chapter, we will cover the following main topics:

Grouping data through cluster analysis
Creating a K-means ML model...

Technical requirements

This chapter requires a web browser and access to the following:

An AWS account
An Amazon Redshift Serverless endpoint
Amazon Redshift Query Editor v2
Completion of the Getting started with Amazon Redshift Serverless section in Chapter 1

You can find the code used in this chapter here: https://github.com/PacktPublishing/Serverless-Machine-Learning-with-Amazon-Redshift/blob/main/CodeFiles/chapter8/chapter8.sql.

Grouping data through cluster analysis

So far, we have explored datasets that contained input and target variables, and we trained a model with a set of input variables and a target variable. This is called supervised learning. However, how do you address a dataset that does not contain a label to supervise the training? Amazon Redshift ML supports unsupervised learning through cluster analysis, using the K-means algorithm. In cluster analysis, the ML algorithm automatically discovers groupings of data points. For example, if you have a population of 1,000 people, a clustering algorithm can group them based on height, weight, or age.
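To make the idea of automatically discovered groupings concrete, here is a minimal sketch of textbook K-means (Lloyd's algorithm) in plain Python. It is an illustration only, not the implementation that Amazon Redshift ML runs. The kmeans function, the people rows, and the choice to seed centroids with the first k points are all assumptions made for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters with Lloyd's algorithm (plain K-means)."""
    # Assumption: seed the centroids with the first k points so the run is
    # deterministic; real implementations usually choose seeds more carefully.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical (height_cm, weight_kg) rows, made up for illustration.
people = [(150, 50), (152, 53), (155, 52), (180, 85), (183, 90), (185, 88)]
labels, centroids = kmeans(people, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No labels were supplied, yet the three shorter, lighter people end up in one cluster and the three taller, heavier people in the other.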

Unlike supervised learning, where an ML model predicts an outcome based on a label, unsupervised models use unlabeled data. One type of unsupervised learning is clustering, where unlabeled data is grouped based on its similarity or differences. From a dataset with demographic information about individuals, you can create clusters...

Determining the optimal number of clusters

One frequently adopted method is the Elbow method. The idea is to run the K-means algorithm with different values of K (for example, from 1 cluster all the way to 10) and, for each value of K, calculate the sum of squared deviations (SSD): the sum of the squared distances between each data point and the center of its assigned cluster, which measures the variance within the clusters. Then, plot the SSD values against K. If the line chart looks like an arm, then the elbow on the arm marks the best value of K among those tried. The reasoning behind this approach is that SSD tends to decrease as K increases, and although a lower SSD or mean squared deviation (MSD) is the goal, adding more clusters eventually stops helping much. The elbow represents the point where increasing K starts to yield diminishing returns in SSD.
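The Elbow method can be sketched in plain Python as follows. This is an illustration under stated assumptions, not how Amazon Redshift ML evaluates a model. The ssd_for_k and pick_elbow helpers, the made-up incomes values, the evenly spaced seeding, and the 10% min_gain threshold are all hypothetical choices for this example. In practice, you would usually read the elbow off the chart rather than compute it.

```python
def ssd_for_k(values, k, iterations=50):
    """Run plain 1-D K-means and return the sum of squared deviations (SSD)."""
    ordered = sorted(values)
    # Assumption: seed centroids at evenly spaced positions in the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            clusters[nearest].append(v)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            sum(m) / len(m) if m else centroids[i] for i, m in enumerate(clusters)
        ]
    # SSD: squared distance from every value to the center of its cluster.
    return sum((v - centroids[i]) ** 2 for i, m in enumerate(clusters) for v in m)


def pick_elbow(ssd_by_k, min_gain=0.10):
    """Return the smallest K after which one more cluster cuts SSD by less
    than min_gain of the SSD at the smallest K (an arbitrary threshold)."""
    ks = sorted(ssd_by_k)
    baseline = ssd_by_k[ks[0]]
    for k, next_k in zip(ks, ks[1:]):
        if ssd_by_k[k] - ssd_by_k[next_k] < min_gain * baseline:
            return k
    return ks[-1]


# Made-up median_income values that form three visible groups.
incomes = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.0, 9.4, 8.9]
ssd_by_k = {k: ssd_for_k(incomes, k) for k in range(1, 6)}
for k, ssd in ssd_by_k.items():
    print(k, round(ssd, 3))
print("elbow at K =", pick_elbow(ssd_by_k))  # elbow at K = 3
```

SSD keeps falling as K grows, but after K = 3 the improvement is tiny compared with the earlier drops, which is the diminishing-returns point that the elbow marks.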

In the following chart, you can see that the MSD value, when charted over different K...

Creating a K-means ML model

In this section, we will walk through the process with the help of a use case. In this use case, assume you are a data analyst for an e-commerce company specializing in home improvement goods. You have been tasked with identifying economic segments across different regions, based on income, so that you can better target customers using factors such as median home value. We will use this dataset from Kaggle: https://www.kaggle.com/datasets/camnugent/california-housing-prices.

From this dataset, you will use the median_income, latitude, and longitude attributes so that you can create clusters based on location and income.

The syntax to create a K-means model is slightly different from what you have used up to this point, so let’s dive into that.

Creating a model syntax overview for K-means clustering

Here is the basic syntax to create a K-means model:

CREATE MODEL model_name
FROM (select_statement)
FUNCTION function_name...

Evaluating the results of the K-means clustering

Now that you have segmented your clusters with the K-means algorithm, you are ready to perform various analyses using the model you created.

Here is an example query you can run to get the average median house value by cluster:

-- Average median house value for each K-means cluster
select avg(median_house_value) as avg_median_house_value,
chapter8_kmeans_clustering.get_housing_segment_k3(
    median_income, latitude, longitude) as cluster
from chapter8_kmeans_clustering.housing_prices
group by 2
order by 1;

The output will look like this:

Figure 8.12 – Average median house values

You can also check whether the clusters with higher median incomes are the same clusters that have higher home values. Run the following query:

select avg(median_income) as median_income,
chapter8_kmeans_clustering.get_housing_segment_k3(
    median_income, latitude, longitude) as cluster
from chapter8_kmeans_clustering.housing_prices
group by 2
order...

Summary

In this chapter, we discussed how to do unsupervised learning with the K-means algorithm.

You are now able to explain what the K-means algorithm is and what use cases it is appropriate for. Also, you can use Amazon Redshift ML to create a K-means model, determine the appropriate number of clusters, and draw conclusions by analyzing the clusters to help make business decisions.

In the next chapter, we will show you how to use the multi-layer perceptron algorithm to perform deep learning with Amazon Redshift ML.

Authors (4)

Debu Panda

Debu Panda, a Senior Manager, Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt, 2009).

Phil Bates

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS. He has more than 25 years of experience implementing large-scale data warehouse solutions. He is passionate about helping customers through their cloud journey and leveraging the power of ML within their data warehouse.

Bhanu Pittampally

Bhanu Pittampally is an Analytics Specialist Solutions Architect at Amazon Web Services. His background is in data and analytics, and he has been in the field for over 16 years. He currently lives in Frisco, TX with his wife Kavitha and daughters Vibha and Medha.

Sumeet Joshi

Sumeet Joshi is an Analytics Specialist Solutions Architect based out of New York. He specializes in building large-scale data warehousing solutions. He has over 17 years of experience in the data warehousing and analytical space.
