5. Artificial Intelligence: Clustering

Overview

This chapter will introduce you to the fundamentals of clustering, an unsupervised learning approach in contrast with the supervised learning approaches seen in the previous chapters. You will be implementing different types of clustering, including flat clustering with the k-means algorithm and hierarchical clustering with the mean shift algorithm and the agglomerative hierarchical model. You will also learn how to evaluate the performance of your clustering model using intrinsic and extrinsic approaches. By the end of this chapter, you will be able to analyze data using clustering and apply this skill to solve challenges across a variety of fields.

Introduction

In the previous chapter, you were introduced to decision trees and their applications in classification. You were also introduced to regression in Chapter 2, An Introduction to Regression. Both regression and classification are part of the supervised learning approach. In this chapter, however, we will be looking at the unsupervised learning approach, where we deal with datasets that don't have any labels (outputs). It is up to the machine to tell us what the labels should be, based on a set of parameters that we define. We will perform this unsupervised learning using clustering algorithms.

We will use clustering to analyze data in order to find patterns and create groups. Clustering can be used for many purposes:

  • Market segmentation detects the best stocks in the market that you should be focusing on.
  • Customer segmentation detects customer cohorts based on their consumption patterns so that products can be recommended to them more effectively.
  • ...

Defining the Clustering Problem

We shall define the clustering problem as the task of finding similarities between our data points. For instance, suppose we have a dataset that consists of points; clustering helps us understand this structure by describing how these points are distributed.

Let's look at an example of data points in a two-dimensional space in Figure 5.1:

Figure 5.1: Data points in a two-dimensional space

Now, have a look at Figure 5.2. It is evident that there are three clusters:

Figure 5.2: Three clusters formed using the data points in a two-dimensional space

The three clusters were easy to detect because the points within each cluster lie close to one another; this is exactly what clustering does, grouping together data points that are close to each other. You may also have noticed that the data points M1, O1, and N1 do not belong to any cluster; these are the outlier points. The clustering algorithm you build should be prepared...

Clustering Approaches

There are two types of clustering:

  • Flat
  • Hierarchical

In flat clustering, we specify the number of clusters we would like the machine to find. One example of flat clustering is the k-means algorithm, where k specifies the number of clusters we would like the algorithm to use.

In hierarchical clustering, however, the machine learning algorithm itself finds out the number of clusters that are needed.

Hierarchical clustering also has two approaches:

  • Agglomerative or bottom-up hierarchical clustering treats each point as a cluster to begin with. Then, the closest clusters are grouped together, and the grouping is repeated until we reach a single cluster containing every data point (a minimal code sketch follows this list).
  • Divisive or top-down hierarchical clustering treats data points as if they were all in one single cluster at the start. Then the cluster is divided into smaller clusters by choosing the furthest data points. The splitting is repeated until each data point becomes...
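
To make the agglomerative (bottom-up) approach concrete, here is a minimal sketch, assuming scikit-learn's AgglomerativeClustering on synthetic data. The make_blobs dataset and the distance_threshold value are illustrative assumptions, not values from this chapter; setting n_clusters=None together with a distance_threshold lets the merging stop on its own rather than at a fixed number of clusters.

```python
# A minimal sketch of agglomerative clustering with scikit-learn.
# The dataset and the distance_threshold value are illustrative assumptions.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data with three natural groups
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# With n_clusters=None, merging continues until the closest clusters are
# farther apart than distance_threshold, so the number of clusters is not
# fixed up front.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0,
                                linkage="ward")
labels = model.fit_predict(X)

print(model.n_clusters_)  # number of clusters found for this threshold
print(labels[:10])        # cluster assigned to the first ten points
```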

The K-Means Algorithm

The k-means algorithm is a flat clustering algorithm, as mentioned previously. It works as follows:

  • Set the value of k.
  • Choose k data points from the dataset that are the initial centers of the individual clusters.
  • Calculate the distance from each data point to the chosen center points and group each point in the cluster whose initial center is the closest to the data point.
  • Once all the points are in one of the k clusters, calculate the center point of each cluster. This center point does not have to be an existing data point in the dataset; it is simply an average.
  • Repeat this process of assigning each data point to the cluster whose center is closest to it. Repetition continues until the center points no longer move (a from-scratch sketch of these steps follows this list).
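
To make these steps concrete, here is a minimal from-scratch sketch in NumPy that follows the list above. It is an illustrative implementation under simplifying assumptions (Euclidean distance, random initial centers drawn from the data), not the exact code used later in this chapter.

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-4, seed=0):
    """Illustrative k-means following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Choose k data points from the dataset as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every center; assign each point
        # to the cluster whose center is closest
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each center as the average of its assigned points
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Stop once the centers no longer move (within a small tolerance)
        if np.allclose(new_centers, centers, atol=tol):
            break
        centers = new_centers
    return labels, centers
```

Note that the recomputed centers are averages, so they generally do not coincide with any actual data point in the dataset.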

To ensure that the k-means algorithm terminates, we need the following:

  • A threshold value below which center movement is treated as negligible, at which point the algorithm terminates (a minimal scikit-learn sketch follows this list)
  • A maximum number of repetitions of shifting...
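
In practice, a library implementation already exposes both termination conditions. Below is a minimal sketch, assuming scikit-learn's KMeans on a synthetic dataset: max_iter caps the number of repetitions of shifting the centers and tol is the movement threshold at which the algorithm terminates. The dataset and the choice of k=3 are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data with three natural groups (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k=3 clusters; max_iter and tol are the two termination conditions above
model = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(labels[:10])             # cluster index assigned to the first ten points
print(model.cluster_centers_)  # the final (averaged) center of each cluster
```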

The Mean Shift Algorithm

Mean shift is a hierarchical clustering algorithm that assigns data points to a cluster by calculating a cluster's center and moving it towards the mode at each iteration, the mode being the area with the most data points. At the first iteration, a random point is chosen as the cluster's center, and the algorithm calculates the mean of all nearby data points within a certain radius; this mean becomes the new cluster's center. Each following iteration repeats the same calculation, so the cluster's center moves closer and closer to where most of the data points are. The algorithm stops when shifting the center can no longer bring more data points within its radius. Once the algorithm has stopped, each data point is assigned to a cluster.

The mean shift algorithm will also determine the number of clusters needed, in contrast...
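
Here is a minimal sketch of mean shift, assuming scikit-learn's MeanShift and estimate_bandwidth helpers on synthetic data; the quantile value and the dataset are illustrative assumptions. The bandwidth plays the role of the radius within which nearby points are averaged, and the number of clusters is discovered rather than specified.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data with three natural groups (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# The bandwidth acts as the radius used when averaging nearby points
bandwidth = estimate_bandwidth(X, quantile=0.2)

model = MeanShift(bandwidth=bandwidth)
model.fit(X)

print(len(model.cluster_centers_))  # number of clusters found by the algorithm
print(model.labels_[:10])           # cluster assigned to the first ten points
```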

Clustering Performance Evaluation

Unlike supervised learning, where we always have the labels to evaluate our predictions with, unsupervised learning is a bit more complex as we do not usually have labels. In order to evaluate a clustering model, two approaches can be taken depending on whether the label data is available or not:

  • The first approach is the extrinsic method, which requires the existence of label data. This means that, in the absence of label data, human intervention is required in order to label the data, or at least a subset of it.
  • The other approach is the intrinsic method. In general, the extrinsic approach tries to assign a score to the clustering given the label data, whereas the intrinsic approach evaluates the clustering by examining how well the clusters are separated and how compact they are (a minimal sketch of both follows the note below).

    Note

    We will skip the mathematical explanations as they are quite complicated.

    You can find more mathematical details on the sklearn website at this URL: https://scikit...
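
As a concrete illustration of the two approaches, here is a minimal sketch assuming scikit-learn's metrics on synthetic data: adjusted_rand_score is one example of an extrinsic score (it needs the true labels), while silhouette_score is one example of an intrinsic score (it only needs the data and the predicted clusters). The dataset and the clustering model are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data where the true labels happen to be known (illustrative)
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Extrinsic: compares the clustering against the label data
print(adjusted_rand_score(y_true, labels))

# Intrinsic: measures separation and compactness, no labels needed
print(silhouette_score(X, labels))
```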

Summary

In this chapter, we learned the basics of how clustering works. Clustering is a form of unsupervised learning where the features are given, but not the labels. It is the goal of the clustering algorithms to find the labels based on the similarity of the data points.

We also learned that there are two types of clustering, flat and hierarchical, with the first type requiring us to specify the number of clusters to find, whereas the second type finds the optimal number of clusters itself.

The k-means algorithm is an example of flat clustering, whereas mean shift and agglomerative hierarchical clustering are examples of hierarchical clustering algorithms.

We also learned about the various scores used to evaluate the performance of a clustering model, either with the labels (the extrinsic approach) or without them (the intrinsic approach).

In Chapter 6, Neural Networks and Deep Learning, you will be introduced to a field that has become popular in this decade due to the explosion...
