Understanding clustering algorithms

One of the simplest and most powerful techniques used in unsupervised learning is grouping similar patterns together through clustering algorithms. Clustering is used to understand a particular aspect of the data that is related to the problem we are trying to solve. Clustering algorithms look for natural groupings in data items; as these groupings are not based on any target or prior assumptions, clustering is classified as an unsupervised learning technique.

The groupings created by various clustering algorithms are based on finding the similarities between data points in the problem space. The best way to determine the similarities between data points varies from problem to problem and depends on the nature of the problem we are dealing with. Let's look at the various methods that can be used to calculate the similarities between data points.

Quantifying similarities

The reliability of the groupings created by clustering algorithms is based on the assumption that we can accurately quantify the similarities or closeness between data points in the problem space. This is done by using various distance measures. The following are three of the most popular methods used to quantify similarities:

  • Euclidean distance measure
  • Manhattan distance measure
  • Cosine distance measure

Let's look at these distance measures in more detail.

Euclidean distance

The distance between two points can be used to quantify their similarity, and distance measures are used extensively in unsupervised machine learning techniques such as clustering. Euclidean distance is the simplest and most commonly used distance measure. It is calculated by measuring the shortest straight-line distance between two data points in multidimensional space. For example, let's consider two points, A(1,1) and B(4,4), in a two-dimensional space, as shown in the following plot:

[Figure: points A(1,1) and B(4,4) plotted in a two-dimensional space]

To calculate the distance between A and B, that is, d(A,B), we can use the following Pythagorean formula:

$d(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$, where $A = (a_1, a_2)$ and $B = (b_1, b_2)$

Note that this calculation is for a two-dimensional problem space. For an n-dimensional problem space, we can calculate the distance between two points A and B as follows:

$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
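As a quick check of this formula, here is a minimal sketch (an illustration, not from the book) that computes the Euclidean distance between A(1,1) and B(4,4) with NumPy:

import numpy as np

# The two example points in a two-dimensional problem space
A = np.array([1, 1])
B = np.array([4, 4])

# Euclidean distance: square root of the sum of squared coordinate differences
print(np.sqrt(np.sum((A - B) ** 2)))  # 4.2426..., i.e., the square root of 18

# np.linalg.norm computes the same value and works for any number of dimensions
print(np.linalg.norm(A - B))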

Manhattan distance

In many situations, the shortest straight-line distance measured by the Euclidean distance will not truly represent the similarity or closeness between two points. For example, if two data points represent locations on a map, then the actual distance from point A to point B using ground transportation, such as a car or taxi, will be longer than the Euclidean distance. For situations such as these, we use Manhattan distance, which measures the distance traveled along axis-aligned paths, much like the route a taxi follows through a grid of city blocks. It is therefore a better reflection of the closeness of two points in the context of a source and a destination that are traveled between in a busy city. The comparison between the Manhattan and Euclidean distance measures is shown in the following plot:

[Figure: line chart comparing the Manhattan and Euclidean distances between two points]

For points $A = (a_1, \ldots, a_n)$ and $B = (b_1, \ldots, b_n)$, the Manhattan distance is the sum of the absolute coordinate differences, $d(A, B) = \sum_{i=1}^{n} |a_i - b_i|$. Note that the Manhattan distance will always be equal to or greater than the corresponding Euclidean distance.
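Continuing the earlier sketch, the following lines (an illustration, not from the book) compute the Manhattan distance for the same two points and confirm that it is never smaller than the Euclidean distance:

import numpy as np

A = np.array([1, 1])
B = np.array([4, 4])

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(A - B))   # 3 + 3 = 6
euclidean = np.linalg.norm(A - B)   # about 4.24

print(manhattan, euclidean, manhattan >= euclidean)  # 6 4.2426... True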

Cosine distance

Euclidean and Manhattan distance measures do not perform well in high-dimensional spaces, where cosine distance more accurately reflects the closeness between two data points. The cosine distance measure is calculated from the cosine of the angle formed by two points when each is connected to a reference point. If the data points are close, this angle will be narrow, irrespective of how many dimensions they have. On the other hand, if they are far apart, the angle will be large:

[Figure: the angle formed at the origin X(0,0) by points A(2,5) and B(4,4)]

Textual data can be considered an extremely high-dimensional space. As the cosine distance measure works very well with high-dimensional spaces, it is a good choice when dealing with textual data. Note that in the preceding figure, the cosine of the angle between A(2,5) and B(4,4) determines the cosine distance; the distance is commonly defined as 1 minus this cosine, so a narrow angle gives a small distance. The reference point for these two vectors is the origin, that is, X(0,0), but in reality any point in the problem space can serve as the reference point; it does not have to be the origin.
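A minimal sketch of this calculation (an illustration, not from the book), using the figure's points and the origin as the reference point:

import numpy as np

A = np.array([2, 5])
B = np.array([4, 4])

# Cosine similarity: cosine of the angle between the two vectors,
# measured at the reference point X(0, 0)
cos_sim = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Cosine distance is commonly defined as 1 minus the cosine similarity
print(1 - cos_sim)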

K-means clustering algorithm

The name of the k-means clustering algorithm comes from the fact that it tries to create a number of clusters, k, calculating the means of the assigned data points to find their closeness. It uses a relatively simple clustering approach, but is still popular because of its scalability and speed. Algorithmically, k-means clustering uses an iterative logic that moves the centers of the clusters until they reflect the most representative data point of the grouping they belong to.

It is important to note that the k-means algorithm lacks one of the very basic functionalities needed for clustering: for a given dataset, it cannot determine the most appropriate number of clusters. The most appropriate number of clusters, k, depends on the number of natural groupings in a particular dataset. The philosophy behind this omission is to keep the algorithm as simple as possible, maximizing its performance.

The logic of k-means clustering

This section describes the logic of the k-means clustering algorithm. Let's look at its stages one by one.

Initialization

In order to group the data points, the k-means algorithm uses a distance measure to find the similarity or closeness between them. Before using the k-means algorithm, the most appropriate distance measure needs to be selected; by default, the Euclidean distance measure is used. Also, if the dataset has outliers, a mechanism needs to be devised to identify and remove them.

The steps of the k-means algorithm

The steps involved in the k-means clustering algorithm are as follows:

Step 1: We choose the number of clusters, k.
Step 2: Among the data points, we randomly choose k points as cluster centers.
Step 3: Based on the selected distance measure, we iteratively compute the distance from each point in the problem space to each of the k cluster centers. Based on the size of the dataset, this may be a time-consuming step; for example, if there are 10,000 points in the dataset and k = 3, this means that 30,000 distances need to be calculated in each iteration.
Step 4: We assign each data point in the problem space to the nearest cluster center.
Step 5: Now each data point in our problem space has an assigned cluster center. But we are not done, as the selection of the initial cluster centers was based on random selection. We need to verify that the current randomly selected cluster centers are actually the centers of their groupings, so each center is recalculated as the mean of its assigned data points, and steps 3 to 5 are repeated until the centers stop shifting; a from-scratch sketch of these steps follows.
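This sketch assumes the Euclidean distance measure; the function and variable names are illustrative and not from the book:

import numpy as np

def k_means(points, k, max_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose k of the data points as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):
        # Step 3: Euclidean distance from every point to every cluster center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step 4: assign each point to its nearest cluster center
        labels = distances.argmin(axis=1)
        # Step 5: recompute each center as the mean of its assigned points
        # (this simple sketch assumes no cluster ends up empty)
        new_centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Default stop condition: the centers no longer shift
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Step 1: choose k, then run the algorithm on a few sample points
points = np.array([[11, 39], [21, 36], [28, 30], [55, 14], [61, 8], [62, 18]], dtype=float)
centers, labels = k_means(points, k=2)
print(centers, labels)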

Stop condition

For the k-means algorithm, the default stop condition is reached when there is no more shifting of cluster centers in step 5. But as with many other algorithms, the k-means algorithm may take a lot of time to converge, especially when processing large datasets in a high-dimensional problem space. Instead of waiting for the algorithm to converge, we can also explicitly define the stop condition as follows:

  • By specifying the maximum execution time:
    • Stop condition: t > t_max, where t is the current execution time and t_max is the maximum execution time we have set for the algorithm.
  • By specifying the maximum iterations:
    • Stop condition: m > m_max, where m is the current iteration and m_max is the maximum number of iterations we have set for the algorithm.
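In scikit-learn's KMeans implementation, the maximum-iterations condition corresponds to the max_iter parameter, and tol sets how small the shift in cluster centers must be to declare convergence (scikit-learn does not expose a wall-clock time limit directly). A small sketch:

from sklearn import cluster

# Stop after at most 100 iterations, or earlier if the centers
# shift by less than the tolerance between iterations
myKmeans = cluster.KMeans(n_clusters=3, max_iter=100, tol=1e-4)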

Coding the k-means algorithm

Let's look at how we can code the k-means algorithm in Python:

  1. First, let's import the packages that we will need to code the k-means algorithm. Note that we are importing the sklearn package for k-means clustering:
from sklearn import cluster
import pandas as pd
import numpy as np
  2. To use k-means clustering, let's create 20 data points in a two-dimensional problem space that we will be using for k-means clustering:
dataset = pd.DataFrame({
    'x': [11, 21, 28, 17, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 62, 70, 72, 10],
    'y': [39, 36, 30, 52, 53, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 18, 7, 24, 10]
})
  3. Let's have two clusters (k = 2) and then create the clusters by calling the fit function:
myKmeans = cluster.KMeans(n_clusters=2)
myKmeans.fit(dataset)
  4. Let's create a variable named centroid that is an array that holds the location of the centers of the identified clusters; scikit-learn exposes these through the fitted model's cluster_centers_ attribute:
centroid = myKmeans.cluster_centers_
print(centroid)
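The fitted model also exposes a labels_ attribute holding the cluster index assigned to each data point. As a hedged sketch (not from the book), one way to visualize the result:

print(myKmeans.labels_)  # cluster index (0 or 1) assigned to each of the 20 points

import matplotlib.pyplot as plt

# Color each point by its assigned cluster and mark the two centers with crosses
plt.scatter(dataset['x'], dataset['y'], c=myKmeans.labels_)
plt.scatter(centroid[:, 0], centroid[:, 1], marker='x', color='red')
plt.show()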

Limitations of k-means clustering

The k-means algorithm is designed to be a simple and fast algorithm. Because of the intentional simplicity in its design, it comes with the following limitations:

  • The biggest limitation of k-means clustering is that the initial number of clusters has to be predetermined.
  • The initial assignment of cluster centers is random. This means that each time the algorithm is run, it may give slightly different clusters (a mitigation is sketched after this list).
  • Each data point is assigned to only one cluster.
  • k-means clustering is sensitive to outliers.
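Regarding the random initial assignment, a common mitigation (not covered in this excerpt) is to run the algorithm several times from different random centers and keep the best result, and to fix the random seed for reproducibility. In scikit-learn this is controlled by the n_init and random_state parameters:

from sklearn import cluster

# n_init=10 runs k-means with 10 different random initializations and keeps
# the best clustering; random_state makes the runs reproducible
myKmeans = cluster.KMeans(n_clusters=2, n_init=10, random_state=42)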

Hierarchical clustering

k-means clustering uses a top-down approach because we start the algorithm from the most important data points, which are the cluster centers. There is an alternative approach to clustering where, instead of starting from the top, we start the algorithm from the bottom. The bottom, in this context, means each individual data point in the problem space. The algorithm keeps grouping similar data points together as it progresses up toward the cluster centers. This alternative, bottom-up approach is used by hierarchical clustering algorithms, and is discussed in this section.

Steps of hierarchical clustering

The following steps are involved in hierarchical clustering:

  1. We create a separate cluster for each data point in our problem space. If our problem space consists of 100 data points, then it will start with 100 clusters.
  2. We merge the clusters that are closest to each other.
  3. We check for the stop condition; if the stop condition is not yet satisfied, then we repeat step 2.

The resulting clustered structure is called a dendrogram. In a dendrogram, the height of the vertical lines indicates how close the items are: the taller the line joining two items, the farther apart they were when merged. This is shown in the following diagram:

[Figure: a dendrogram of hierarchically clustered data points]

Note that the stopping condition is shown as a dotted line in the preceding figure.
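As a hedged sketch of these steps (not from the book), SciPy's hierarchical-clustering utilities can perform the bottom-up merging and draw the dendrogram for a handful of two-dimensional points:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

points = np.array([[11, 39], [21, 36], [28, 30], [55, 14], [61, 8], [62, 18]])

# linkage performs the bottom-up merging; 'ward' merges the pair of clusters
# that least increases the total within-cluster variance
Z = linkage(points, method='ward')

# dendrogram draws the merge tree; the vertical heights show merge distances
dendrogram(Z)
plt.show()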
