You're reading from 50 Algorithms Every Programmer Should Know - Second Edition

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781803247762
Edition: 2nd Edition
Author: Imran Ahmad
Imran Ahmad has been involved in cutting-edge research on algorithms and machine learning for many years. He completed his PhD in 2010, proposing a new linear programming-based algorithm for optimally assigning resources in large-scale cloud computing environments. In 2017, Imran developed a real-time analytics framework named StreamSensing, and he has since authored multiple research papers that use StreamSensing to process multimedia data for various machine learning algorithms. Imran currently works as a data scientist at the Advanced Analytics Solution Center (A2SC) at the Canadian Federal Government, where he applies machine learning algorithms to critical use cases. He is a visiting professor at Carleton University, Ottawa, and has also been teaching for Google and Learning Tree for the last few years.

Unsupervised Machine Learning Algorithms

This chapter is about unsupervised machine learning algorithms. By the end of this chapter, you should understand how unsupervised learning, with its basic algorithms and methodologies, can be effectively applied to solve real-world problems.

We will cover the following topics:

  • Introducing unsupervised learning
  • Understanding clustering algorithms
  • Dimensionality reduction
  • Association rules mining

Introducing unsupervised learning

If data is not generated randomly, it tends to exhibit patterns or relationships among its elements within a multi-dimensional space. Unsupervised learning is the process of detecting these patterns in a dataset and using them to impart structure to it. Identifying and extracting these patterns leads to a deeper understanding and a better representation of the raw data.

This concept is shown in Figure 6.1:

Figure 6.1: Using unsupervised machine learning to extract patterns from unlabeled raw data

In the upcoming discussion, we will navigate through the CRISP-DM lifecycle, a popular model for the machine learning process. Within this context, we’ll pinpoint where unsupervised learning...

Section 2: Machine Learning Algorithms

This section explains the different kinds of machine learning algorithms in detail, including unsupervised machine learning algorithms, traditional supervised learning algorithms, and neural network algorithms, and also introduces algorithms for natural language processing. The section ends with an introduction to recommendation engines. The chapters included in this section are:

  • Chapter 6, Unsupervised Machine Learning Algorithms
  • Chapter 7, Traditional Supervised Learning Algorithms
  • Chapter 8, Neural Network Algorithms
  • Chapter 9, Algorithms for Natural Language Processing
  • Chapter 10, Recommendation Engines

Steps of hierarchical clustering

The following steps are involved in hierarchical clustering:

  1. We create a separate cluster for each data point in our problem space. If our problem space consists of 100 data points, then we start with 100 clusters.
  2. We merge the two clusters that are closest to each other.
  3. We check for the stop condition; if it is not yet satisfied, we repeat step 2.

The resulting clustered structure is called a dendrogram.

In a dendrogram, the height of the vertical lines determines how close the items are, as shown in the following diagram:

Figure 6.8: Hierarchical clustering

Note that the stop condition is shown as a dotted line in Figure 6.8.
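A dendrogram like the one in Figure 6.8 can be built programmatically. The following is a minimal sketch using SciPy's hierarchical clustering utilities (the five sample points are invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five illustrative 2-D points: two near (1, 1), two near (5, 5), one outlier
points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.0, 5.5], [9.0, 9.0]])

# Ward linkage repeatedly merges the two closest clusters (steps 1-3 above)
Z = linkage(points, method='ward')

# Each of the n-1 rows of Z records one merge:
# (cluster_a, cluster_b, merge_distance, resulting_cluster_size)
print(Z.shape)  # (4, 4)
```

The merge distances in the third column correspond to the heights of the vertical lines in the dendrogram, and `scipy.cluster.hierarchy.dendrogram(Z)` can plot them directly.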

Coding a hierarchical clustering algorithm

Let’s learn how we can code a hierarchical algorithm in Python:

  1. We will first import AgglomerativeClustering from the sklearn.cluster library, along with the pandas and numpy packages:
    from sklearn.cluster import AgglomerativeClustering
    import pandas as pd
    import numpy as np
    
  2. Then we will create 20 data points in a two-dimensional problem space:
    dataset = pd.DataFrame({
        'x': [11, 11, 20, 12, 16, 33, 24, 14, 45, 52, 51, 52, 55, 53, 55, 61, 62, 70, 72, 10],
        'y': [39, 36, 30, 52, 53, 46, 55, 59, 12, 15, 16, 18, 11, 23, 14, 8, 18, 7, 24, 70]
    })
    
  3. Then we create the hierarchical cluster by specifying the hyperparameters. Note that a hyperparameter refers to a configuration parameter of a machine learning model that is set before the training process and influences the model’s behavior and performance. We use the fit_predict function to actually process...
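Putting the step above into runnable form, the following sketch chooses two clusters with Ward linkage; both hyperparameter values are illustrative choices, not the only valid ones:

```python
from sklearn.cluster import AgglomerativeClustering
import pandas as pd

# The 20 data points from step 2
dataset = pd.DataFrame({
    'x': [11, 11, 20, 12, 16, 33, 24, 14, 45, 52, 51, 52, 55, 53, 55, 61, 62, 70, 72, 10],
    'y': [39, 36, 30, 52, 53, 46, 55, 59, 12, 15, 16, 18, 11, 23, 14, 8, 18, 7, 24, 70]
})

# n_clusters=2 and ward linkage are illustrative hyperparameters;
# the distance metric defaults to Euclidean
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')

# fit_predict runs the clustering and returns one cluster label per point
labels = cluster.fit_predict(dataset)
print(labels)
```

Each entry of `labels` is 0 or 1, indicating which of the two clusters the corresponding data point was assigned to.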

Understanding DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is an unsupervised learning technique that performs clustering based on the density of the points. The basic idea is that if we group the data points lying in crowded, high-density regions together, we can achieve meaningful clustering.

This approach to clustering has two important implications:

  • Using this idea, the algorithm is likely to cluster together points that lie close together, regardless of their shape or pattern. This methodology helps in creating clusters of arbitrary shapes. By “shape,” we refer to the pattern or distribution of data points in a multi-dimensional space. This capability is advantageous because real-world data is often complex and non-linear, and the ability to create clusters of arbitrary shapes enables a more accurate representation and understanding of such data.
  • Unlike the k-means algorithm, we do not need to...

Creating clusters using DBSCAN in Python

First, we will import the necessary functions from the sklearn library:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

Let’s employ DBSCAN to tackle a slightly more complex clustering problem, one that involves structures known as “half-moons.” In this context, “half-moons” refer to two sets of data points that are shaped like crescents, with each moon representing a unique cluster. Such datasets pose a challenge because the clusters are not linearly separable, meaning a straight line cannot easily divide the different groups.

This is where the concept of “nonlinear class boundaries” comes into play. In contrast to linear class boundaries, which can be represented by a straight line, nonlinear class boundaries are more complex, often necessitating curved lines or multidimensional surfaces to accurately segregate different classes or clusters.

To generate...
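A minimal sketch of generating half-moons and clustering them with DBSCAN follows; the sample count, noise level, `eps`, and `min_samples` values are all illustrative choices:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons with a small amount of Gaussian noise
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the minimum number of
# neighbors for a point to be considered a core (dense) point.
# Points in sparse regions are labeled -1 (noise).
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Count the dense clusters found, excluding the noise label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

Because DBSCAN follows chains of dense neighbors, each crescent is recovered as one cluster even though no straight line separates the two groups.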

Evaluating the clusters

The objective of good-quality clustering is that data points belonging to separate clusters should be differentiable. This implies the following:

  • The data points that belong to the same cluster should be as similar as possible.
  • Data points that belong to separate clusters should be as different as possible.

Human intuition can be used to evaluate the clustering results by visualizing the clusters, but there are mathematical methods that can quantify the quality of the clusters. They not only measure the tightness of each cluster (cohesion) and the separation between different clusters but also offer a numerical, hence objective, way to assess the quality of clustering. Silhouette analysis is one such technique that compares the tightness and separation in the clusters created by the k-means algorithm. It’s a metric that quantifies the degree of cohesion and separation in clusters. While this technique has been mentioned...
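The silhouette score can be computed with scikit-learn's `silhouette_score` function. The following is a hedged sketch; the synthetic blob dataset and the choice of k = 3 are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (parameters are illustrative)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

# Cluster with k-means, then score the resulting assignment
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# The silhouette score ranges from -1 (poor) to +1 (dense, well-separated
# clusters); it averages, over all points, how much closer each point is
# to its own cluster than to the nearest other cluster
score = silhouette_score(X, labels)
print(round(score, 3))
```

In practice, the score is often computed for several candidate values of k, and the k with the highest silhouette score is preferred.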

Dimensionality reduction

Each feature in our data corresponds to a dimension in our problem space. Minimizing the number of features to make our problem space simpler is called dimensionality reduction. It can be done in one of the following two ways:

  • Feature selection: Selecting a set of features that are important in the context of the problem we are trying to solve
  • Feature aggregation: Combining two or more features to reduce dimensions using one of the following algorithms:
    • Principal component analysis (PCA): A linear unsupervised ML algorithm
    • Linear discriminant analysis (LDA): A linear supervised ML algorithm
    • Kernel PCA (KPCA): A nonlinear unsupervised ML algorithm

Let’s look at one of the popular dimensionality reduction algorithms, PCA, in more detail.

Principal component analysis

PCA is a method in unsupervised machine learning that is typically employed to reduce the dimensionality of datasets through a process known as linear transformation...
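A minimal PCA sketch using scikit-learn follows; the synthetic dataset (in which the third feature is nearly a linear mix of the first two) and the component count are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative 3-D data whose third feature is a noisy mix of the first two,
# so the data is effectively two-dimensional
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.05, size=100)
X = np.column_stack([x1, x2, x3])

# Linearly transform the data onto the two directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (100, 2)
# Fraction of the original variance retained by the two components
print(pca.explained_variance_ratio_.sum())
```

Because the third feature is almost redundant, the two principal components retain nearly all of the variance, which is exactly the situation in which PCA-based dimensionality reduction loses little information.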

Association rules mining

An association rule mathematically describes the relationship between items involved in various transactions. It does this by investigating the relationship between two itemsets in the form X ⇒ Y, where X ⊆ I and Y ⊆ I, with I being the set of all items. In addition, X and Y are non-overlapping itemsets, which means that X ∩ Y = ∅.

An association rule could be described in the following form:

{helmets, balls} ⇒ {bike}

Here, {helmets, balls} is X, and {bike} is Y.
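Two standard metrics in association rules mining, support and confidence, quantify the strength of such a rule. The following sketch computes both for the rule above over a set of toy transactions invented for illustration:

```python
# Toy transactions, each a set of purchased items (invented for illustration)
transactions = [
    {"helmets", "balls", "bike"},
    {"helmets", "balls"},
    {"helmets", "balls", "bike", "gloves"},
    {"bike"},
    {"balls", "gloves"},
]

X = {"helmets", "balls"}
Y = {"bike"}

# Support: fraction of all transactions that contain both X and Y
support = sum(1 for t in transactions if X | Y <= t) / len(transactions)

# Confidence: among transactions containing X, the fraction also containing Y
containing_x = [t for t in transactions if X <= t]
confidence = sum(1 for t in containing_x if Y <= t) / len(containing_x)

print(support)                # 0.4 (2 of 5 transactions)
print(round(confidence, 3))   # 0.667 (2 of the 3 transactions with X)
```

Rules are typically filtered by minimum support and confidence thresholds before being examined further.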

Let us look into the different types of association rules.

Types of rules

Running associative analysis algorithms will typically result in the generation of a large number of rules from a transaction dataset. Most of them are useless. To pick rules that can result in useful information, we can classify them as one of the following three types:

  • Trivial
  • Inexplicable
  • Actionable

Let’s look at each of these types in more detail.

Trivial rules

Among the large numbers of rules generated...

Summary

In this chapter, we looked at various unsupervised machine learning techniques. We looked at the circumstances in which it is a good idea to try to reduce the dimensionality of the problem we are trying to solve and the different methods of doing this. We also studied the practical examples where unsupervised machine learning techniques can be very helpful, including market basket analysis.

In the next chapter, we will look at the various supervised learning techniques. We will start with linear regression and then we will look at more sophisticated supervised machine learning techniques, such as decision-tree-based algorithms, SVM, and XGBoost. We will also study the Naive Bayes algorithm, which is best suited for unstructured textual data.

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases – follow the QR code below:

https://packt.link/WHLel...

