You're reading from 50 Algorithms Every Programmer Should Know - Second Edition

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781803247762
Edition: 2nd Edition
Author: Imran Ahmad
Imran Ahmad has been involved in cutting-edge research on algorithms and machine learning for many years. He completed his PhD in 2010, proposing a new linear programming-based algorithm for optimally assigning resources in large-scale cloud computing environments. In 2017, Imran developed a real-time analytics framework named StreamSensing, and he has since authored multiple research papers that use StreamSensing to process multimedia data for various machine learning algorithms. Imran currently works as a data scientist at the Advanced Analytics Solution Center (A2SC) at the Canadian Federal Government, where he applies machine learning algorithms to critical use cases. He is a visiting professor at Carleton University, Ottawa, and has also been teaching for Google and Learning Tree for the last few years.

Unsupervised Machine Learning Algorithms

This chapter is about unsupervised machine learning algorithms. By the end of this chapter, you should understand how unsupervised learning, with its basic algorithms and methodologies, can be effectively applied to solve real-world problems.

We will cover the following topics:

  • Introducing unsupervised learning
  • Understanding clustering algorithms
  • Dimensionality reduction
  • Association rules mining

Introducing unsupervised learning

If data is not generated randomly, it tends to exhibit patterns or relationships among its elements within a multi-dimensional space. Unsupervised learning is the process of detecting these patterns in a dataset and using them to impart structure to it. Identifying and extracting these patterns leads to a deeper understanding and a better representation of the raw data.

This concept is shown in Figure 6.1:

Figure 6.1: Using unsupervised machine learning to extract patterns from unlabeled raw data

In the upcoming discussion, we will navigate through the CRISP-DM lifecycle, a popular model for the machine learning process. Within this context, we’ll pinpoint where unsupervised learning...

Section 2: Machine Learning Algorithms

This section explains the different kinds of machine learning algorithms in detail, including unsupervised machine learning algorithms, traditional supervised learning algorithms, and neural network algorithms, and also introduces algorithms for natural language processing. The section ends with an introduction to recommendation engines. The chapters included in this section are:

  • Chapter 6, Unsupervised Machine Learning Algorithms
  • Chapter 7, Traditional Supervised Learning Algorithms
  • Chapter 8, Neural Network Algorithms
  • Chapter 9, Algorithms for Natural Language Processing
  • Chapter 10, Recommendation Engines

Steps of hierarchical clustering

The following steps are involved in hierarchical clustering:

  1. We create a separate cluster for each data point in our problem space. If our problem space consists of 100 data points, then we start with 100 clusters.
  2. We merge the two clusters that are closest to each other.
  3. We check for the stop condition; if it is not yet satisfied, we repeat step 2.

The resulting clustered structure is called a dendrogram.

In a dendrogram, the height of the vertical lines determines how close the items are, as shown in the following diagram:

Figure 6.8: Hierarchical clustering

Note that the stop condition is shown as a dotted line in Figure 6.8.
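A dendrogram like the one in Figure 6.8 can be built programmatically. The following is a minimal sketch using SciPy's hierarchical clustering utilities (the five sample points are invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five illustrative 2-D points: two near (1, 1), two near (5, 5), one outlier
points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.0, 5.5], [9.0, 9.0]])

# Ward linkage repeatedly merges the two closest clusters (steps 1-3 above)
Z = linkage(points, method='ward')

# Each of the n-1 rows of Z records one merge:
# (cluster_a, cluster_b, merge_distance, resulting_cluster_size)
print(Z.shape)  # (4, 4)
```

The merge distances in the third column correspond to the heights of the vertical lines in the dendrogram, and `scipy.cluster.hierarchy.dendrogram(Z)` can plot them directly.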

Coding a hierarchical clustering algorithm

Let’s learn how we can code a hierarchical algorithm in Python:

  1. We will first import AgglomerativeClustering from the sklearn.cluster library, along with the pandas and numpy packages:
    from sklearn.cluster import AgglomerativeClustering
    import pandas as pd
    import numpy as np
    
  2. Then we will create 20 data points in a two-dimensional problem space:
    dataset = pd.DataFrame({
        'x': [11, 11, 20, 12, 16, 33, 24, 14, 45, 52, 51, 52, 55, 53, 55, 61, 62, 70, 72, 10],
        'y': [39, 36, 30, 52, 53, 46, 55, 59, 12, 15, 16, 18, 11, 23, 14, 8, 18, 7, 24, 70]
    })
    
  3. Then we create the hierarchical cluster by specifying the hyperparameters. Note that a hyperparameter refers to a configuration parameter of a machine learning model that is set before the training process and influences the model’s behavior and performance. We use the fit_predict function to actually process...
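Putting the step above into runnable form, the following sketch chooses two clusters with Ward linkage; both hyperparameter values are illustrative choices, not the only valid ones:

```python
from sklearn.cluster import AgglomerativeClustering
import pandas as pd

# The 20 data points from step 2
dataset = pd.DataFrame({
    'x': [11, 11, 20, 12, 16, 33, 24, 14, 45, 52, 51, 52, 55, 53, 55, 61, 62, 70, 72, 10],
    'y': [39, 36, 30, 52, 53, 46, 55, 59, 12, 15, 16, 18, 11, 23, 14, 8, 18, 7, 24, 70]
})

# n_clusters=2 and ward linkage are illustrative hyperparameters;
# the distance metric defaults to Euclidean
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')

# fit_predict runs the clustering and returns one cluster label per point
labels = cluster.fit_predict(dataset)
print(labels)
```

Each entry of `labels` is 0 or 1, indicating which of the two clusters the corresponding data point was assigned to.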

Understanding DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is an unsupervised learning technique that performs clustering based on the density of the points. The basic idea is that if we group the data points lying in crowded, high-density regions together, we can achieve meaningful clustering.

This approach to clustering has two important implications:

  • Using this idea, the algorithm is likely to cluster together points that lie close together, regardless of their shape or pattern. This methodology helps in creating clusters of arbitrary shapes. By “shape,” we refer to the pattern or distribution of data points in a multi-dimensional space. This capability is advantageous because real-world data is often complex and non-linear, and the ability to create clusters of arbitrary shapes enables a more accurate representation and understanding of such data.
  • Unlike the k-means algorithm, we do not need to...

Creating clusters using DBSCAN in Python

First, we will import the necessary functions from the sklearn library:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

Let’s employ DBSCAN to tackle a slightly more complex clustering problem, one that involves structures known as “half-moons.” In this context, “half-moons” refer to two sets of data points that are shaped like crescents, with each moon representing a unique cluster. Such datasets pose a challenge because the clusters are not linearly separable, meaning a straight line cannot easily divide the different groups.

This is where the concept of “nonlinear class boundaries” comes into play. In contrast to linear class boundaries, which can be represented by a straight line, nonlinear class boundaries are more complex, often necessitating curved lines or multidimensional surfaces to accurately segregate different classes or clusters.

To generate...
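A minimal sketch of generating half-moons and clustering them with DBSCAN follows; the sample count, noise level, `eps`, and `min_samples` values are all illustrative choices:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons with a small amount of Gaussian noise
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the minimum number of
# neighbors for a point to be considered a core (dense) point.
# Points in sparse regions are labeled -1 (noise).
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Count the dense clusters found, excluding the noise label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

Because DBSCAN follows chains of dense neighbors, each crescent is recovered as one cluster even though no straight line separates the two groups.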

Evaluating the clusters

The objective of good-quality clustering is that data points belonging to separate clusters should be differentiable. This implies the following:

  • The data points that belong to the same cluster should be as similar as possible.
  • Data points that belong to separate clusters should be as different as possible.

Human intuition can be used to evaluate the clustering results by visualizing the clusters, but there are mathematical methods that can quantify the quality of the clusters. They not only measure the tightness of each cluster (cohesion) and the separation between different clusters but also offer a numerical, hence objective, way to assess the quality of clustering. Silhouette analysis is one such technique that compares the tightness and separation in the clusters created by the k-means algorithm. It’s a metric that quantifies the degree of cohesion and separation in clusters. While this technique has been mentioned...
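The silhouette score can be computed with scikit-learn's `silhouette_score` function. The following is a hedged sketch; the synthetic blob dataset and the choice of k = 3 are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (parameters are illustrative)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

# Cluster with k-means, then score the resulting assignment
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# The silhouette score ranges from -1 (poor) to +1 (dense, well-separated
# clusters); it averages, over all points, how much closer each point is
# to its own cluster than to the nearest other cluster
score = silhouette_score(X, labels)
print(round(score, 3))
```

In practice, the score is often computed for several candidate values of k, and the k with the highest silhouette score is preferred.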

Dimensionality reduction

Each feature in our data corresponds to a dimension in our problem space. Minimizing the number of features to make our problem space simpler is called dimensionality reduction. It can be done in one of the following two ways:

  • Feature selection: Selecting a set of features that are important in the context of the problem we are trying to solve
  • Feature aggregation: Combining two or more features to reduce dimensions using one of the following algorithms:
    • Principal component analysis (PCA): A linear unsupervised ML algorithm
    • Linear discriminant analysis (LDA): A linear supervised ML algorithm
    • Kernel PCA (KPCA): A nonlinear unsupervised ML algorithm

Let’s look at one of the popular dimensionality reduction algorithms, PCA, in more detail.

Principal component analysis

PCA is a method in unsupervised machine learning that is typically employed to reduce the dimensionality of datasets through a process known as linear transformation...
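A minimal PCA sketch using scikit-learn follows; the synthetic dataset (in which the third feature is nearly a linear mix of the first two) and the component count are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative 3-D data whose third feature is a noisy mix of the first two,
# so the data is effectively two-dimensional
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.05, size=100)
X = np.column_stack([x1, x2, x3])

# Linearly transform the data onto the two directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (100, 2)
# Fraction of the original variance retained by the two components
print(pca.explained_variance_ratio_.sum())
```

Because the third feature is almost redundant, the two principal components retain nearly all of the variance, which is exactly the situation in which PCA-based dimensionality reduction loses little information.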

Association rules mining

An association rule mathematically describes the relationship between items involved in various transactions. It does this by investigating the relationship between two itemsets in the form X ⇒ Y, where X ⊆ I and Y ⊆ I, with I being the set of all items. In addition, X and Y are non-overlapping itemsets, which means that X ∩ Y = ∅.

An association rule could be described in the following form:

{helmets, balls} ⇒ {bike}

Here, {helmets, balls} is X, and {bike} is Y.
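Two standard metrics in association rules mining, support and confidence, quantify the strength of such a rule. The following sketch computes both for the rule above over a set of toy transactions invented for illustration:

```python
# Toy transactions, each a set of purchased items (invented for illustration)
transactions = [
    {"helmets", "balls", "bike"},
    {"helmets", "balls"},
    {"helmets", "balls", "bike", "gloves"},
    {"bike"},
    {"balls", "gloves"},
]

X = {"helmets", "balls"}
Y = {"bike"}

# Support: fraction of all transactions that contain both X and Y
support = sum(1 for t in transactions if X | Y <= t) / len(transactions)

# Confidence: among transactions containing X, the fraction also containing Y
containing_x = [t for t in transactions if X <= t]
confidence = sum(1 for t in containing_x if Y <= t) / len(containing_x)

print(support)                # 0.4 (2 of 5 transactions)
print(round(confidence, 3))   # 0.667 (2 of the 3 transactions with X)
```

Rules are typically filtered by minimum support and confidence thresholds before being examined further.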

Let us look into the different types of association rules.

Types of rules

Running associative analysis algorithms will typically result in the generation of a large number of rules from a transaction dataset. Most of them are useless. To pick rules that can result in useful information, we can classify them as one of the following three types:

  • Trivial
  • Inexplicable
  • Actionable

Let’s look at each of these types in more detail.

Trivial rules

Among the large numbers of rules generated...

Summary

In this chapter, we looked at various unsupervised machine learning techniques. We looked at the circumstances in which it is a good idea to try to reduce the dimensionality of the problem we are trying to solve and the different methods of doing this. We also studied the practical examples where unsupervised machine learning techniques can be very helpful, including market basket analysis.

In the next chapter, we will look at the various supervised learning techniques. We will start with linear regression and then we will look at more sophisticated supervised machine learning techniques, such as decision-tree-based algorithms, SVM, and XGBoost. We will also study the Naive Bayes algorithm, which is best suited for unstructured textual data.

Learn more on Discord

To join the Discord community for this book – where you can share feedback, ask questions to the author, and learn about new releases – follow the QR code below:

https://packt.link/WHLel...

