
You're reading from  MATLAB for Machine Learning - Second Edition

Product type: Book
Published in: Jan 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781835087695
Edition: 2nd Edition
Author: Giuseppe Ciaburro

Giuseppe Ciaburro holds a PhD and two master's degrees. He works at the Built Environment Control Laboratory - Università degli Studi della Campania "Luigi Vanvitelli". He has over 25 years of work experience in programming, first in the field of combustion and then in acoustics and noise control. His core programming knowledge is in MATLAB, Python and R. As an expert in AI applications to acoustics and noise control problems, Giuseppe has wide experience in researching and teaching. He has several publications to his credit: monographs, scientific journals, and thematic conferences. He was recently included in the world's top 2% scientists list by Stanford University (2022).

Clustering Analysis and Dimensionality Reduction

Clustering techniques aim to uncover concealed patterns or groupings within a dataset. These algorithms detect groupings without relying on any predefined labels; instead, they form clusters based on the similarity between elements. Dimensionality reduction, on the other hand, transforms a dataset with numerous variables into one with fewer dimensions while preserving the relevant information. Feature selection methods attempt to identify a subset of the original variables, while feature extraction reduces data dimensionality by transforming it into new features. This chapter shows us how to divide data into clusters, or groupings of similar items. We’ll also learn how to select the features that best represent a set of data.

In this chapter, we will cover the following main topics:

  • Understanding clustering – basic concepts and methods
  • Understanding hierarchical clustering
  • Partitioning-based clustering algorithms with MATLAB
  • Grouping data using similarity measures
  • Discovering dimensionality reduction techniques
  • Feature selection and feature extraction using MATLAB

Technical requirements

In this chapter, we will introduce basic concepts relating to clustering and dimensionality reduction. To understand these topics, a basic knowledge of algebra and mathematical modeling is needed. You will also need a working knowledge of the MATLAB environment.

To work with the MATLAB code in this chapter, you need the following files (available on GitHub at https://github.com/PacktPublishing/MATLAB-for-Machine-Learning-second-edition):

  • Minerals.xls
  • PeripheralLocations.xls
  • YachtHydrodynamics.xlsx
  • SeedsDataset.xlsx

Understanding clustering – basic concepts and methods

Clustering is a fundamental concept in data analysis, aiming to identify meaningful groupings or patterns within a dataset. It partitions data points into distinct clusters based on their similarity or proximity to one another. In both clustering and classification, our goal is to discover the underlying rules that enable us to assign observations to the correct class. However, clustering differs from classification in that it must also identify a meaningful subdivision into classes. In classification, we benefit from the target variable, which provides the class information in the training set. In contrast, clustering lacks such additional information, so the classes must be deduced by analyzing the spatial distribution of the data. Dense areas in the data correspond to groups of similar observations. If we can identify observations that are similar to each other but distinct from those in...
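The notion of proximity can be made concrete with a pairwise distance matrix: small entries mark dense areas where similar observations gather. A minimal sketch using MATLAB's `pdist` and `squareform` functions (Statistics and Machine Learning Toolbox), on hypothetical data invented for illustration:

```matlab
% Hypothetical data: 5 observations with 2 features each.
X = [1.0 2.0; 1.1 2.1; 5.0 8.0; 5.2 7.9; 9.0 1.0];

D = pdist(X);           % condensed vector of pairwise Euclidean distances
Dmat = squareform(D);   % symmetric 5-by-5 distance matrix

% Small off-diagonal entries reveal two tight groups (rows 1-2 and
% rows 3-4), while row 5 lies far from both.
disp(Dmat)
```

Other metrics can be substituted via `pdist(X, 'cityblock')`, `'correlation'`, and so on, which matters later when we compare clustering methods that accept non-Euclidean distances.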

Understanding hierarchical clustering

Hierarchical clustering is a method of clustering that creates a hierarchy or tree-like structure of clusters. It iteratively merges or splits clusters based on the similarity or dissimilarity between data points. The resulting structure is often represented as a dendrogram, which visualizes the relationships and similarities among the data points.

There are two main types of hierarchical clustering:

  • Agglomerative hierarchical clustering: This starts with each data point considered an individual cluster and progressively merges similar clusters until all data points belong to a single cluster. At the beginning, each data point is treated as a separate cluster, and in each iteration, the two most similar clusters are merged into a larger one. This process continues until all data points are in one cluster. The merging process is guided by a distance or similarity measure, such as Euclidean distance or correlation.
  • Divisive hierarchical clustering: This takes the opposite, top-down approach, starting with all data points in a single cluster and recursively splitting clusters until each data point forms its own cluster...
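The agglomerative procedure described above maps directly onto MATLAB's `linkage`, `dendrogram`, and `cluster` functions (Statistics and Machine Learning Toolbox). A minimal sketch, on hypothetical data generated for illustration:

```matlab
% Agglomerative hierarchical clustering on hypothetical 2-D data.
rng(1);                               % for reproducibility
X = [randn(10,2); randn(10,2) + 4];   % two loosely separated groups

Z = linkage(X, 'ward');    % build the merge tree (Ward's linkage)
dendrogram(Z)              % visualize the hierarchy of merges

idx = cluster(Z, 'MaxClust', 2);   % cut the tree into two clusters
```

The second argument of `linkage` selects the merging criterion (`'single'`, `'complete'`, `'average'`, `'ward'`, ...), and the dendrogram's vertical axis shows the dissimilarity at which each merge occurred, which helps in choosing where to cut the tree.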

Partitioning-based clustering algorithms with MATLAB

Partitioning-based clustering is a type of clustering algorithm that aims to divide a dataset into distinct groups or partitions. In this approach, each data point is assigned to exactly one cluster, and the goal is to minimize the intra-cluster distance while maximizing the inter-cluster distance. The most popular partitioning-based clustering algorithms include k-means, k-medoids, and fuzzy c-means. These algorithms vary in their approach and objectives, but they all aim to partition the data into well-separated clusters based on some distance or similarity measure.

Introducing the k-means algorithm

One of the most well-known partitioning-based clustering algorithms is k-means. In k-means clustering, the algorithm attempts to partition the data into k clusters, where k is a predefined number specified by the user. The algorithm iteratively assigns data points to the nearest cluster centroid and recalculates the...
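In MATLAB, this assign-and-recompute loop is wrapped in the `kmeans` function, and `silhouette` gives a quick check of how well separated the resulting clusters are. A minimal sketch, on hypothetical data generated for illustration:

```matlab
% k-means with k = 3 on hypothetical 2-D data.
rng(1);                                            % for reproducibility
X = [randn(20,2); randn(20,2) + 5; randn(20,2) - 5];   % three groups

[idx, C] = kmeans(X, 3);   % idx: cluster index per row, C: 3-by-2 centroids

silhouette(X, idx)   % silhouette plot: values close to 1 indicate points
                     % well matched to their own cluster
```

Because k-means depends on the random initial centroids, options such as `kmeans(X, 3, 'Replicates', 10)` rerun the algorithm several times and keep the best solution.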

Grouping data using similarity measures

The k-medoids algorithm is a variation of the k-means algorithm that uses medoids (actual data points) as representatives of each cluster instead of centroids. Unlike the k-means algorithm, which calculates the mean of the data points within each cluster, the k-medoids algorithm selects the most centrally located data point within each cluster as the medoid. This makes k-medoids more robust to outliers and suitable for data with non-Euclidean distances.

Here are some key differences between k-medoids and k-means:

  • Representative points: In k-medoids, the representatives of each cluster are actual data points from the dataset (medoids), while in k-means, the representatives are the centroids, which are calculated as the mean of the data points.
  • Distance measure: The distance measure used in k-means is typically the Euclidean distance. On the other hand, k-medoids can handle various distance measures, including non-Euclidean distances...
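Both differences show up directly in MATLAB's `kmedoids` function, which accepts a `'Distance'` option and returns medoids that are actual rows of the data. A minimal sketch, on hypothetical data generated for illustration:

```matlab
% k-medoids with a non-Euclidean (city block) distance.
rng(1);                               % for reproducibility
X = [randn(15,2); randn(15,2) + 6];   % two hypothetical groups

[idx, M] = kmedoids(X, 2, 'Distance', 'cityblock');

% Unlike k-means centroids, each medoid is an actual observation:
% every row of M also appears as a row of X.
ismember(M, X, 'rows')
```

Swapping `'cityblock'` for `'euclidean'`, `'cosine'`, or a custom distance function changes only this one option, which is what makes k-medoids convenient for data where the mean is not meaningful.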

Discovering dimensionality reduction techniques

Dimensionality reduction is a technique used in machine learning and data analysis to reduce the number of variables or features in a dataset. The goal of dimensionality reduction is to simplify the data while retaining important information, thereby improving the efficiency and effectiveness of subsequent analysis tasks.

High-dimensional datasets can be challenging to work with due to several reasons:

  • Curse of dimensionality: As the number of features increases, the data becomes more sparse, making it difficult to find meaningful patterns or relationships
  • Computational complexity: Many algorithms and models become computationally expensive as the dimensionality of the data increases, requiring more time and resources for analysis
  • Overfitting: High-dimensional data is more susceptible to overfitting, where a model becomes too specialized to the training data and fails to generalize well to new data

Dimensionality...

Feature selection and feature extraction using MATLAB

In MATLAB, there are several built-in functions and toolboxes that can be used for dimensionality reduction. In the following sections, we will explore some practical examples of dimensionality reduction algorithms in the MATLAB environment.
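As a first taste, principal component analysis (PCA) is a standard feature extraction technique available through MATLAB's `pca` function. A minimal sketch, on hypothetical correlated data generated for illustration:

```matlab
% PCA as a feature extraction step on hypothetical data.
rng(1);                             % for reproducibility
X = randn(100, 5) * randn(5, 5);    % 100 observations, 5 correlated features

[coeff, score, latent] = pca(X);    % coeff: loadings, score: transformed
                                    % features, latent: variance per component

explained = 100 * latent / sum(latent);   % percent variance explained
X2 = score(:, 1:2);   % keep only the first two principal components
```

Inspecting `explained` shows how much information is retained: if the first two components account for most of the variance, `X2` is a faithful low-dimensional substitute for `X`.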

Stepwise regression for feature selection

Regression analysis is a valuable approach for understanding the impact of independent variables on a dependent variable. It allows us to identify predictors that hold greater influence over the model’s response. Stepwise regression is a variable selection method used to choose a subset of predictors that exhibit the strongest relationship with the dependent variable. There are three common variable selection algorithms:

  • Forward method: The forward method starts with an empty model, where no predictors are initially selected. In the first step, the variable showing the most significant association at a statistical level is added. In...
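The forward method described above is what MATLAB's `stepwiselm` function performs when started from a constant-only model. A minimal sketch, on hypothetical data where only two of four predictors actually drive the response:

```matlab
% Forward stepwise selection starting from a constant-only model.
rng(1);                                         % for reproducibility
X = randn(100, 4);                              % four candidate predictors
y = 3*X(:,1) - 2*X(:,3) + 0.1*randn(100, 1);    % only x1 and x3 matter

mdl = stepwiselm(X, y, 'constant', 'Upper', 'linear');
disp(mdl.Formula)   % the selected model typically keeps only the
                    % statistically significant predictors
```

At each step, `stepwiselm` adds (or removes) the term whose p-value crosses the entry (or exit) threshold, stopping when no remaining candidate improves the model, which is exactly the iterative selection the forward method describes.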

Summary

In this chapter, we gained knowledge about performing accurate cluster analysis in the MATLAB environment. Our exploration began by understanding the measurement of similarity, including concepts such as element proximity, similarity, and dissimilarity measures. We delved into different methods for grouping objects, namely hierarchical clustering and partitioning clustering.

Regarding partitioning clustering, we focused on the k-means method. We learned how to iteratively locate k centroids, each representing a cluster. We also examined the effectiveness of cluster separation and how to generate a silhouette plot using cluster indices obtained from k-means. The silhouette value for each data point serves as a measure of its similarity to other points within its own cluster, compared to points in other clusters. Furthermore, we delved into k-medoids clustering, which involves identifying the centers of clusters using medoids instead of centroids. We learned the procedure...
