You're reading from Machine Learning with PyTorch and Scikit-Learn

Product type Book

Published in Feb 2022

Publisher Packt

ISBN-13 9781801819312

Pages 774 pages

Edition 1st Edition

Concepts

Machine Learning

Authors (3):

Sebastian Raschka

Yuxi (Hayden) Liu

Vahid Mirjalili

Table of Contents (22) Chapters

Preface

1. Giving Computers the Ability to Learn from Data

2. Training Simple Machine Learning Algorithms for Classification

3. A Tour of Machine Learning Classifiers Using Scikit-Learn

4. Building Good Training Datasets – Data Preprocessing

5. Compressing Data via Dimensionality Reduction

6. Learning Best Practices for Model Evaluation and Hyperparameter Tuning

7. Combining Different Models for Ensemble Learning

8. Applying Machine Learning to Sentiment Analysis

9. Predicting Continuous Target Variables with Regression Analysis

10. Working with Unlabeled Data – Clustering Analysis

11. Implementing a Multilayer Artificial Neural Network from Scratch

12. Parallelizing Neural Network Training with PyTorch

13. Going Deeper – The Mechanics of PyTorch

14. Classifying Images with Deep Convolutional Neural Networks

15. Modeling Sequential Data Using Recurrent Neural Networks

16. Transformers – Improving Natural Language Processing with Attention Mechanisms

17. Generative Adversarial Networks for Synthesizing New Data

18. Graph Neural Networks for Capturing Dependencies in Graph Structured Data

19. Reinforcement Learning for Decision Making in Complex Environments

20. Other Books You May Enjoy

21. Index

Working with Unlabeled Data – Clustering Analysis

In the previous chapters, we used supervised learning techniques to build machine learning models, using data where the answer was already known—the class labels were already available in our training data. In this chapter, we will switch gears and explore cluster analysis, a category of unsupervised learning techniques that allows us to discover hidden structures in data where we do not know the right answer upfront. The goal of clustering is to find a natural grouping in data so that items in the same cluster are more similar to each other than to those from different clusters.

Given its exploratory nature, clustering is an exciting topic, and in this chapter, you will learn about the following concepts, which can help us to organize data into meaningful structures:

Finding centers of similarity using the popular k-means algorithm
Taking a bottom-up approach to building hierarchical clustering trees...

Grouping objects by similarity using k-means

In this section, we will learn about one of the most popular clustering algorithms, k-means, which is widely used in academia as well as in industry. Clustering (or cluster analysis) is a technique that allows us to find groups of similar objects that are more related to each other than to objects in other groups. Examples of business-oriented applications of clustering include the grouping of documents, music, and movies by different topics, or finding customers that share similar interests based on common purchase behaviors as a basis for recommendation engines.

k-means clustering using scikit-learn

As you will see in a moment, the k-means algorithm is extremely easy to implement, and it is also computationally very efficient compared to other clustering algorithms, which might explain its popularity. The k-means algorithm belongs to the category of prototype-based clustering.

We will discuss two other categories of clustering...
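To make the k-means workflow concrete, here is a minimal sketch using scikit-learn's KMeans class. The toy dataset from make_blobs, the choice of three clusters, and all parameter values are assumptions made for illustration only; they are not taken from the excerpt above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: a synthetic 2D dataset with three compact groups,
# used only so that the example is self-contained.
X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, shuffle=True, random_state=0)

# k-means needs the number of clusters (prototypes) upfront.
# n_init=10 reruns the algorithm from 10 random starts and keeps
# the run with the lowest within-cluster sum of squared errors.
km = KMeans(n_clusters=3, init="random", n_init=10,
            max_iter=300, tol=1e-4, random_state=0)
y_km = km.fit_predict(X)

print("Cluster sizes:", np.bincount(y_km))
print("Centroids:\n", km.cluster_centers_)
print("Within-cluster SSE (inertia): %.2f" % km.inertia_)
```

Each row of km.cluster_centers_ is one prototype (centroid), and y_km holds the index of the closest centroid for every example.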

Organizing clusters as a hierarchical tree

In this section, we will look at an alternative approach to prototype-based clustering: hierarchical clustering. One advantage of the hierarchical clustering algorithm is that it allows us to plot dendrograms (visualizations of a binary hierarchical clustering), which can help with the interpretation of the results by creating meaningful taxonomies. Another advantage of this hierarchical approach is that we do not need to specify the number of clusters upfront.

The two main approaches to hierarchical clustering are agglomerative and divisive hierarchical clustering. In divisive hierarchical clustering, we start with one cluster that encompasses the complete dataset, and we iteratively split the cluster into smaller clusters until each cluster only contains one example. In this section, we will focus on agglomerative clustering, which takes the opposite approach. We start with each example as an individual cluster and merge the closest...
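The agglomerative procedure can be sketched with SciPy's hierarchical clustering functions. The random toy data, the complete linkage criterion, and the Euclidean metric are assumptions chosen for illustration; the excerpt above does not specify them.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Assumption: five random examples with three features each.
rng = np.random.default_rng(123)
X = rng.random((5, 3)) * 10

# Bottom-up merging: every example starts as its own cluster.
# Each row of Z describes one merge:
# [index of cluster A, index of cluster B, merge distance, new cluster size]
Z = linkage(X, method="complete", metric="euclidean")
print(Z)

# The number of clusters is chosen only after the tree is built,
# here by cutting the tree so that at most two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)

# Compute the dendrogram layout without drawing it;
# passing the result to matplotlib is left out to keep the sketch short.
tree = dendrogram(Z, no_plot=True)
print("Leaf order:", tree["ivl"])
```

With n examples, the linkage matrix always has n - 1 rows, because each merge reduces the number of clusters by one until a single cluster remains.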

Locating regions of high density via DBSCAN

Although we can’t cover the vast number of different clustering algorithms in this chapter, let’s at least include one more approach to clustering: density-based spatial clustering of applications with noise (DBSCAN), which does not make assumptions about spherical clusters like k-means, nor does it partition the dataset into hierarchies that require a manual cut-off point. As its name implies, density-based clustering assigns cluster labels based on dense regions of points. In DBSCAN, the notion of density is defined as the number of points within a specified radius, ε.

According to the DBSCAN algorithm, a special label is assigned to each example (data point) using the following criteria:

A point is considered a core point if at least a specified number (MinPts) of neighboring points fall within the specified radius, ε
A border point is a point that has fewer neighbors than MinPts within ε, but lies within...
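The labeling criteria above can be explored with scikit-learn's DBSCAN class, where eps plays the role of the radius ε and min_samples plays the role of MinPts. This is a minimal sketch: the half-moon toy data and the values eps=0.2 and min_samples=5 are assumptions chosen for illustration. Note also that scikit-learn counts the point itself toward min_samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Assumption: two interleaved half-moons, a shape that is not spherical.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5, metric="euclidean")
labels = db.fit_predict(X)  # the label -1 marks noise points

# Core points: at least min_samples points within radius eps.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points belong to no cluster; the remaining non-core
# points that received a cluster label are border points.
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask

n_clusters = len(set(labels.tolist()) - {-1})
print("Clusters found:", n_clusters)
print("Core points:", int(core_mask.sum()))
print("Border points:", int(border_mask.sum()))
print("Noise points:", int(noise_mask.sum()))
```

Every example falls into exactly one of the three masks, so the three counts always add up to the number of examples.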

Summary

In this chapter, you learned about three different clustering algorithms that can help us with the discovery of hidden structures or information in data. We started with a prototype-based approach, k-means, which clusters examples into spherical shapes based on a specified number of cluster centroids. Since clustering is an unsupervised method, we do not enjoy the luxury of ground-truth labels to evaluate the performance of a model. Thus, we used intrinsic performance metrics, such as the elbow method or silhouette analysis, as an attempt to quantify the quality of clustering.
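As a brief reminder of how these intrinsic metrics can be computed, the following sketch collects the within-cluster SSE (for the elbow method) and the average silhouette coefficient for several values of k. The toy dataset and the range of k values are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: a synthetic dataset with three compact groups.
X, _ = make_blobs(n_samples=150, n_features=2, centers=3,
                  cluster_std=0.5, random_state=0)

inertias = {}     # within-cluster SSE per k, the input to an elbow plot
silhouettes = {}  # average silhouette coefficient per k, in [-1, 1]

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_, metric="euclidean")

for k in inertias:
    print("k=%d  SSE=%.2f  silhouette=%.3f" % (k, inertias[k], silhouettes[k]))

# A higher average silhouette suggests better-separated clusters.
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest average silhouette:", best_k)
```

The SSE keeps shrinking as k grows, so the elbow method looks for the k after which the decrease flattens out, whereas the silhouette coefficient can be compared across k directly.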

We then looked at a different approach to clustering: agglomerative hierarchical clustering. Hierarchical clustering does not require specifying the number of clusters upfront, and the result can be visualized in a dendrogram representation, which can help with the interpretation of the results. The last clustering algorithm that we covered in this chapter was DBSCAN, an algorithm that groups points...