You're reading from Hands-On Machine Learning with C++ (1st Edition, Packt, May 2020, ISBN-13: 9781789955330).
Author: Kirill Kolodiazhnyi

Kirill Kolodiazhnyi is a seasoned software engineer with expertise in custom software development. He has several years of experience building machine learning models and data products using C++. He holds a bachelor's degree in Computer Science from the Kharkiv National University of Radio-Electronics. He currently works in Kharkiv, Ukraine, where he lives with his wife and daughter.

Clustering

Clustering is an unsupervised machine learning method used to split an original dataset of objects into groups according to their properties. In machine learning, an object is usually treated as a point in a multidimensional metric space. Each dimension of this space corresponds to one object property (feature), and the metric is a function of the values of these properties. The types of dimensions in this space, which can be numerical or categorical, determine our choice of clustering algorithm and of the specific metric function. This choice depends on the nature of the different object properties' types.

The main difference between clustering and classification is that the set of target groups is not defined in advance; it is determined by the clustering algorithm. The set of target groups (clusters) is the algorithm's result.

We can split cluster analysis into the...

Technical requirements

Measuring distance in clustering

A metric, or distance measure, is an essential concept in clustering because it is used to determine the similarity between objects. Before we can apply a distance measure to objects, we have to make a vector of object characteristics; usually, this is a set of numerical values such as a person's height or weight. Some algorithms can also work with categorical object features (or characteristics). The standard practice is to normalize feature values, which ensures that every feature contributes equally to the distance measure calculation. There are many distance measure functions that can be used for the clustering task. The most popular ones for numerical properties are the Euclidean distance, the squared Euclidean distance, the Manhattan distance, and the Chebyshev distance. The following subsections describe them in detail.

...

Types of clustering algorithms

There are different types of clustering, which we can classify into the following groups: partition-based, spectral, hierarchical, density-based, and model-based. The partition-based group of clustering algorithms can be further divided into distance-based methods and methods based on graph theory.

Partition-based clustering algorithms

Partition-based methods use a similarity measure to combine objects into groups. A practitioner usually selects the similarity measure for such algorithms themselves, using prior knowledge about the problem or heuristics to choose the measure properly. Sometimes, several measures need to be tried with the same algorithm to select the best one. Also,...

Examples of using the Shogun library for clustering

The Shogun library contains implementations of model-based, hierarchical, and partition-based clustering approaches. The model-based algorithm is the GMM (Gaussian Mixture Model) algorithm, the partition-based one is the k-means algorithm, and hierarchical clustering is based on the bottom-up method.

GMM with Shogun

The GMM algorithm assumes that clusters can be fitted with Gaussian (normal) distributions and uses the expectation-maximization (EM) approach for training. There is a CGMM class in the Shogun library that implements this algorithm, as illustrated in the following code snippet:

Some<CDenseFeatures<DataType>> features;
int num_clusters = 2;
...

Examples of using the Shark-ML library for clustering

The Shark-ML library implements two clustering algorithms: hierarchical clustering and the k-means algorithm.

Hierarchical clustering with Shark-ML

The Shark-ML library implements the hierarchical clustering approach in the following way: first, we need to put our data into a space-partitioning tree. For example, we can use an object of the LCTree class, which implements binary space partitioning. There is also the KHCTree class, which implements kernel-induced feature space partitioning. The constructor of such a class takes the data for partitioning and an object that implements a stopping criterion for the tree construction. We use the...

Examples of using the Dlib library for clustering

The Dlib library provides the following clustering methods: k-means, spectral, hierarchical, and two more graph clustering algorithms: Newman and Chinese Whispers.

K-means clustering with Dlib

The Dlib library uses kernel functions as the distance functions for the k-means algorithm. An example of such a function is the radial basis function. As an initial step, we define the required types, as follows:

typedef matrix<double, 2, 1> sample_type;
typedef radial_basis_kernel<sample_type> kernel_type;

Then, we initialize an object of the kkmeans type. Its constructor takes, as an input parameter, an object that will define cluster centroids...
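A sketch of how such a snippet typically continues, following dlib's documented kkmeans usage pattern (kcentroid holds the centroids, pick_initial_centers seeds them, and train runs the algorithm). The numeric parameters of the kcentroid object and the sample values are illustrative assumptions, not values from the book:

```cpp
#include <dlib/clustering.h>

#include <vector>

using namespace dlib;

int main() {
  typedef matrix<double, 2, 1> sample_type;
  typedef radial_basis_kernel<sample_type> kernel_type;

  // The kcentroid object maintains the cluster centroids; kkmeans
  // takes it as a constructor argument. The kernel width (0.1),
  // tolerance (0.01), and dictionary size (8) are assumed values.
  kcentroid<kernel_type> kc(kernel_type(0.1), 0.01, 8);
  kkmeans<kernel_type> clusterer(kc);

  // A few hypothetical samples forming two separated groups.
  std::vector<sample_type> samples;
  sample_type m;
  m(0) = 0.0;  m(1) = 0.0;  samples.push_back(m);
  m(0) = 0.5;  m(1) = 0.5;  samples.push_back(m);
  m(0) = 10.0; m(1) = 10.0; samples.push_back(m);
  m(0) = 10.5; m(1) = 10.5; samples.push_back(m);

  const unsigned long num_clusters = 2;
  clusterer.set_number_of_centers(num_clusters);

  std::vector<sample_type> initial_centers;
  pick_initial_centers(num_clusters, initial_centers, samples,
                       clusterer.get_kernel());
  clusterer.train(samples, initial_centers);

  // After training, calling clusterer(sample) returns the index of the
  // cluster the sample belongs to.
  auto cluster_index = clusterer(samples[0]);
  (void)cluster_index;
  return 0;
}
```

The radial basis kernel makes this a kernelized k-means, so it can separate groups that are not linearly separable in the original feature space.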

Plotting data with C++

We plot with the plotcpp library, a thin wrapper around the gnuplot command-line utility. With this library, we can draw points on a scatter plot or draw lines. The first step in plotting with this library is creating an object of the Plot class. Then, we have to specify the output destination of the drawing. We can set the destination with the Plot::SetTerminal() method, which takes a string with a destination point abbreviation. This can be the qt string value, to show an operating system (OS) window with our drawing, or a string with a picture file extension, to save the drawing to a file, as in the code sample that follows. We can also configure the title of the drawing, the axis labels, and some other parameters with the Plot class methods. However, they do not cover all the configuration options available in gnuplot. In...
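A minimal sketch of the workflow just described. Only Plot and Plot::SetTerminal() are named in the text; the other method and type names used here (SetOutput, SetTitle, SetXLabel, SetYLabel, Draw2D, Points, Flush) are assumptions about the plotcpp API and should be checked against the library headers:

```cpp
#include <plot.h>  // assumed plotcpp header name

#include <vector>

int main() {
  std::vector<double> x{1.0, 2.0, 3.0};
  std::vector<double> y{1.0, 4.0, 9.0};

  plotcpp::Plot plt;
  plt.SetTerminal("png");         // or "qt" to open an OS window
  plt.SetOutput("clusters.png");  // assumed: output file name
  plt.SetTitle("Clusters");       // assumed: drawing title
  plt.SetXLabel("x");             // assumed: axis labels
  plt.SetYLabel("y");
  // Assumed: Points takes iterator ranges for the coordinates, a series
  // name, and a gnuplot style string for the point appearance.
  plt.Draw2D(plotcpp::Points(x.begin(), x.end(), y.begin(), "data",
                             "lc rgb 'black' pt 7"));
  plt.Flush();  // assumed: generates the gnuplot script and runs it
  return 0;
}
```

Because plotcpp only forwards style strings to gnuplot, anything it does not wrap can still be expressed through those raw gnuplot style options.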

Summary

In this chapter, we considered what clustering is and how it differs from classification. We saw different types of clustering methods: partition-based, spectral, hierarchical, density-based, and model-based. We also observed that partition-based methods can be further divided into distance-based methods and methods based on graph theory. We used implementations of these algorithms, including the k-means algorithm (a distance-based method), the GMM algorithm (a model-based method), and the Newman modularity-based and Chinese Whispers algorithms for graph clustering. We also saw how to use hierarchical and spectral clustering algorithm implementations in programs. We saw that the crucial issues for successful clustering are as follows:

  • The choice of the distance measure function
  • The initialization...

Further reading

