You're reading from Hands-On Machine Learning with C++ (1st Edition, Packt, May 2020, ISBN-13: 9781789955330).
Author: Kirill Kolodiazhnyi

Kirill Kolodiazhnyi is a seasoned software engineer with expertise in custom software development. He has several years of experience building machine learning models and data products using C++. He holds a bachelor's degree in Computer Science from the Kharkiv National University of Radio-Electronics. He currently works in Kharkiv, Ukraine, where he lives with his wife and daughter.

Clustering

Clustering is an unsupervised machine learning method used to split an original dataset of objects into groups according to their properties. In machine learning, an object is usually treated as a point in a multidimensional metric space. Each dimension of this space corresponds to one object property (feature), and the metric is a function of the values of these properties. The types of dimensions in this space, which can be numerical or categorical, determine our choice of clustering algorithm and of the specific metric function. This choice depends on the nature of the different object properties' types.

The main difference between clustering and classification is that the set of target groups is not defined in advance; it is determined by the clustering algorithm. The set of target groups (clusters) is the algorithm's result.

We can split cluster analysis into the...

Technical requirements

Measuring distance in clustering

A metric, or distance measure, is an essential concept in clustering because it is used to determine the similarity between objects. Before we can apply a distance measure to objects, we have to make a vector of object characteristics; usually, this is a set of numerical values such as a person's height or weight. Some algorithms can also work with categorical object features (or characteristics). The standard practice is to normalize feature values, which ensures that every feature contributes equally to the distance measure calculation. There are many distance measure functions that can be used for the clustering task. The most popular ones for numerical properties are the Euclidean distance, the squared Euclidean distance, the Manhattan distance, and the Chebyshev distance. The following subsections describe them in detail.

...

Types of clustering algorithms

There are different types of clustering, which we can classify into the following groups: partition-based, spectral, hierarchical, density-based, and model-based. The partition-based group of clustering algorithms can be further divided into distance-based methods and methods based on graph theory.

Partition-based clustering algorithms

Partition-based methods use a similarity measure to combine objects into groups. A practitioner usually selects the similarity measure for such algorithms themselves, using prior knowledge about the problem or heuristics to choose the measure properly. Sometimes, several measures need to be tried with the same algorithm to select the best one. Also,...

Examples of using the Shogun library for clustering

The Shogun library contains implementations of model-based, hierarchical, and partition-based clustering approaches. The model-based algorithm is the GMM (Gaussian Mixture Model) algorithm, the partition-based one is the k-means algorithm, and hierarchical clustering is based on the bottom-up method.

GMM with Shogun

The GMM algorithm assumes that clusters can be fitted with Gaussian (normal) distributions and uses the expectation-maximization (EM) approach for training. There is a CGMM class in the Shogun library that implements this algorithm, as illustrated in the following code snippet:

Some<CDenseFeatures<DataType>> features;
int num_clusters = 2;
...

Examples of using the Shark-ML library for clustering

The Shark-ML library implements two clustering algorithms: hierarchical clustering and the k-means algorithm.

Hierarchical clustering with Shark-ML

The Shark-ML library implements the hierarchical clustering approach in the following way: first, we need to put our data into a space-partitioning tree. For example, we can use an object of the LCTree class, which implements binary space partitioning. There is also the KHCTree class, which implements kernel-induced feature space partitioning. The constructor of such a class takes the data for partitioning and an object that implements a stopping criterion for the tree construction. We use the...

Examples of using the Dlib library for clustering

The Dlib library provides the following clustering methods: k-means, spectral, hierarchical, and two more graph clustering algorithms: Newman and Chinese Whispers.

K-means clustering with Dlib

The Dlib library uses kernel functions as the distance functions for the k-means algorithm. An example of such a function is the radial basis function. As an initial step, we define the required types, as follows:

typedef matrix<double, 2, 1> sample_type;
typedef radial_basis_kernel<sample_type> kernel_type;

Then, we initialize an object of the kkmeans type. Its constructor takes, as an input parameter, an object that will define cluster centroids...
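A sketch of how such a snippet typically continues, following dlib's documented kkmeans usage pattern (kcentroid holds the centroids, pick_initial_centers seeds them, and train runs the algorithm). The numeric parameters of the kcentroid object and the sample values are illustrative assumptions, not values from the book:

```cpp
#include <dlib/clustering.h>

#include <vector>

using namespace dlib;

int main() {
  typedef matrix<double, 2, 1> sample_type;
  typedef radial_basis_kernel<sample_type> kernel_type;

  // The kcentroid object maintains the cluster centroids; kkmeans
  // takes it as a constructor argument. The kernel width (0.1),
  // tolerance (0.01), and dictionary size (8) are assumed values.
  kcentroid<kernel_type> kc(kernel_type(0.1), 0.01, 8);
  kkmeans<kernel_type> clusterer(kc);

  // A few hypothetical samples forming two separated groups.
  std::vector<sample_type> samples;
  sample_type m;
  m(0) = 0.0;  m(1) = 0.0;  samples.push_back(m);
  m(0) = 0.5;  m(1) = 0.5;  samples.push_back(m);
  m(0) = 10.0; m(1) = 10.0; samples.push_back(m);
  m(0) = 10.5; m(1) = 10.5; samples.push_back(m);

  const unsigned long num_clusters = 2;
  clusterer.set_number_of_centers(num_clusters);

  std::vector<sample_type> initial_centers;
  pick_initial_centers(num_clusters, initial_centers, samples,
                       clusterer.get_kernel());
  clusterer.train(samples, initial_centers);

  // After training, calling clusterer(sample) returns the index of the
  // cluster the sample belongs to.
  auto cluster_index = clusterer(samples[0]);
  (void)cluster_index;
  return 0;
}
```

The radial basis kernel makes this a kernelized k-means, so it can separate groups that are not linearly separable in the original feature space.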

Plotting data with C++

We plot with the plotcpp library, a thin wrapper around the gnuplot command-line utility. With this library, we can draw points on a scatter plot or draw lines. The first step in plotting with this library is creating an object of the Plot class. Then, we have to specify the output destination of the drawing. We can set the destination with the Plot::SetTerminal() method, which takes a string with a destination point abbreviation. This can be the qt string value, to show an operating system (OS) window with our drawing, or a string with a picture file extension, to save the drawing to a file, as in the code sample that follows. We can also configure the title of the drawing, the axis labels, and some other parameters with the Plot class methods. However, they do not cover all the configuration options available in gnuplot. In...
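A minimal sketch of the workflow just described. Only Plot and Plot::SetTerminal() are named in the text; the other method and type names used here (SetOutput, SetTitle, SetXLabel, SetYLabel, Draw2D, Points, Flush) are assumptions about the plotcpp API and should be checked against the library headers:

```cpp
#include <plot.h>  // assumed plotcpp header name

#include <vector>

int main() {
  std::vector<double> x{1.0, 2.0, 3.0};
  std::vector<double> y{1.0, 4.0, 9.0};

  plotcpp::Plot plt;
  plt.SetTerminal("png");         // or "qt" to open an OS window
  plt.SetOutput("clusters.png");  // assumed: output file name
  plt.SetTitle("Clusters");       // assumed: drawing title
  plt.SetXLabel("x");             // assumed: axis labels
  plt.SetYLabel("y");
  // Assumed: Points takes iterator ranges for the coordinates, a series
  // name, and a gnuplot style string for the point appearance.
  plt.Draw2D(plotcpp::Points(x.begin(), x.end(), y.begin(), "data",
                             "lc rgb 'black' pt 7"));
  plt.Flush();  // assumed: generates the gnuplot script and runs it
  return 0;
}
```

Because plotcpp only forwards style strings to gnuplot, anything it does not wrap can still be expressed through those raw gnuplot style options.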

Summary

In this chapter, we considered what clustering is and how it differs from classification. We saw different types of clustering methods: partition-based, spectral, hierarchical, density-based, and model-based. We also observed that partition-based methods can be further divided into distance-based methods and methods based on graph theory. We used implementations of these algorithms, including the k-means algorithm (a distance-based method), the GMM algorithm (a model-based method), and the Newman modularity-based and Chinese Whispers algorithms for graph clustering. We also saw how to use hierarchical and spectral clustering algorithm implementations in programs. We saw that the crucial issues for successful clustering are as follows:

  • The choice of the distance measure function
  • The initialization...

Further reading

