You're reading from Machine Learning for Developers (1st Edition, Packt, Oct 2017, ISBN-13: 9781786469878).

Author: Rodolfo Bonnin
Rodolfo Bonnin is a systems engineer and Ph.D. student at Universidad Tecnológica Nacional, Argentina. He has also pursued parallel programming and image understanding postgraduate courses at Universität Stuttgart, Germany. He has been doing research on high-performance computing since 2005 and began studying and implementing convolutional neural networks in 2008, writing a CPU- and GPU-supporting neural network feedforward stage. More recently he's been working in the field of fraud pattern detection with neural networks and is currently working on signal classification using machine learning techniques. He is also the author of Building Machine Learning Projects with Tensorflow and Machine Learning for Developers by Packt Publishing.

Clustering

Congratulations! You have finished this book's introductory section, in which you have explored a great number of topics, and if you were able to follow it, you are prepared to start the journey of understanding the inner workings of many machine learning models.

In this chapter, we will explore some effective and simple approaches for automatically finding interesting data conglomerates, and so begin to research the reasons for natural groupings in data.

This chapter covers the following topics:

  • A line-by-line implementation of an example of the K-means algorithm, with explanations of the data structures and routines
  • A thorough explanation of the k-nearest neighbors (K-NN) algorithm, using a code example to explain the whole process
  • Additional methods of determining the optimal number of groups representing a set of samples
...

Grouping as a human activity

Humans typically tend to agglomerate everyday elements into groups of similar features. This feature of the human mind can also be replicated by an algorithm. Conversely, one of the simplest operations that can be initially applied to any unlabeled dataset is to group elements around common features.

As we have described, at this stage in the discipline's development, clustering is taught as an introductory topic, applied to the simplest categories of element sets.

But as an author, I recommend researching this domain further, because the community hints that the performance of current models will plateau before full generalization of tasks in AI is reached. And what kinds of methods are the main candidates for the next stages of crossing the frontier towards AI? Unsupervised methods, in the form of very sophisticated...

Automating the clustering process

The grouping of information for clustering follows a common pattern across all techniques. Basically, we have an initialization stage, followed by the iterative insertion of new elements, after which the group relationships are updated. This process continues until the stopping criterion is met, at which point the group characterization is finished. The following flow diagram illustrates this process:

General scheme for a clustering algorithm
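The stages just described (initialization, iterative assignment of elements, updating of group relationships, and a stop check) can be sketched as a generic loop. This is only an illustrative skeleton with hypothetical callback names, not an API from any library:

```python
import numpy as np

def generic_clustering(data, init_groups, assign, update, max_iters=100):
    """Generic clustering loop: initialize the groups, then alternate
    element assignment and group updates until the stopping criterion
    (no further change, or max_iters) is met."""
    groups = init_groups(data)                    # initialization stage
    assignments = assign(data, groups)
    for _ in range(max_iters):
        assignments = assign(data, groups)        # insert elements into groups
        new_groups = update(data, assignments)    # update group relationships
        if np.allclose(new_groups, groups):       # stopping criterion: no change
            break
        groups = new_groups
    return groups, assignments
```

Each concrete technique in this chapter fills in `init_groups`, `assign`, and `update` in its own way; K-means, for example, uses centroids as the group representation.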

After we get a clear sense of the overall process, let's start working with several cases where this scheme is applied, starting with K-means.

Finding a common center - K-means

Here we go! After the necessary preparatory review, we will finally start to learn from data; in this case, we are looking to label data we observe in real life.

In this case, we have the following elements:

  • A set of N-dimensional elements of numeric type
  • A predetermined number of groups (this is tricky because we have to make an educated guess)
  • A set of common representative points for each group (called centroids)

The main objective of this method is to split the dataset into an arbitrary number of clusters, each of which can be represented by the mentioned centroids.
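Ahead of the line-by-line implementation promised earlier, a compact NumPy sketch of that objective may help: pick k initial centroids at random, assign every sample to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop moving. This is only an illustrative sketch, not the chapter's full implementation:

```python
import numpy as np

def kmeans(samples, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct samples as the starting centroids
    centroids = samples[rng.choice(len(samples), size=k, replace=False)]
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(max_iters):
        # Assign each sample to its nearest centroid (squared Euclidean distance)
        distances = ((samples[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned samples
        # (keeping the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            samples[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # stopping criterion
            break
        centroids = new_centroids
    return centroids, labels
```

Note that the result depends on the (random) initialization and on the guessed k, which is exactly why choosing the number of groups is the tricky part mentioned above.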

The word centroid comes from mathematics, and carries over into calculus and physics. Here we find a classical representation of the analytical calculation of a triangle's centroid:

Graphical depiction of the centroid finding scheme for a triangle
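Concretely, a triangle's centroid is the arithmetic mean of its three vertices: G = ((x1 + x2 + x3)/3, (y1 + y2 + y3)/3). A quick check with NumPy (the vertex values here are illustrative):

```python
import numpy as np

# Vertices of a triangle (illustrative values)
vertices = np.array([[0.0, 0.0],
                     [6.0, 0.0],
                     [0.0, 3.0]])

# The centroid is the mean of the vertices, coordinate by coordinate
centroid = vertices.mean(axis=0)
print(centroid)  # [2. 1.]
```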

The centroid...

Nearest neighbors

K-NN is another classical method of clustering. It builds groups of samples, supposing that each new sample will have the same class as its neighbors, without looking for a global representative central sample. Instead, it examines each new sample's neighborhood, looking for the most frequent class within it.
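That "most frequent class in the neighborhood" criterion can be sketched in a few lines. This is an illustrative helper (not the implementation given later in the chapter), using Euclidean distance and a majority vote:

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, new_sample, k=3):
    """Label a new sample with the most frequent class among its
    k nearest (Euclidean) neighbors in the already-labeled set."""
    distances = np.linalg.norm(train_x - new_sample, axis=1)
    nearest = np.argsort(distances)[:k]   # indices of the k closest samples
    votes = Counter(train_y[nearest])     # count the classes in the neighborhood
    return votes.most_common(1)[0][0]     # majority class wins
```

No central representative is ever computed; the decision is purely local to each new sample.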

Mechanics of K-NN

K-NN can be implemented in many configurations, but in this chapter we will use the semi-supervised approach: starting from a certain number of already-assigned samples, we will later guess the cluster membership of new samples using the main criterion described above.

In the following diagram, we have a breakdown of the algorithm. It can be summarized with the following steps:

Flowchart for the K-NN...

K-NN sample implementation

For this simple implementation of the K-NN method, we will use the NumPy and Matplotlib libraries. Also, as we will be generating a synthetic dataset for better comprehension, we will use the make_blobs method from scikit-learn, which will generate well-defined and separated groups of information so we have a sure reference for our implementation.

Importing the required libraries:

    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    # sklearn.datasets.samples_generator was removed in scikit-learn 0.24;
    # make_blobs is now imported directly from sklearn.datasets:
    from sklearn.datasets import make_blobs
    %matplotlib inline

So, it's time to generate the data samples for this example. The parameters of make_blobs are the number of samples, the number of features or dimensions, the quantity of centers or groups, whether the samples have to be shuffled, and the standard deviation of the cluster, to control how dispersed...
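As a sketch of that call, with illustrative parameter values (not necessarily the ones the full example uses):

```python
from sklearn.datasets import make_blobs

# 200 two-dimensional samples in 4 well-separated groups; shuffle mixes the
# samples, and cluster_std controls how dispersed each group is
samples, labels = make_blobs(n_samples=200, n_features=2, centers=4,
                             shuffle=True, cluster_std=0.8, random_state=42)
print(samples.shape, labels.shape)  # (200, 2) (200,)
```

Because `make_blobs` also returns the ground-truth group of each sample, we have a sure reference against which to check our own implementation's assignments.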

Summary

In this chapter, we have covered the simplest yet still very useful machine learning models in an eminently practical way, to get us started on the complexity scale.

In the following chapter, where we will cover several regression techniques, it will be time to tackle a new type of problem that we have not worked on so far (even though it is possible to approach it with clustering methods): regression, using new mathematical tools for approximating unknown values. In it, we will model past data using mathematical functions, and try to predict new outputs based on those modeling functions.

