By the end of this chapter, you will be able to:
Distinguish between supervised learning and unsupervised learning
Explain the concept of clustering
Implement k-means clustering algorithms using built-in Python packages
Calculate the Silhouette Score for your data
Have you ever been asked to take a look at some data and come up empty handed? Maybe you were not familiar with the dataset, or maybe you didn't even know where to start. This may have been extremely frustrating, and even embarrassing, depending on who asked you to take care of the task.
You are not alone, and, interestingly enough, there are many times the data itself is simply too confusing to be made sense of. As you try and figure out what all those numbers in your spreadsheet mean, you're most likely mimicking what many unsupervised algorithms do when they try to find meaning in data. The reality is that many datasets in the real world don't have any rhyme or reason to them. You will be tasked with analyzing them with little background preparation. Don't fret, however – this book will prepare you so that you'll never be frustrated again when dealing with data exploration tasks.
For this book, we have developed some best-in-class content to help you understand how unsupervised algorithms work and where to use them. We'll cover some of the foundations of finding clusters in your data, how to reduce the size of your data so it's easier to understand, and how each of these sides of unsupervised learning can be applied in the real world. We hope you will come away from this book with a strong real-world understanding of unsupervised learning, the problems that it can solve, and those it cannot.
Thanks for joining us and we hope you enjoy the ride!
Unsupervised learning is one of the most exciting areas of development in machine learning today. If you have explored machine learning bookwork before, you are probably familiar with the common breakout of problems in either supervised or unsupervised learning. Supervised learning encompasses the problem set of having a labeled dataset that can be used to either classify (for example, predicting smokers and non-smokers if you're looking at a lung health dataset) or fit a regression line on (for example, predicting the sale price of a home based on how many bedrooms it has). This model most closely mirrors an intuitive human approach to learning.
If you wanted to learn how to not burn your food with a basic understanding of cooking, you could build a dataset by putting your food on the burner and seeing how long it takes (input) for your food to burn (output). Eventually, as you continue to burn your food, you will build a mental model of when burning will occur and avoid it in the future. Development in supervised learning was once fast-paced and valuable, but it has since simmered down in recent years – many of the obstacles to knowing your data have already been tackled:
Conversely, unsupervised learning encompasses the problem set of having a tremendous amount of data that is unlabeled. Labeled data, in this case, would be data that has a supplied "target" outcome that you are trying to find the correlation to with supplied data (you know that you are looking for whether your food was burned in the preceding example). Unlabeled data is when you do not know what the "target" outcome is, and you only have supplied input data.
Building upon the previous example, imagine you were just dropped on planet Earth with zero knowledge of how cooking works. You are given 100 days, a stove, and a fridge full of food without any instructions on what to do. Your initial exploration of a kitchen could go in infinite directions – on day 10, you may finally learn how to open the fridge; on day 30, you may learn that food can go on the stove; and after many more days, you may unwittingly make an edible meal. As you can see, trying to find meaning in a kitchen devoid of adequate informational structure leads to very noisy data that is completely irrelevant to actually preparing a meal.
Unsupervised learning can be an answer to this problem. By looking back at your 100 days of data, clustering can be used to find patterns of similar days where a meal was produced, and you can easily review what you did on those days. However, unsupervised learning isn't a magical answer –simply finding clusters can be just as likely to help you to find pockets of similar yet ultimately useless data.
This challenge is what makes unsupervised learning so exciting. How can we find smarter techniques to speed up the process of finding clusters of information that are beneficial to our end goals?
Being able to find groups of similar data that exist in your dataset can be extremely valuable if you are trying to find its underlying meaning. If you were a store owner and you wanted to understand which customers are more valuable without a set idea of what valuable is, clustering would be a great place to start to find patterns in your data. You may have a few high-level ideas of what denotes a valuable customer, but you aren't entirely sure in the face of a large mountain of available data. Through clustering you can find commonalities among similar groups in your data. If you look more deeply at a cluster of similar people, you may learn that everyone in that group visits your website for longer periods of time than others. This can show you what the value is and also provides a clean sample size for future supervised learning experiments.
The following figure shows two scatterplots:
The following figure separates the scatterplots into two distinct clusters:
Both figures display randomly generated number pairs (x,y coordinates) pulled from a Gaussian distribution. Simply by glancing at Figure 1.2, it should be plainly obvious where the clusters exist in your data – in real life, it will never be this easy. Now that you know that the data can be clearly separated into two clusters, you can start to understand what differences exist between the two groups.
Rewinding a bit from where unsupervised learning fits into the larger machine learning environment, let's begin by understanding the building blocks of clustering. The most basic definition finds clusters simply as groupings of similar data as subsets of a larger dataset. As an example, imagine that you had a room with 10 people in it and each person had a job either in finance or as a scientist. If you told all of the financial workers to stand together and all the scientists to do the same, you would have effectively formed two clusters based on job types. Finding clusters can be immensely valuable in identifying items that are more similar, and, on the other end of the scale, quite different from each other.
To understand this, imagine that you were given a simple 1,000-row dataset by your employer that had two columns of numerical data as follows:
At first glance, this dataset provides no real structure or understanding – confusing to say the least!
A dimension in a dataset is another way of simply counting the number of features available. In most organized data tables, you can view the number of features as the number of columns. So, using the 1,000-row dataset example of size (1,000 x 2), you will have 1,000 observations across two dimensions:
You begin by plotting the first column against the second column to get a better idea of what the data structure looks like. There will be plenty of times where the cause of differences between groups will prove to be underwhelming, however the cases that have differences that you can take action on are extremely rewarding!
You are given two-dimensional plots. Please look at the provided two-dimensional graphs and identify the clusters, to drive the point home that machine learning is important. Without using any algorithmic approaches, identify where the clusters exist in the data.
This exercise will help start to build your intuition of how we identify clusters using our own eyes and thought processes. As you complete the exercises, think of the rationale of why a group of data points should be considered a cluster versus a group that should not be considered a cluster:
Identify the clusters in the following scatterplot:
The clusters are as follows:
Identify the clusters in the scatterplot:
The clusters are as follows:
Identify the clusters in the scatterplot:
The clusters are as follows:
Most of these examples were likely quite easy for you to understand – and that's the point! The human brain and eyes are incredible at finding patterns in the real world. Within milliseconds of viewing each plot, you could tell what fitted together and what didn't. While it is easy for you, a computer does not have the ability to see and process plots in the same manner that we do. However, this is not always a bad thing – look back at Figure 1.10. Were you able to find the six discrete clusters in the data just by looking at the plot? You probably found only three to four clusters in this figure, while a computer is able to see all six. The human brain is magnificent, but it also lacks the nuances that come within a strictly logic-based approach. Through algorithmic clustering, you will learn how to build a model that works even better than a human at these tasks!
Let's look at the algorithm in the next section.
Hopefully, by now, you can see that finding clusters is extremely valuable in a machine learning workflow. However, how can you actually find these clusters? One of the most basic yet popular approaches is by using a cluster analysis called k-means clustering. k-means works by searching for K clusters in your data and the workflow is actually quite intuitive – we will start with the no-math introduction to k-means, followed by an implementation in Python.
Here is the no-math algorithm of k-means clustering:
Pick K centroids (K = expected distinct # of clusters).
Randomly place K centroids anywhere amongst your existing training data.
Calculate the Euclidean distance from each centroid to all the points in your training data.
Training data points get grouped in with their nearest centroid.
Amongst the data points grouped into each centroid, calculate the mean data point and move your centroid to that location.
Repeat this process until convergence, or when the membership in each group no longer changes.
And that's it! Here is the process laid out step-by-step with a simple cluster example:
Provided with the original data in Figure 1.11, we can show the iterative process of k-means by showing the predicted clusters in each step:
To understand k-means at a deeper level, let's walk through the example given in the introductory section again with some of the math that supports k-means. The key component at play is the Euclidean distance formula:
Centroids are randomly set at the beginning as points in your n-dimensional space. Each of these centers is fed into the preceding formula as (a,b), and a point in your space is fed in as (x,y). Distances are calculated between each point and the coordinates of every centroid, with the centroid the shortest distance away chosen as the point's group.
The process is as follows:
Random Centroids: [ (2,5) , (8,3) , (4, 5) ]
Arbitrary point x: (0, 8)
Distance from point to each centroid: [ 3.61, 9.43, 5.00 ]
Point x is assigned to Centroid 1.
Euclidean distance is the most common distance metric for many machine learning applications and is often known colloquially as the distance metric; however, it is not the only, or even the best, distance metric for every situation. Another popular distance metric in use for clustering is Manhattan distance.
Manhattan distance is called as such because the intuition behind the metric is as though you were driving a car through a metropolis (such as New York City) that has many square blocks. Euclidean distance relies on diagonals due to it being based on Pythagorean theorem, while Manhattan distance constrains distance to only right angles. The formula for Manhattan distance is as follows:
Here, are vectors as in Euclidean distance. Building upon our examples of Euclidean distance, where we want to find the distance between two points, if and , then the Manhattan distance would equal . This functionality scales to any number of dimensions. In practice, Manhattan distance may outperform Euclidean distance when it comes to higher dimensional data.
The preceding examples are clear to visualize when your data is only two-dimensional. This is for convenience, to help drive the point home of how k-means works and could lead you into a false understanding of how easy clustering is. In many of your own applications, your data will likely be orders of magnitude larger to the point that it cannot be perceived by visualization (anything beyond three dimensions will be imperceivable to humans). In the previous examples, you could mentally work out a few two-dimensional lines to separate the data into its own groups. At higher dimensions, you will need to be aided by a computer to find an n-dimensional hyperplane that adequately separates the dataset. In practice, this is where clustering methods such as k-means provide significant value.
In the next exercise, we will calculate Euclidean distance. We will use the NumPy and Math packages. NumPy is a scientific computing package for Python that pre-packages common mathematical functions in highly-optimized formats. By using a package such as NumPy or Math, we help cut down the time spent creating custom math functions from scratch and instead focus on developing our solutions.
In this exercise, we will create an example point along with three sample centroids to help illustrate how Euclidean distance works. Understanding this distance formula is foundational to the rest of our work in clustering.
By the end of this exercise, we will be able to implement Euclidean distance from scratch and fully understand what it does to points in a feature space.
In this exercise, we will be using the standard Python built-in math package. There are no prerequisites for using the math package and it is included in all standard installations of Python. As the name suggests, this package is very useful, allowing to use a variety of basic math building blocks off the shelf, such as exponentials, square roots, and others:
Open a Jupyter notebook and create a naïve formula that captures the direct math of Euclidean distance, as follows:
import math import numpy as np def dist(a, b): return math.sqrt(math.pow(a-b,2) + math.pow(a-b,2))
This approach is considered naïve because it performs element-wise calculations on your data points (slow) compared to a more real-world implementation using vectors and matrix math to achieve significant performance increases.
Create the data points in Python as follows:
centroids = [ (2, 5), (8, 3), (4,5) ] x = (0, 8)
Use the formula you created to calculate the Euclidean distance between the example point and each of the three centroids you were provided:
centroid_distances = for centroid in centroids: centroid_distances.append(dist(x,centroid)) print(centroid_distances) print(np.argmin(centroid_distances))
The output is as follows:
[3.605551275463989, 9.433981132056603, 5.0] 0
Since Python is zero-indexed, a position of zero as the minimum in our list of centroid distances signals to us that the example point, x, will be assigned to the number one centroid of three.
This process is repeated for every point in the dataset until each point is assigned to a cluster. After each point is assigned, the mean point is calculated among all of the points within each cluster. The calculation of the mean among these points is the same as calculating a mean between single integers.
Now that you have found clusters in your data using Euclidean distance as the primary metric, think back to how you did this easily in Exercise 2, Calculating Euclidean Distance in Python. It is very intuitive for our human minds to see groups of dots on a plot and determine which dots belong to discrete clusters. However, how do we ask a naïve computer to repeat this same task? By understanding this exercise, you help teach a computer an approach to forming clusters of its own with the notion of distance. We will build upon how we use these distance metrics in the next exercise.
By understanding this exercise, you'll help to teach a computer an approach to forming clusters of its own with the notion of distance. We will build upon how we use these distance metrics in this exercise:
Store the points [ (0,8), (3,8), (3,4) ] that are assigned to cluster one:
cluster_1_points =[ (0,8), (3,8), (3,4) ]
Calculate the mean point between all of the points to find the new centroid:
mean =[ (0+3+3)/3, (8+8+4)/3 ] print(mean)
The output is as follows:
After a new centroid is calculated, you will repeat the cluster membership calculation seen in Exercise 2, Calculating Euclidean Distance in Python, and then the previous two steps to find the new cluster centroid. Eventually, the new cluster centroid will be the same as the one you had entering the problem, and the exercise will be complete. How many times this repeats depends on the data you are clustering.
Once you have moved the centroid location to the new mean point of (2, 6.67), you can compare it to the initial list of centroids you entered the problem with. If the new mean point is different than the centroid that is currently in your list, that means you have to go through another iteration of the preceding two exercises. Once the new mean point you calculate is the same as the centroid you started the problem with, you have completed a run of k-means and reached a point called convergence.
In the next exercise, we will implement k-means from scratch.
In this exercise, we will have a look at the implementation of k-means from scratch. This exercise relies on scikit-learn, an open-source Python package that enables the fast prototyping of popular machine learning models. Within scikit-learn, we will be using the datasets functionality to create a synthetic blob dataset. In addition to harnessing the power of scikit-learn, we will also rely on Matplotlib, a popular plotting library for Python that makes it easy for us to visualize our data. To do this, perform the following steps:
Import the necessary libraries:
from sklearn.datasets import make_blobs import matplotlib.pyplot as plt import numpy as np import math %matplotlib inline
Generate a random cluster dataset to experiment on X = coordinate points, y = cluster labels, and define random centroids:
X, y = make_blobs(n_samples=1500, centers=3, n_features=2, random_state=800) centroids = [[-6,2],[3,-4],[-5,10]]
Print the data:
The output is as follows:
array([[-3.83458347, 6.09210705], [-4.62571831, 5.54296865], [-2.87807159, -7.48754592], ..., [-3.709726 , -7.77993633], [-8.44553266, -1.83519866], [-4.68308431, 6.91780744]])
Plot the coordinate points as follows:
plt.scatter(X[:, 0], X[:, 1], s=50, cmap='tab20b') plt.show()
The plot looks as follows:
Print the array of y:
The output is as follows:
array([2, 2, 1, ..., 1, 0, 2])
Plot the coordinate points with the correct cluster labels:
plt.scatter(X[:, 0], X[:, 1], c=y,s=50, cmap='tab20b') plt.show()
The plot looks as follows:
Let's recreate these results on our own! We will go over an example implementing this with some optimizations. This exercise is built on top of the previous exercise and should be performed in the same Jupyter notebook. For this exercise, we will rely on SciPy, a Python package that allows easy access to highly optimized versions of scientific calculations. In particular, we will be implementing Euclidean distance with cdist, the functionally of which replicates the barebones implementation of our distance metric in a much more efficient manner:
A non-vectorized implementation of Euclidean distance is as follows:
def dist(a, b): return math.sqrt(math.pow(a-b,2) + math.pow(a-b,2))
Now, implement the optimized Euclidean distance:
from scipy.spatial.distance import cdist
Store the values of X:
The output is as follows:
array([[-3.09897933, 4.79407445], [-3.37295914, -7.36901393], [-3.372895 , 5.10433846], [-5.90267987, -3.28352194], [-3.52067739, 7.7841276 ]])
Calculate the distances and choose the index of the shortest distance as a cluster:
for x in X[105:110]: calcs =  for c in centroids: calcs.append(dist(x, c)) print(calcs, "Cluster Membership: ", np.argmin(calcs, axis=0))
Define the k_means function as follows and initialize k-centroids randomly. Repeat the process until the difference between new/old centroids equal 0 using the while loop:
def k_means(X, K): # Keep track of history so you can see k-means in action centroids_history =  labels_history =  rand_index = np.random.choice(X.shape, K) centroids = X[rand_index] centroids_history.append(centroids) while True: # Euclidean distances are calculated for each point relative to # centroids, and then np.argmin returns the index location of the # minimal distance - which cluster a point is assigned to labels = np.argmin(cdist(X, centroids), axis=1) labels_history.append(labels) # Take mean of points within clusters to find new centroids new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)]) centroids_history.append(new_centroids) # If old centroids and new centroids no longer change, k-means is # complete and end. Otherwise continue if np.all(centroids == new_centroids): break centroids = new_centroids return centroids, labels, centroids_history, labels_history centers, labels, centers_hist, labels_hist = k_means(X, 3)
Zip together the historical steps of centers and their labels:
for x, y in history: plt.figure(figsize=(4,3)) plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='tab20b'); plt.scatter(x[:, 0], x[:, 1], c='red')
As you can see in the above figures, k-means takes an iterative approach to refining optimal clusters based on distance. The algorithm starts with random initialization and depending on the complexity of the data, quickly finds the separations that make the most sense.
The first plot is as follows:
The second plot is as follows:
The third plot is as follows:
Understanding the performance of unsupervised learning methods is inherently much more difficult than supervised learning methods because, often, there is no clear-cut "best" solution. For supervised learning, there are many robust performance metrics – the most straightforward of these being accuracy in the form of comparing model-predicted labels to actual labels and seeing how many the model got correct. Unfortunately, for clustering, we do not have labels to rely on and need to build an understanding of how "different" our clusters are. We achieve this with the Silhouette Score metric. Inherent to this approach, we can also use Silhouette Scores to find optimal "K" numbers of clusters for our unsupervised learning methods.
The Silhouette metric works by analyzing how well a point fits within its cluster. The metric ranges from -1 to 1 – If the average silhouette score across your clustering is one, then you will have achieved perfect clusters and there will be minimal confusion about which point belongs where. If you think of the plots in our last exercise, the Silhouette score will be much closer to one, since the blobs are tightly condensed and there is a fair amount of distance between each blob. This is very rare though – the Silhouette Score should be treated as an attempt at doing the best you can, since hitting one is highly unlikely.
Mathematically, the Silhouette Score calculation is quite straightforward via the Simplified Silhouette Index (SSI), as where is the distance from point i to its own cluster centroid and is the distance from point i to the nearest cluster centroid.
The intuition captured here is that represents how cohesive point i's cluster is as a clear cluster, and represents how far apart the clusters lie. We will use the optimized implementation of silhouette_score in scikit-learn for Activity 1, Implementing k-means Clustering. Using it is simple and only requires you to pass in the feature array and the predicted cluster labels from your k-means clustering method.
In the next exercise, we will use the pandas library to read a CSV. Pandas is a Python library that makes data wrangling easier through the use of DataFrames. To read data in Python, you will use variable_name = pd.read_csv('file_name.csv', header=None).
In this exercise, we're going to learn how to calculate the Silhouette Score of a dataset with a fixed number of clusters. For this, we will use the Iris dataset, which is available at https://github.com/TrainingByPackt/Unsupervised-Learning-with-Python/tree/master/Lesson01/Exercise06.
This dataset was downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data. It can be accessed at https://github.com/TrainingByPackt/Unsupervised-Learning-with-Python/tree/master/Lesson01/Exercise06.
Load the Iris data file using pandas, a package that makes data wrangling much easier through the use of DataFrames:
import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.metrics import silhouette_score from scipy.spatial.distance import cdist iris = pd.read_csv('iris_data.csv', header=None) iris.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'species']
Separate the X features, since we want to treat this as an unsupervised learning problem:
X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
Bring back the k_means function we made earlier for reference:
def k_means(X, K): #Keep track of history so you can see k-means in action centroids_history =  labels_history =  rand_index = np.random.choice(X.shape, K) centroids = X[rand_index] centroids_history.append(centroids) while True: # Euclidean distances are calculated for each point relative to # centroids, #and then np.argmin returns # the index location of the minimal distance - which cluster a point # is #assigned to labels = np.argmin(cdist(X, centroids), axis=1) labels_history.append(labels) #Take mean of points within clusters to find new centroids: new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)]) centroids_history.append(new_centroids) # If old centroids and new centroids no longer change, k-means is # complete and end. Otherwise continue if np.all(centroids == new_centroids): break centroids = new_centroids return centroids, labels, centroids_history, labels_history
Convert our Iris X feature DataFrame to a NumPy matrix:
X_mat = X.values
Run our k_means function on the Iris matrix:
centroids, labels, centroids_history, labels_history = k_means(X_mat, 3)
Calculate the Silhouette Score for the PetalLengthCm and PetalWidthCm columns:
The output is similar to:
In this exercise, we calculated the Silhouette Score for the PetalLengthCm and PetalWidthCm columns of the Iris dataset.
Scenario: You are asked in an interview to implement a k-means clustering algorithm from scratch to prove that you understand how it works. We will be using the Iris dataset provided by the UCI ML repository. The Iris dataset is a classic in the data science world and has features that are used to predict Iris species. The download location can be found later in this activity.
For this activity, you are able to use Matplotlib, NumPy, scikit-learn metrics, and pandas.
By loading and reshaping data easily, you can focus more on learning k-means instead of writing dataloader functionality.
Iris data columns are provided as follows for reference:
['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'species']
Aim: To truly understand how something works, you need to build it from scratch. Take what you have learned in the previous sections and implement k-means from scratch in Python.
Please open your favorite editing platform and try the following:
Using NumPy or the math package and the Euclidean distance formula and write a function that calculates the distance between two coordinates.
Write a function that calculates the distance from centroids to each of the points in your dataset and returns the cluster membership.
Write a k-means function that takes in a dataset and the number of clusters (K) and returns the final cluster centroids, as well as the data points that make up that cluster's membership. After implementing k-means from scratch, apply your custom algorithm to the Iris dataset, located here: https://github.com/TrainingByPackt/Unsupervised-Learning-with-Python/tree/master/Lesson01/Activity01.
This dataset was downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data. It can be accessed at https://github.com/TrainingByPackt/Unsupervised-Learning-with-Python/tree/master/Lesson01/Activity01.
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Remove the classes supplied in this dataset and see if your k-means algorithm can group the different Iris species into their proper groups just based on plant characteristics!
Calculate the Silhouette Score using the scikit-learn implementation.
Outcome: By completing this exercise, you will gain hands-on experience of tuning a k-means clustering algorithm for a real-world dataset. The Iris dataset is seen as a classic "hello world" type problem in the data science space and is helpful for testing foundational techniques on. Your final clustering algorithm should do a decent job of finding the three clusters of Iris species types that exist in the data, as follows:
In this chapter, we have explored what clustering is and why it is important in a variety of data challenges. Building upon this foundation of clustering knowledge, you implemented k-means, which is one of the simplest yet most popular methods of unsupervised learning. If you have reached this summary and can repeat what k-means does step-by-step to your fellow classmate, good job! If not, please go back and review the previous material – the content only grows in complexity from here. From here, we will be moving on to hierarchical clustering, which, in one configuration, reuses the centroid learning approach that we used in k-means. We will build upon this approach by outlining additional clustering methodologies and approaches in the next chapter.