Agent Hobbs: Agent… Agent O’Connor… O’Connor!
Agent O’Connor: Blimey! Uh.. Ohh.. Sorry, sir!
Hobbs: ‘Tat’s abou’ the fifth time oive’ caught you sleeping on duty, young man!
O’Connor: Umm. Apologies, sir. I just arrived here, and didn’t have much to…
Hobbs: Cut the bull, agent! There’s an important case oime workin’ on and oi’ need information on this righ’ awai’! Here’s the list of missing persons kidnapped so far by the suspects. The suspects now taunt us with open clues abou’ their next target! Based on the information, we’ve narrowed their target list down to Miss Gibbons and Mr. Hudson.
Hobbs throws a file across O’Connor’s desk.
Hobbs says as he storms out the door: You ‘ave an hour to find out who needs the special security, so better get working.
O’Connor: Yes, sir! Bloody hell, that was close!
Here’s the information O’Connor has:
He needs to find the probability that the 11th person, with a height of 172 cm, a weight of 60 kg, and long hair, is a man.
O’Connor gets to work. To simplify matters, he removes the Hair length column as well as the Gender column, since he would like to cluster the people in the table based on their height and weight alone. To find out whether the 11th person in the table is more likely to be a man or a woman, he uses clustering:
O’Connor could apply scaling to the initial data, but to keep things simple, he uses the unscaled data in the algorithm. He clusters the data into two clusters, since there are two possible genders – male and female. He will then classify the person with a height of 172 cm and a weight of 60 kg as more likely a man if and only if the majority of the people in that person’s cluster are men. Clustering is an efficient technique, so classifying this way is fast, especially when there are many data items to classify.
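A quick way to check this plan (not O’Connor’s own implementation, which appears later) is a minimal sketch with scikit-learn’s KMeans, assuming that library is available; the eleven height-weight pairs are the ones listed in the CSV file later in this section:

# A hedged cross-check with scikit-learn, not the implementation developed below.
from sklearn.cluster import KMeans

# Height (cm) and weight (kg) of the ten known people plus the unknown person.
people = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
          (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(people)
unknown_group = kmeans.labels_[-1]  # the cluster of (172, 60), the last row
# The other members of the unknown person's cluster; the majority gender
# among them (taken from the table) then decides the classification.
same_cluster = [p for p, g in zip(people[:-1], kmeans.labels_[:-1])
                if g == unknown_group]
print(same_cluster)

For this small, well-separated dataset the resulting clusters should match O’Connor’s hand computation below.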
So he goes on to apply the k-means clustering algorithm to the data he has. First, he picks the initial centroids. He takes the first centroid to be, for example, a person with a height of 180 cm and a weight of 75 kg, denoted as the vector (180,75). Then the point that is furthest away from (180,75) is (155,46), so that will be the second centroid.
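This choice of the second centroid can be verified with a few lines of Python (a small checking sketch using the height-weight pairs from the data):

import math

points = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59),
          (164, 53), (155, 46), (162, 52), (166, 55), (172, 60)]
first_centroid = (180, 75)

def euclidean_dist(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# The point with the greatest distance from the first centroid becomes
# the second centroid.
second_centroid = max(points, key=lambda p: euclidean_dist(p, first_centroid))
print(second_centroid)  # (155, 46), roughly 38.3 away from (180, 75)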
The points that are closer to the first centroid (180,75) by Euclidean distance are (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), and (172,60), so these points form the first cluster. The points that are closer to the second centroid (155,46) are (155,46), (164,53), (162,52), and (166,55), so these points form the second cluster. He displays the current situation of these two clusters in the image below.
Image 1: Clustering of people by their height and weight
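For example, the point (172,60) is assigned to the first cluster because its distance to (180,75) is √((180-172)² + (75-60)²) = √(64 + 225) = √289 = 17, while its distance to (155,46) is √((172-155)² + (60-46)²) = √(289 + 196) = √485 ≈ 22.0.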
He then recomputes the centroids of the clusters. The blue cluster with the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60) will have the centroid ((180+174+184+168+178+170+172)/7, (75+71+83+63+70+59+60)/7) ≈ (175.14, 68.71).
The red cluster with the points (155,46), (164,53), (162,52), (166,55) will have the centroid ((155+164+162+166)/4, (46+53+52+55)/4) = (161.75, 51.5).
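These means are easy to double-check with a short snippet (a small verification sketch):

blue = [(180, 75), (174, 71), (184, 83), (168, 63), (178, 70), (170, 59), (172, 60)]
red = [(155, 46), (164, 53), (162, 52), (166, 55)]

def centroid(cluster):
    # The centroid is simply the coordinate-wise mean of the cluster's points.
    return (sum(x for x, _ in cluster) / float(len(cluster)),
            sum(y for _, y in cluster) / float(len(cluster)))

print(centroid(blue))  # approximately (175.14, 68.71)
print(centroid(red))   # (161.75, 51.5)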
Reclassifying the points using the new centroids, the assignments of the points do not change. The blue cluster will still have the points (180,75), (174,71), (184,83), (168,63), (178,70), (170,59), (172,60). The red cluster will still have the points (155,46), (164,53), (162,52), (166,55). Therefore the clustering algorithm terminates with the clusters displayed in the following image:
Image 2: Clustering of people by their height and weight
Now he classifies the instance (172,60) as male or female. The instance (172,60) is in the blue cluster, so it is similar to the other points in the blue cluster. Are the remaining people in the blue cluster more likely males or females? 5 out of the 6 are males and only 1 is a female. Since the majority of the people in the blue cluster are males and the person (172,60) is in the blue cluster as well, he classifies the person with a height of 172 cm and a weight of 60 kg as a male.
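The confidence figure O’Connor quotes at the end follows directly from this majority of 5 men out of the 6 other people in the cluster:

males, females = 5, 1  # the other people in the blue cluster, from the table
confidence = 100.0 * males / (males + females)
print(round(confidence, 2))  # 83.33 (percent)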
O’Connor implements the k-means clustering algorithm in Python. It takes as input a CSV file with one data item per line. Each data item is converted to a point. The algorithm clusters these points into the specified number of clusters. At the end, the clusters are visualized on a graph using the matplotlib library:
# source_code/5/k-means_clustering.py
import math
import sys
import matplotlib.pyplot as plt
import matplotlib
sys.path.append('../common')
import common  # noqa
matplotlib.style.use('ggplot')
# Returns k initial centroids for the given points.
def choose_init_centroids(points, k):
    centroids = []
    centroids.append(points[0])
    while len(centroids) < k:
        # Find the point with the greatest distance
        # to its closest already chosen centroid.
        candidate = points[0]
        candidate_dist = min_dist(points[0], centroids)
        for point in points:
            dist = min_dist(point, centroids)
            if dist > candidate_dist:
                candidate = point
                candidate_dist = dist
        centroids.append(candidate)
    return centroids
# Returns the distance of a point from the closest point in points.
def min_dist(point, points):
    min_dist = euclidean_dist(point, points[0])
    for point2 in points:
        dist = euclidean_dist(point, point2)
        if dist < min_dist:
            min_dist = dist
    return min_dist
# Returns the Euclidean distance between two 2-dimensional points.
def euclidean_dist((x1, y1), (x2, y2)):
    return math.sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2))
# PointGroup is a tuple that contains in the first coordinate a 2d point
# and in the second coordinate a group which a point is classified to.
def choose_centroids(point_groups, k):
    centroid_xs = [0] * k
    centroid_ys = [0] * k
    group_counts = [0] * k
    for ((x, y), group) in point_groups:
        centroid_xs[group] += x
        centroid_ys[group] += y
        group_counts[group] += 1
    centroids = []
    for group in range(0, k):
        centroids.append((
            float(centroid_xs[group]) / group_counts[group],
            float(centroid_ys[group]) / group_counts[group]))
    return centroids
# Returns the number of the centroid which is closest to the point.
# This number of the centroid is the number of the group
# the point belongs to.
def closest_group(point, centroids):
    selected_group = 0
    selected_dist = euclidean_dist(point, centroids[0])
    for i in range(1, len(centroids)):
        dist = euclidean_dist(point, centroids[i])
        if dist < selected_dist:
            selected_group = i
            selected_dist = dist
    return selected_group
# Reassigns the groups to the points according to which centroid
# a point is closest to.
def assign_groups(point_groups, centroids):
    new_point_groups = []
    for (point, group) in point_groups:
        new_point_groups.append(
            (point, closest_group(point, centroids)))
    return new_point_groups
# Returns a list of point groups given a list of points.
def points_to_point_groups(points):
    point_groups = []
    for point in points:
        point_groups.append((point, 0))
    return point_groups
# Clusters points into the k groups, adding every stage
# of the algorithm to the history, which is returned.
def cluster_with_history(points, k):
    history = []
    centroids = choose_init_centroids(points, k)
    point_groups = points_to_point_groups(points)
    while True:
        point_groups = assign_groups(point_groups, centroids)
        history.append((point_groups, centroids))
        new_centroids = choose_centroids(point_groups, k)
        done = True
        for i in range(0, len(centroids)):
            if centroids[i] != new_centroids[i]:
                done = False
                break
        if done:
            return history
        centroids = new_centroids
# Program start
csv_file = sys.argv[1]
k = int(sys.argv[2])
everything = False
# The third argument sys.argv[3] represents the number of the step of the
# algorithm, starting from 0, to be shown, or "last" for displaying the last
# step and the number of the steps.
if sys.argv[3] == "last":
    everything = True
else:
    step = int(sys.argv[3])
data = common.csv_file_to_list(csv_file)
points = data_to_points(data)  # Represent every data item by a point.
history = cluster_with_history(points, k)
if everything:
    print "The total number of steps:", len(history)
    print "The history of the algorithm:"
    (point_groups, centroids) = history[len(history) - 1]
    # Print all the history.
    print_cluster_history(history)
    # But display the situation graphically at the last step only.
    draw(point_groups, centroids)
else:
    (point_groups, centroids) = history[step]
    print "Data for the step number", step, ":"
    print point_groups, centroids
    draw(point_groups, centroids)
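The listing above also calls three helpers – data_to_points, print_cluster_history, and draw – that are not shown in this excerpt. A minimal sketch of what they might look like (an assumption; the book’s actual implementations may differ) is:

# Hypothetical sketches of the helpers used above (plt is matplotlib.pyplot,
# imported at the top of the listing); the book's actual code may differ.
def data_to_points(data):
    # Convert each CSV row, e.g. ['180', '75'], into a 2d point of floats.
    points = []
    for item in data:
        points.append((float(item[0]), float(item[1])))
    return points
def print_cluster_history(history):
    # Print the point groups and the centroids recorded at every step.
    for step, (point_groups, centroids) in enumerate(history):
        print "Step number " + str(step) + ": point_groups =", point_groups
        print "centroids =", centroids
def draw(point_groups, centroids):
    # Scatter-plot the points coloured by their group and mark the centroids.
    colors = ['blue', 'red', 'green', 'orange', 'purple']
    for ((x, y), group) in point_groups:
        plt.scatter(x, y, color=colors[group % len(colors)])
    for (x, y) in centroids:
        plt.scatter(x, y, color='black', marker='x', s=100)
    plt.xlabel('Height in cm')
    plt.ylabel('Weight in kg')
    plt.show()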
He saves the data for the classification into a CSV file:
# source_code/5/persons_by_height_and_weight.csv
180,75
174,71
184,83
168,63
178,70
170,59
164,53
155,46
162,52
166,55
172,60
O’Connor runs the program implementing the k-means clustering algorithm on the data he saved. The numerical argument 2 means that he would like to cluster the data into 2 clusters:
$ python k-means_clustering.py persons_by_height_and_weight.csv 2 last
The total number of steps: 2
The history of the algorithm:
Step number 0: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0),
((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0), ((170.0, 59.0), 0), ((164.0, 53.0), 1), ((155.0, 46.0), 1), ((162.0, 52.0), 1), ((166.0, 55.0), 1), ((172.0, 60.0), 0)]
centroids = [(180.0, 75.0), (155.0, 46.0)]
Step number 1: point_groups = [((180.0, 75.0), 0), ((174.0, 71.0), 0),
((184.0, 83.0), 0), ((168.0, 63.0), 0), ((178.0, 70.0), 0), ((170.0, 59.0), 0), ((164.0, 53.0), 1), ((155.0, 46.0), 1), ((162.0, 52.0), 1), ((166.0, 55.0), 1), ((172.0, 60.0), 0)]
centroids = [(175.14285714285714, 68.71428571428571), (161.75, 51.5)]
The program also outputs a graph, visible in the second image. The parameter last means that O’Connor would like the program to run the clustering until the last step. If he wants to display only the first step (step 0), he can change last to 0 and run:
$ python k-means_clustering.py persons_by_height_and_weight.csv 2 0
Upon the execution of the program, O’Connor gets the graph of the clusters and their centroids at the initial step, as in image 1. He heaves a sigh of relief.
Hobbs returns just then: Oye there O’Connor, not snoozing again now O’are ya?
O’Connor: Not at all, sir. I think we need to provide Mr. Hudson with special protection because it looks like he’s the next target.
Hobbs raises an eyebrow as he adjusts his gun in its holster: Emm, O’are ya sure, agent?
O’Connor replies with a smile: 83.33% confident, sir!
Hobbs: Wha’ are we waiting for then, eh? Let’s go get em!
If you liked reading this mystery, go ahead and buy the book it was inspired by: Data Science Algorithms in a Week, by David Natingga.