Chapter 2. Learning How to Classify with Real-world Examples
Can a machine distinguish between flower species based on images? From a machine learning perspective, we approach this problem by having the machine learn how to perform this task from examples of each species, so that it can classify images where the species is not marked. This process is called classification (or supervised learning), and it is a classic problem that goes back a few decades.
We will explore small datasets using a few simple algorithms that we can implement manually. The goal is to understand the basic principles of classification. This will be a solid foundation for understanding later chapters, where we introduce more complex methods that will, by necessity, rely on code written by others.
The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification.
The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information.
The following four attributes of each plant were measured:
Sepal length
Sepal width
Petal length
Petal width
In general, we will call any measurement from our data a feature.
Additionally, for each plant, the species was recorded. The question now is: if we saw a new flower out in the field, could we make a good prediction about its species from its measurements?
This is the supervised learning or classification problem: given labeled examples, we can design a rule that will eventually be applied to other examples. This is the same setting that is used for spam classification; given the...
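As a concrete starting point, the dataset is easy to load with scikit-learn (an assumption on our part: the book's own code may load it differently, and scikit-learn must be installed):

```python
# A minimal sketch of loading the Iris dataset via scikit-learn.
from sklearn.datasets import load_iris

data = load_iris()
features = data.data           # shape (150, 4): the four measurements listed above
target = data.target           # species label for each plant (0, 1, or 2)
target_names = data.target_names

print(features.shape)          # (150, 4)
print(list(target_names))      # ['setosa', 'versicolor', 'virginica']
```

Each row of `features` is one plant; each column is one of the four measurements.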
Building more complex classifiers
In the previous section, we used a very simple model: a threshold on one of the dimensions. Throughout this book, you will see many other types of models, and we're not even going to cover everything that is out there.
What makes up a classification model? We can break it up into three parts:
The structure of the model: In this, we use a threshold on a single feature.
The search procedure: In this, we try every possible combination of feature and threshold.
The loss function: Using the loss function, we decide which of the possibilities is least bad (because we can rarely talk about the perfect solution). We can use the training error, or turn this around and say that we want the best accuracy. Traditionally, the loss function is the quantity to be minimized.
We can play around with these parts to get different results. For example, we can attempt to build a threshold that achieves minimal training error, but we will only test three values...
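These three parts can be made concrete in a few lines. The following sketch (illustrative, not the book's exact code) fixes the model structure to a single (feature, threshold) pair, searches every observed value as a candidate threshold, and uses the training error as the loss:

```python
import numpy as np

def fit_threshold_model(features, labels):
    """Exhaustively search every (feature, threshold) pair and keep the one
    with the lowest training error (labels are assumed to be 0 or 1)."""
    best = (0, 0.0, False, -1.0)   # (feature index, threshold, reverse?, accuracy)
    for fi in range(features.shape[1]):
        values = features[:, fi]
        for t in values:                   # candidate thresholds: the observed values
            for reverse in (False, True):  # which side of t counts as class 1
                pred = values > t
                if reverse:
                    pred = ~pred
                acc = np.mean(pred == labels)
                if acc > best[3]:
                    best = (fi, t, reverse, acc)
    return best

# Tiny hypothetical dataset: the first feature separates the classes perfectly.
X = np.array([[1.0, 5.0], [1.2, 1.0], [3.5, 5.2], [3.7, 0.9]])
y = np.array([0, 0, 1, 1])
fi, t, reverse, acc = fit_threshold_model(X, y)
print(fi, acc)   # feature 0, training accuracy 1.0
```

Swapping any one of the three parts, for example, testing only a few candidate thresholds instead of all of them, yields a different classifier.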
A more complex dataset and a more complex classifier
We will now look at a slightly more complex dataset. This will motivate the introduction of a new classification algorithm and a few other ideas.
Learning about the Seeds dataset
We will now look at another agricultural dataset; it is still small, but now too big to comfortably plot exhaustively as we did with Iris. This is a dataset of the measurements of wheat seeds. Seven features are present, as follows:
Area (A)
Perimeter (P)
Compactness (C = 4πA/P²)
Length of kernel
Width of kernel
Asymmetry coefficient
Length of kernel groove
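Compactness is the only derived feature in the list: it is computed from area and perimeter as C = 4πA/P², so a perfect circle scores 1 and more elongated shapes score less. A quick check:

```python
import math

def compactness(area, perimeter):
    """Compactness as defined for the Seeds dataset: C = 4*pi*A / P**2."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius 1: area = pi, perimeter = 2*pi
print(round(compactness(math.pi, 2 * math.pi), 6))   # 1.0
```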
There are three classes that correspond to three wheat varieties: Canadian, Kama, and Rosa. As before, the goal is to be able to classify the variety based on these morphological measurements.
Unlike the Iris dataset, which was collected in the 1930s, this is a very recent dataset, and its features were automatically computed from digital images.
This is how image pattern recognition can be implemented: you can take images in...
Binary and multiclass classification
The first classifier we saw, the threshold classifier, was a simple binary classifier (the result is either one class or the other as a point is either above the threshold or it is not). The second classifier we used, the nearest neighbor classifier, was a naturally multiclass classifier (the output can be one of several classes).
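A minimal version of the nearest neighbor classifier (a sketch with hypothetical toy data; the book's implementation may differ) simply returns the label of the closest training point:

```python
import numpy as np

def nearest_neighbor_predict(train_X, train_y, new_point):
    """Return the label of the training example closest to new_point
    (Euclidean distance): a minimal 1-nearest-neighbor classifier."""
    distances = np.sqrt(((train_X - new_point) ** 2).sum(axis=1))
    return train_y[distances.argmin()]

# Hypothetical toy data: three classes, one training point each.
X = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
y = np.array([0, 1, 2])
print(nearest_neighbor_predict(X, y, np.array([4.8, 5.1])))   # 1
```

Because the prediction is just the neighbor's label, this handles any number of classes with no extra machinery.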
It is often simpler to define a binary method than one that works directly on multiclass problems. However, we can reduce a multiclass problem to a series of binary decisions. This is what we did earlier with the Iris dataset, in a haphazard way: we observed that it was easy to separate one of the initial classes, and then focused on the other two, reducing the problem to two binary decisions.
Of course, we want to leave this sort of reasoning to the computer. As usual, there are several solutions to this multiclass reduction.
The simplest is to use...
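One common reduction is one-versus-rest: train one binary classifier per class (that class against everything else) and predict the class whose classifier is most confident. The binary learner below, a nearest-class-mean scorer, is purely illustrative:

```python
import numpy as np

def one_vs_rest_fit(X, y, fit_binary):
    """Train one binary scorer per class: that class versus everything else."""
    return {c: fit_binary(X, y == c) for c in np.unique(y)}

def one_vs_rest_predict(models, x):
    """Predict the class whose binary scorer is most confident."""
    return max(models, key=lambda c: models[c](x))

# Illustrative binary learner: score by (negated) distance to the class mean.
def fit_binary(X, is_class):
    mean = X[is_class].mean(axis=0)
    return lambda x, mean=mean: -np.linalg.norm(x - mean)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
              [5.1, 4.9], [10.0, 0.0], [9.9, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])
models = one_vs_rest_fit(X, y, fit_binary)
print(one_vs_rest_predict(models, np.array([4.9, 5.2])))   # class 1
```

With k classes this trains k binary models; any binary learner, such as the threshold classifier, can be plugged in as `fit_binary`.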
In a sense, this was a very theoretical chapter, as we introduced generic concepts with simple examples. We went over a few operations with a classic dataset. This, by now, is considered a very small problem. However, it has the advantage that we were able to plot it out and see what we were doing in detail. This is something that will be lost when we move on to problems with many dimensions and many thousands of examples. The intuitions we gained here will all still be valid.
Classification means generalizing from examples to build a model (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning, and we will see many more examples of this in forthcoming chapters.
We also learned that the training error is a misleading, over-optimistic estimate of how well the model does. We must, instead, evaluate it on testing data that was not used for training. In order to not waste too many examples in testing, a cross...
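The gap between training and testing error is easy to demonstrate: a 1-nearest-neighbor classifier is perfect on its own training data, yet errs on held-out points. A sketch with synthetic, overlapping classes (hypothetical data, not from the book):

```python
import numpy as np

def nn_predict(train_X, train_y, x):
    # 1-nearest-neighbor in one dimension: label of the closest training point.
    return train_y[np.abs(train_X - x).argmin()]

rng = np.random.default_rng(0)
# Two overlapping one-dimensional classes.
X = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
y = np.array([0] * 50 + [1] * 50)
idx = rng.permutation(100)
train, test = idx[:50], idx[50:]

train_err = np.mean([nn_predict(X[train], y[train], x) != t
                     for x, t in zip(X[train], y[train])])
test_err = np.mean([nn_predict(X[train], y[train], x) != t
                    for x, t in zip(X[test], y[test])])
print(train_err, test_err)   # training error is 0.0; the test error is not
```

Each training point's nearest neighbor is itself, so the training error is zero regardless of how overlapped the classes are; only the held-out error reflects the real difficulty of the problem.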