A nearest neighbor algorithm classifies a data instance based on its neighbors. The class of a data instance determined by the k-nearest neighbors algorithm is the class with the highest representation among the k-closest neighbors.
In this chapter, we will cover the following topics:
- How to implement the basics of the k-NN algorithm using the example of Mary and her temperature preferences
- How to choose a correctk value so that the algorithm can perform correctly and with the highest degree of accuracy using the example of a map of Italy
- How to rescale values and prepare them for the k-NN algorithm using the example of house preferences
- How to choose a good metric to measure distances between data points
- How to eliminate irrelevant dimensions in higher-dimensional space to ensure that the algorithm performs accurately using the text classification example
As an example, if we know that our friend, Mary, feels cold when it is 10°C, but warm when it is 25°C, then in a room where it is 22°C, the nearest neighbor algorithm would guess that our friend would feel warm, because 22 is closer to 25 than to 10.
Suppose that we would like to know when Mary feels warm and when she feels cold, as in the previous example, but in addition, wind speed data is also available when Mary is asked whether she feels warm or cold:
Temperature in °C | Wind speed in km/h | Mary's perception |
10 | 0 | Cold |
25 | 0 | Warm |
15 | 5 | Cold |
20 | 3 | Warm |
18 | 7 | Cold |
20 | 10 | Cold |
22 | 5 | Warm |
24 | 6 | Warm |
We could represent the data in a graph, as follows:
Now, suppose we would like to find out how Mary feels when the temperature is 16°C with a wind speed of 3 km/h by using the 1-NN algorithm:
For simplicity, we will use a Manhattan metric to measure the distance between the neighbors on the grid. The Manhattan distance d_{Man} of the neighbor N_{1}=(x_{1},y_{1}) from the neighbor N_{2}=(x_{2},y_{2}) is defined as d_{Man}=|x_{1}-x_{2}|+|y_{1}-y_{2}|.
Let's label the grid with distances around the neighbors to see which neighbor with a known class is closest to the point we would like to classify:
We can see that the closest neighbor with a known class is the one with a temperature of 15°C (blue) and a wind speed of 5 km/h. Its distance from the point in question is three units. Its class is blue (cold). The closest red (warm) neighbour is at a distance of four units from the point in question. Since we are using the 1-nearest neighbor algorithm, we just look at the closest neighbor and, therefore, the class of the point in question should be blue (cold).
By applying this procedure to every data point, we can complete the graph, as follows:
Note that, sometimes, a data point might be the same distance away from two known classes: for example, 20°C and 6 km/h. In such situations, we could prefer one class over the other, or ignore these boundary cases. The actual result depends on the specific implementation of an algorithm.
Now, we will implement the k-NN algorithm in Python to find Mary's temperature preference. At the end of this section, we will also implement the visualization of the data produced in the previous section, that is, Mary and her temperature preferences. The full, compilable code, with the input files, can be found in the source code provided with this book. The most important parts have been extracted and presented here:
Input:
The preceding program will use the following file as the source of the input data. The file contains the table with the known data about Mary's temperature preferences:
Output:
We run the preceding implementation on the mary_and_temperature_preferences.data
input file by using the k-NN algorithm for k=1
neighbors. The algorithm classifies all the points with integer coordinates in the rectangle with a size of (30-5=25) by (10-0=10)
, hence, with a size of (25+1) * (10+1) = 286
integer points (adding one to count points on boundaries). Using the wc
command, we find out that the output file contains exactly 286 lines—one data item per point. Using the head
command, we display the first 10 lines from the output file:
Visualization:
For the visualization depicted earlier in this chapter, the matplotlib
library was used. A data file is loaded, and then displayed in a scatter diagram: