You're reading from Machine Learning with scikit-learn Quick Start Guide
Published in Oct 2018 by Packt | 1st Edition | ISBN-13: 9781789343700 | Reading level: Intermediate
Author: Kevin Jolly

Kevin Jolly is a formally educated data scientist with a master's degree in data science from the prestigious King's College London. Kevin works as a statistical analyst with a digital healthcare start-up, Connido Limited, in London, where he is primarily involved in leading the data science projects that the company undertakes. He has built machine learning pipelines for small and big data, with a focus on scaling such pipelines into production for the products that the company has built. Kevin is also the author of a book titled Hands-On Data Visualization with Bokeh, published by Packt. He is the editor-in-chief of Linear, a weekly online publication on data science software and products.

Predicting Categories with K-Nearest Neighbors

The k-Nearest Neighbors (k-NN) algorithm is a form of supervised machine learning that is used to predict categories. In this chapter, you will learn about the following:

  • Preparing a dataset for machine learning with scikit-learn
  • How the k-NN algorithm works under the hood
  • Implementing your first k-NN algorithm to predict a fraudulent transaction
  • Fine-tuning the parameters of the k-NN algorithm
  • Scaling your data for optimized performance

The k-NN algorithm has a wide range of applications in the field of classification and supervised machine learning. Some of the real-world applications for this algorithm include predicting loan defaults and credit-based fraud in the financial industry and predicting whether a patient has cancer in the healthcare industry.

This book's design facilitates the implementation of a robust machine...

Technical requirements

Preparing a dataset for machine learning with scikit-learn

The first step to implementing any machine learning algorithm with scikit-learn is data preparation. Scikit-learn imposes a set of constraints on the input data, which will be discussed later in this section. The dataset that we will be using is based on mobile payments and is found on the world's most popular competitive machine learning website – Kaggle.

You can download the dataset from: https://www.kaggle.com/ntnu-testimon/paysim1.

Once downloaded, open a new Jupyter Notebook by running the following command in Terminal (macOS/Linux) or Anaconda Prompt/PowerShell (Windows):

jupyter notebook

Our fundamental goal with this dataset is to predict whether a mobile transaction is fraudulent. In order to do this, we first need a brief understanding of the contents of our data. In order to explore the dataset...
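A first look at the data might be sketched as follows. The column names here are assumptions based on the public PaySim dataset, and the tiny inline frame is a stand-in for the real CSV (in practice you would call pd.read_csv() on the file downloaded from Kaggle):

```python
import pandas as pd

# Stand-in for the real PaySim CSV; the rows shown are illustrative only.
df = pd.DataFrame({
    "step": [1, 1, 2],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.0, 181.0],
    "isFraud": [0, 1, 1],
})

print(df.head())                     # first rows of the data
print(df["isFraud"].value_counts())  # class balance of the target
```

Inspecting the class balance of the target column is a sensible first step with fraud data, since fraudulent transactions are typically a small minority.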

The k-NN algorithm

Mathematically speaking, the k-NN algorithm is one of the simplest machine learning algorithms out there. See the following diagram for a visual overview of how it works:

How k-NN works under the hood

The stars in the preceding diagram represent new data points. If we built a k-NN algorithm with three neighbors, each star would search for the three data points that are closest to it.

In the lower-left case, the star sees two triangles and one circle. Therefore, the algorithm would classify the star as a triangle since the number of triangles was greater than the number of circles.

In the upper-right case, the star sees two circles and one triangle. Therefore, the algorithm will classify the star as a circle since the number of circles was greater than the number of triangles.

The real algorithm does this in a very probabilistic manner and picks the...
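The majority-vote logic described above can be sketched in plain Python and NumPy. This is an illustrative toy, not the book's implementation; scikit-learn's version is far more optimized:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two "triangles" near (0, 0) and three "circles" near (5, 5),
# mirroring the diagram's two cases
X = np.array([[0.0, 0.0], [0.5, 0.5], [5.0, 5.0], [5.5, 5.0], [4.5, 5.5]])
y = np.array(["triangle", "triangle", "circle", "circle", "circle"])

print(knn_predict(X, y, np.array([5.0, 4.8]), k=3))  # a star near the circles
```

A star dropped near the circle cluster finds mostly circles among its three neighbors and is labeled a circle; a star near the triangles is labeled a triangle.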

Implementing the k-NN algorithm using scikit-learn

In the following section, we will implement the first version of the k-NN algorithm and assess its initial accuracy. When implementing machine learning algorithms with scikit-learn, it is always good practice to first implement an algorithm without fine-tuning or optimizing any of its parameters, in order to establish a baseline for how well it performs.

In the following section, you will learn how to do the following:

  • Split your data into training and test sets
  • Implement the first version of the algorithm on the data
  • Evaluate the accuracy of your model using a k-NN score
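The three steps above might be sketched as follows; since the fraud dataset is not loaded here, synthetic data from make_classification stands in for it:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the fraud dataset (features X, labels y)
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# Hold out 30% of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# First version: three neighbors, no tuning yet
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# .score() reports mean accuracy on the held-out test set
print(knn.score(X_test, y_test))
```

Stratifying the split keeps the class proportions the same in both sets, which matters for imbalanced targets such as fraud labels.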

Splitting the data into training and test sets

The idea of training and test sets is fundamental to every...

Fine-tuning the parameters of the k-NN algorithm

In the previous section, we arbitrarily set the number of neighbors to three while initializing the k-NN classifier. However, is this the optimal value? Well, it could be, since we obtained a relatively high accuracy score in the test set.
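The chapter's summary mentions tuning with GridSearchCV; a minimal sketch of searching over n_neighbors that way, again on synthetic stand-in data rather than the fraud dataset, could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for illustration
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Try odd neighbor counts from 1 to 23 with 5-fold cross-validation;
# odd values avoid ties in the two-class majority vote
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 24, 2))},
                    cv=5)
grid.fit(X, y)

print(grid.best_params_["n_neighbors"])  # cross-validated best value of k
```

Cross-validation scores each candidate value of k on data it was not trained on, which guards against picking a k that merely memorizes the training set.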

Our goal is to create a machine learning model that neither overfits nor underfits the data. Overfitting means that the model has been trained very specifically to the training examples provided and will not generalize well to cases/examples of data that it has not encountered before. For instance, if the test cases happen to be very similar to the training data, a model fit very specifically to the training data could still perform well and produce a deceptively high accuracy score.

Underfitting is another extreme case, in which the model...

Scaling for optimized performance

The k-NN algorithm is distance-based. When a new data point is thrown into the dataset and the algorithm is given the task of classifying it, it uses distance to find the points that are closest to it.

If we have features with different ranges of values – for example, feature one ranges from 0 to 800 while feature two ranges from 1 to 5 – the distance metric no longer makes sense. We want all the features to have the same range of values so that the distance metric is on level terms across all features.

One way to do this is to subtract the mean of a feature from each of its values and divide by that feature's standard deviation. This is called standardization:

z = (x − μ) / σ
We can do this for our dataset by using the following code:

from sklearn.preprocessing...
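A minimal sketch of what that standardization step might look like, using scikit-learn's StandardScaler on a toy array rather than the book's actual dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature one ranges into the hundreds, feature two stays between 1 and 5,
# echoing the mismatched ranges described in the text
X = np.array([[100.0, 1.0],
              [400.0, 3.0],
              [800.0, 5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and unit variance
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, both features contribute on equal terms to the Euclidean distances that k-NN relies on.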

Summary

This chapter was fundamental in helping you prepare a dataset for machine learning with scikit-learn. You have learned about the constraints that are imposed when you do machine learning with scikit-learn and how to create a dataset that is perfect for scikit-learn.

You have also learned how the k-NN algorithm works behind the scenes and have implemented a version of it using scikit-learn to predict whether a transaction was fraudulent. You then learned how to optimize the parameters of the algorithm using the popular GridSearchCV technique. Finally, you have learned how to standardize and scale your data in order to optimize the performance of your model.

In the next chapter, you will learn how to classify fraudulent transactions yet again with a new algorithm – the logistic regression algorithm!

