You're reading from  Machine Learning with scikit-learn Quick Start Guide

Product type: Book
Published in: Oct 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789343700
Edition: 1st Edition
Author: Kevin Jolly

Kevin Jolly is a formally educated data scientist with a master's degree in data science from the prestigious King's College London. Kevin works as a statistical analyst with a digital healthcare start-up, Connido Limited, in London, where he is primarily involved in leading the data science projects that the company undertakes. He has built machine learning pipelines for small and big data, with a focus on scaling such pipelines into production for the products that the company has built. Kevin is also the author of a book titled Hands-On Data Visualization with Bokeh, published by Packt. He is the editor-in-chief of Linear, a weekly online publication on data science software and products.

Performance Evaluation Methods

Your method of performance evaluation will vary with the type of machine learning algorithm that you choose to implement. Classification, regression, and unsupervised machine learning algorithms each have their own metrics for determining how well a model is performing at its task.

In this chapter, we will explore how the different performance evaluation methods can help you to better understand your model. The chapter will be split into three sections, as follows:

  • Performance evaluation for classification algorithms
  • Performance evaluation for regression algorithms
  • Performance evaluation for unsupervised algorithms

Technical requirements

Why is performance evaluation critical?

It is key for you to understand why we need to evaluate the performance of a model in the first place. Some of the potential reasons why performance evaluation is critical are as follows:

  • It helps you detect overfitting: Overfitting occurs when your algorithm fits the training data too closely and makes predictions that are specific to only that dataset. In other words, your model cannot generalize its predictions to data that it was not trained on.
  • It helps you detect underfitting: This is the opposite of overfitting. In this case, the model is too simple to capture the patterns in the data, so it performs poorly on the training data as well as on new data.
  • Understanding predictions: Performance evaluation methods will help you to understand, in greater detail, how your model makes predictions, along with the nature of those predictions and other useful information, such as the accuracy of your model.
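
As a rough illustration of the first two points, the following sketch compares training and test accuracy for a very flexible model and a very simple one. The synthetic data from make_classification, the decision tree models, and the helper train_and_test_accuracy are assumptions made for this example; they are not part of the chapter's fraud detection code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)


def train_and_test_accuracy(model):
    """Fit the model and return (training accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# An unrestricted tree can memorize the training set
deep_train, deep_test = train_and_test_accuracy(
    DecisionTreeClassifier(random_state=42))

# A single-split tree (a "stump") is a deliberately simple model
stump_train, stump_test = train_and_test_accuracy(
    DecisionTreeClassifier(max_depth=1, random_state=42))

print(f"Deep tree: train={deep_train:.2f}, test={deep_test:.2f}")
print(f"Stump:     train={stump_train:.2f}, test={stump_test:.2f}")
```

A training score that is much higher than the test score usually points to overfitting, while scores that are low on both sets usually point to underfitting.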
...

Performance evaluation for classification algorithms

In order to evaluate the performance of classification, let's consider the two classification algorithms that we have built in this book: k-nearest neighbors and logistic regression.

The first step is to fit both of these algorithms to the fraud detection dataset. We can do this by using the following code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import linear_model

# Read in the fraud detection dataset
df = pd.read_csv('fraud_prediction.csv')

# Separate the features from the target column
features = df.drop('isFraud', axis=1).values
target = df['isFraud'].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state ...
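
Once both classifiers are fitted, they can be compared on the held-out test set. The following is a minimal sketch of that comparison, not the chapter's own code: it assumes synthetic, imbalanced data from make_classification in place of fraud_prediction.csv, and the hyperparameters (n_neighbors=5, max_iter=1000) and the results dictionary are illustrative choices.

```python
from sklearn import linear_model
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the fraud data: roughly 10% positive class (assumption)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": linear_model.LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Probability of the positive class, needed for the area under the ROC curve
    y_score = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }

for name, scores in results.items():
    print(name)
    print(scores["confusion_matrix"])  # rows: true class, columns: predicted class
    print(f"accuracy={scores['accuracy']:.3f}, roc_auc={scores['roc_auc']:.3f}")
```

On imbalanced data such as fraud labels, accuracy alone can look high even when few positive cases are caught, which is why the confusion matrix and the area under the curve are reported alongside it.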

Performance evaluation for regression algorithms

There are three main metrics that you can use to evaluate the performance of the regression algorithm that you built, as follows:

  • Mean absolute error (MAE)
  • Mean squared error (MSE)
  • Root mean squared error (RMSE)
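
For reference, the standard definitions of these three metrics can be written as follows. The symbols are introduced here for illustration and do not come from the chapter: n is the number of observations, y_i is the true target value, and \hat{y}_i is the model's prediction for observation i.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Because the MSE squares each error, it penalizes large errors more heavily than the MAE does; the RMSE brings the result back to the units of the target.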

In this section, you will learn what the three metrics are, how they work, and how you can implement them using scikit-learn. The first step is to build the linear regression algorithm. We can do this by using the following code:

# Building a simple linear regression model

import pandas as pd
from sklearn import linear_model

# Read in the dataset
df = pd.read_csv('fraud_prediction.csv')

# Define the feature and target arrays
feature = df['oldbalanceOrg'].values
target = df['amount'].values

# Initialize a linear regression model
linear_reg = linear_model.LinearRegression()

# Reshape the array, since we only have a single feature
feature = feature...
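
The three metrics can be computed from any pair of true and predicted values. The following is a small sketch using scikit-learn's mean_absolute_error and mean_squared_error; the y_true and y_pred arrays are toy values chosen for illustration and are not taken from the fraud dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values chosen for illustration only; they are not from the fraud dataset
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # mean of the absolute errors
mse = mean_squared_error(y_true, y_pred)    # mean of the squared errors
rmse = np.sqrt(mse)                         # back in the units of the target

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}")
# prints MAE=0.500, MSE=0.375, RMSE=0.612
```

With a fitted regression model, y_pred would instead come from the model's predict method applied to the feature array.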

Performance evaluation for unsupervised algorithms

In this section, you will learn how to evaluate the performance of an unsupervised machine learning algorithm, such as the k-means algorithm. The first step is to build a simple k-means model. We can do so by using the following code:

import pandas as pd
from sklearn.cluster import KMeans

# Read in the dataset
df = pd.read_csv('fraud_prediction.csv')

# Drop the target feature and the index column
df = df.drop(['Unnamed: 0', 'isFraud'], axis=1)

# Initialize k-means with 2 clusters
k_means = KMeans(n_clusters=2)

Now that we have a simple k-means model with two clusters, we can proceed to evaluate the model's performance. The different visual performance charts that can be deployed are as follows:

  • Elbow plot
  • Silhouette analysis plot

In this section, you will learn how to create and interpret each of the preceding plots.
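
The numbers behind both plots can be sketched as follows. This example assumes synthetic data from make_blobs with three hand-picked centers rather than the fraud dataset, and it uses the mean silhouette coefficient from silhouette_score as a one-number summary of a silhouette analysis; the variable names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=42)

ks = list(range(2, 7))
inertias = []      # within-cluster sum of squares, the y-axis of an elbow plot
silhouettes = []   # mean silhouette coefficient for each k

for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    inertias.append(model.inertia_)
    silhouettes.append(silhouette_score(X, labels))

best_k = ks[silhouettes.index(max(silhouettes))]

for k, inertia, sil in zip(ks, inertias, silhouettes):
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
print("k with the highest mean silhouette:", best_k)
```

Plotting ks against inertias gives the elbow plot: the "elbow" is the value of k after which the inertia stops dropping sharply. A mean silhouette coefficient close to 1 indicates compact, well-separated clusters.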

...

Summary

In this chapter, you learned how to evaluate the performance of three different types of machine learning algorithms: classification, regression, and unsupervised.

For the classification algorithms, you learned how to evaluate the performance of a model by using a series of visual techniques, such as the confusion matrix, normalized confusion matrix, area under the curve, K-S statistic plot, cumulative gains plot, lift curve, calibration plot, learning curve, and cross-validated box plot.

For the regression algorithms, you learned how to evaluate the performance of a model by using three metrics: the mean squared error, mean absolute error, and root mean squared error.

Finally, for the unsupervised machine learning algorithms, you learned how to evaluate the performance of a model by using the elbow plot and the silhouette analysis plot.

Congratulations! You have now made it to the end of your...
