You're reading from Machine Learning with scikit-learn Quick Start Guide

Product type: Book
Published in: Oct 2018
Reading Level: Intermediate
Publisher: Packt
ISBN-13: 9781789343700
Edition: 1st Edition

Author: Kevin Jolly

Kevin Jolly is a formally educated data scientist with a master's degree in data science from the prestigious King's College London. Kevin works as a statistical analyst with a digital healthcare start-up, Connido Limited, in London, where he is primarily involved in leading the data science projects that the company undertakes. He has built machine learning pipelines for small and big data, with a focus on scaling such pipelines into production for the products that the company has built. Kevin is also the author of a book titled Hands-On Data Visualization with Bokeh, published by Packt. He is the editor-in-chief of Linear, a weekly online publication on data science software and products.
Predicting Categories with Logistic Regression

The logistic regression algorithm is one of the most interpretable algorithms in the world of machine learning, and although the word "regression" implies predicting a numerical outcome, the logistic regression algorithm is used to predict categories and solve classification machine learning problems.

In this chapter, you will learn about the following:

  • How the logistic regression algorithm works mathematically
  • Implementing and evaluating your first logistic regression algorithm with scikit-learn
  • Fine-tuning the hyperparameters using GridSearchCV
  • Scaling your data for a potential improvement in accuracy
  • Interpreting the results of the model

Logistic regression has a wide range of applications, especially in the field of finance, where building interpretable machine learning models is key in convincing both investors...

Technical requirements

Understanding logistic regression mathematically

As the name implies, logistic regression is fundamentally derived from the linear regression algorithm. The linear regression algorithm will be discussed in depth in the upcoming chapters. For now, let's consider a hypothetical case in which we want to predict the probability that a particular loan will default based on the loan's interest rate. Using linear regression, the following equation can be constructed:

Default = (Interest Rate × x) + c

In the preceding equation, c is the intercept and x is the coefficient that the model learns; both have numeric values. For the purpose of this example, let's assume c is 5 and x is -0.2. The equation now becomes this:

Default = (Interest Rate × -0.2) + 5

The equation can be represented in a...
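The mapping from the linear output to a probability can be sketched as follows. This is a minimal illustration, assuming the hypothetical values c = 5 and x = -0.2 from above; the linear output is passed through the logistic (sigmoid) function, which squashes any real number into the range between 0 and 1:

```python
import numpy as np

def default_probability(interest_rate, coef=-0.2, intercept=5):
    """Map the linear output through the logistic (sigmoid)
    function to obtain a probability between 0 and 1."""
    linear_output = interest_rate * coef + intercept
    return 1 / (1 + np.exp(-linear_output))

# A loan with a 10% interest rate under the assumed coefficients:
# linear output = 10 * -0.2 + 5 = 3, sigmoid(3) ≈ 0.95
print(round(default_probability(10), 3))
```

Whatever the interest rate, the result is always a valid probability, which is exactly why the sigmoid transformation turns a regression equation into a classifier.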

Implementing logistic regression using scikit-learn

In this section, you will learn how you can implement and quickly evaluate a logistic regression model for your dataset. We will be using the same dataset that we have already cleaned and prepared for the purpose of predicting whether a particular transaction was fraudulent. In the previous chapter, we saved this dataset as fraud_prediction.csv. The first step is to load this dataset into your Jupyter Notebook. This can be done by using the following code:

import pandas as pd

# Reading in the dataset

df = pd.read_csv('fraud_prediction.csv')

Splitting the data into training and test sets

The first step to building any machine learning model with scikit-learn is to...
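The split itself can be sketched with scikit-learn's train_test_split function. The tiny DataFrame below is a stand-in for the cleaned fraud dataset (the column names, including the 'isFraud' target, are illustrative assumptions, not taken from the book's data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the cleaned dataset; in the chapter this would be
# pd.read_csv('fraud_prediction.csv') with its own target column
df = pd.DataFrame({
    'amount':        [120.0, 5000.0, 75.5, 980.0, 45.0, 310.0, 8800.0, 62.0],
    'oldbalanceOrg': [500, 9000, 100, 1500, 80, 400, 9500, 90],
    'isFraud':       [0, 1, 0, 0, 0, 0, 1, 0],
})

# Separate the features from the target labels
X = df.drop('isFraud', axis=1).values
y = df['isFraud'].values

# Hold out 25% of the rows for testing; random_state makes the
# split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
```

The held-out test set is never shown to the model during training, so the score computed on it is an honest estimate of performance on unseen transactions.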

Fine-tuning the hyperparameters

From the output of the logistic regression model implemented in the preceding section, it is clear that the model performs slightly better than random guessing. Such a model fails to provide value to us. In order to optimize the model, we are going to optimize the hyperparameters of the logistic regression model by using the GridSearchCV algorithm that we used in the previous chapter.

The hyperparameter that is used by the logistic regression model is known as the inverse regularization strength. This is because we are implementing a type of linear regression known as l1 regression. This type of linear regression will be explained in detail in Chapter 5, Predicting Numeric Outcomes with Linear Regression.

In order to optimize the inverse regularization strength, or C as it is called in short, we use the following code:

#Building the model with L1...
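A sketch of the grid search looks like the following. The synthetic dataset and the candidate values of C here are illustrative, not from the book; the 'liblinear' solver is specified because it is one of the scikit-learn solvers that supports the L1 penalty:

```python
from sklearn import linear_model
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the fraud dataset
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search over candidate values of the inverse regularization
# strength C, using 5-fold cross-validation on the training set
grid = GridSearchCV(
    linear_model.LogisticRegression(penalty='l1', solver='liblinear'),
    param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, round(grid.best_score_, 3))
```

Smaller values of C mean stronger regularization; GridSearchCV simply tries every candidate and keeps the one with the best cross-validated accuracy.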

Scaling the data

Although the model has performed extremely well, scaling the data is still a useful step in building machine learning models with logistic regression, as it standardizes your data across the same range of values. In order to scale your data, we will use the same StandardScaler() function that we used in the previous chapter. This is done by using the following code:

from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

#Setting up the scaling pipeline

pipeline_order = [('scaler', StandardScaler()), ('logistic_reg', linear_model.LogisticRegression(C = 10, penalty = 'l1', solver = 'liblinear'))]

pipeline = Pipeline(pipeline_order)

#Fitting the classifier to the scaled dataset

logistic_regression_scaled = pipeline.fit(X_train, y_train)

#Extracting the score

logistic_regression_scaled.score(X_test, y_test)

The preceding code resulted...

Interpreting the logistic regression model

One of the key benefits of the logistic regression algorithm is that it is highly interpretable. This means that the outcome of the model can be interpreted as a function of the input variables. This allows us to understand how each variable contributes to the eventual outcome of the model.

In the first section, we understood that the logistic regression model consists of coefficients for each variable and an intercept that can be used to explain how the model works. In order to extract the coefficients for each variable in the model, we use the following code:

#Printing out the coefficients of each variable 

print(logistic_regression.coef_)

This prints an array containing one coefficient for each input variable.

The coefficients are in the order in which the variables were in the dataset that was input into the model. In order to extract...
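Pairing each coefficient with its column name makes the output far easier to read. The sketch below uses a synthetic dataset and illustrative feature names (the real names would come from the columns of the fraud DataFrame):

```python
import pandas as pd
from sklearn import linear_model
from sklearn.datasets import make_classification

# Synthetic stand-in for the fraud dataset, with assumed column names
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig']

model = linear_model.LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features); pair each value with its column
# name so the contribution of each variable is visible at a glance
coef_table = pd.Series(model.coef_[0], index=feature_names)
print(coef_table.sort_values())
```

A positive coefficient pushes the prediction toward the positive class as that variable increases, while a negative coefficient pushes it away, which is the basis for interpreting the model.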

Summary

In this chapter, you have learned how the logistic regression model works on a mathematical level. Although simple, the model proves to be formidable in terms of interpretability, which is highly beneficial in the financial industry.

You have also learned how to build and evaluate logistic regression algorithms using scikit-learn, and looked at hyperparameter optimization using the GridSearchCV algorithm. Additionally, you have learned to verify whether the results provided to you by the GridSearchCV algorithm are accurate by plotting the accuracy scores for different values of the hyperparameter.

Finally, you have scaled your data in order to make it standardized and learned how to interpret your model on a mathematical level.

In the next chapter, you will learn how to implement tree-based algorithms, such as decision trees, random forests, and gradient-boosted trees...

