Linear Models – Logistic Regression

In this chapter, we will cover the following recipes:

  • Loading data from the UCI repository
  • Viewing the Pima Indians diabetes dataset with pandas
  • Looking at the UCI Pima Indians dataset web page
  • Machine learning with logistic regression
  • Examining logistic regression errors with a confusion matrix
  • Varying the classification threshold in logistic regression
  • Receiver operating characteristic – ROC analysis
  • Plotting an ROC curve without context
  • Putting it all together – UCI breast cancer dataset

Introduction

Linear regression is a very old method and part of traditional statistics. In machine learning, linear regression is trained on a training set and evaluated on a separate testing set, so it can be compared with other models and algorithms by utilizing cross-validation. Traditional linear regression, by contrast, trains and tests on the whole dataset. This is still a common practice, possibly because linear regression tends to underfit rather than overfit.
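
As a minimal sketch of that machine learning workflow (the synthetic data and parameter choices below are purely illustrative, not taken from the book):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Purely illustrative synthetic data: y is a noisy linear function of X
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X.dot(np.array([1.5, -2.0, 0.5])) + 0.1 * rng.randn(100)

# Machine learning style: hold out a test set for the final evaluation...
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lin = LinearRegression().fit(X_train, y_train)
print(lin.score(X_test, y_test))   # R^2 on unseen data

# ...and compare candidate models with cross-validation on the training set
print(cross_val_score(LinearRegression(), X_train, y_train, cv=5).mean())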

Using linear methods for classification – logistic regression

As seen in Chapter 1, High-Performance Machine Learning – NumPy, logistic regression is a classification method. In some contexts, it behaves like a regressor, as it computes a real-valued probability of a class before assigning a categorical...
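
As a quick illustration of that probability-before-label idea (a minimal sketch of my own, not the book's code), the logistic (sigmoid) function squashes a linear score into a probability between 0 and 1, which is then thresholded into a class label:

import numpy as np

def sigmoid(z):
    # Squash a real-valued linear score into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

score = 0.8                        # an illustrative linear score w.x + b
probability = sigmoid(score)       # roughly 0.69
label = int(probability >= 0.5)    # thresholding at 0.5 assigns class 1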

Loading data from the UCI repository

The first dataset we will load is the Pima Indians diabetes dataset. This will require access to the internet. The dataset is available thanks to Sigillito, V. (1990), UCI machine learning repository (https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data), Laurel, MD: Johns Hopkins University, Applied Physics Laboratory.

If you are an open source veteran, the first question on your mind will be: what is the license or permission for using this database? This is a very important issue. The UCI repository has a use policy that requires a citation of the database whenever it is used. We are allowed to use it, but we must give proper credit for their great help and provide a citation.

How to do it...

...
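
A rough sketch of how such a download could look with pandas (most of the column names below match variables used later in this chapter; pedigree_func, age, and target are assumptions of mine, and the book's own names may differ):

import pandas as pd

data_web_address = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

# Assumed column names, based on the variables referenced later in this chapter
column_names = ['pregnancy_x', 'plasma_con', 'blood_pressure', 'skin_mm',
                'insulin', 'bmi', 'pedigree_func', 'age', 'target']

# The raw file has no header row, so supply the names explicitly
all_data = pd.read_csv(data_web_address, names=column_names)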

Viewing the Pima Indians diabetes dataset with pandas

How to do it...

  1. You can view the data in various ways. View the top of the dataframe:
all_data.head()
  2. Nothing seems amiss here, except possibly an insulin level of zero. Is this possible? What about the skin_mm variable? Can that be zero? Make a note about it as a comment in your IPython session:
#Is an insulin level of 0 possible? Is a skin_mm of 0 possible?
  3. Get a rough overview of the dataframe with the describe() method:
all_data.describe()
  4. Make a note again in your notebook about the additional zeros:
#The features plasma_con, blood_pressure, skin_mm, insulin, and bmi have 0s as values. These values could be physically impossible.
  5. Draw a histogram of the pregnancy_x variable (a sketch follows this list)...
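
A minimal sketch of that histogram step, assuming the all_data dataframe loaded earlier and matplotlib for plotting (the bin count is arbitrary):

import matplotlib.pyplot as plt

# Histogram of the number of pregnancies per patient
all_data.pregnancy_x.hist(bins=20)
plt.xlabel('pregnancy_x')
plt.ylabel('count')
plt.show()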

Looking at the UCI Pima Indians dataset web page

We did some exploratory analysis to get a rough understanding of the data. Now we will read the UCI Pima Indians dataset documentation.

How to do it...

View the citation policy

Machine learning with logistic regression

You are familiar with the steps of training and testing a classifier. With logistic regression, we will do the following:

  • Load data into feature and target arrays, X and y, respectively
  • Split the data into training and testing sets
  • Train the logistic regression classifier on the training set
  • Test the performance of the classifier on the test set

Getting ready

Define X, y – the feature and target arrays

Let's start predicting with scikit-learn's logistic regression. Perform the necessary imports and set the input...
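
A minimal sketch of those four steps, assuming the all_data dataframe from the earlier recipes (the column name 'target' and the split parameters are assumptions of mine; the book's own values may differ):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Feature and target arrays; 'target' is an assumed name for the outcome column
X = all_data.drop('target', axis=1).values
y = all_data['target'].values

# Hold out a test set; stratify keeps the class balance similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7)

# Train the classifier and evaluate it on unseen data
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(lr.score(X_test, y_test))    # accuracy on the test set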

Examining logistic regression errors with a confusion matrix

Getting ready

Import and view the confusion matrix for the logistic regression we constructed:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred, labels=[1, 0])

array([[27, 27],
       [12, 88]])

I passed three arguments to the confusion_matrix function:

  • y_test: The test target set
  • y_pred: Our logistic regression predictions
  • labels: The ordering of the class labels, with the positive class listed first

The labels = [1,0] argument means that the positive class is 1 and the negative class is 0. In the medical context, we found while exploring the Pima Indians diabetes dataset that class 1 means the patient tested positive for diabetes.

Here is the confusion matrix, again in pandas dataframe form:

...
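
One way such a labeled view can be built, reusing the confusion_matrix call above (the row and column names here are illustrative, not the book's):

import pandas as pd

cm = confusion_matrix(y_test, y_pred, labels=[1, 0])

# Wrap the raw array with readable row and column labels
cm_df = pd.DataFrame(cm,
                     index=['Actual: 1 (diabetes)', 'Actual: 0 (no diabetes)'],
                     columns=['Predicted: 1', 'Predicted: 0'])
print(cm_df)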

Varying the classification threshold in logistic regression

Getting ready

We will use the fact that underlying the logistic regression classification there is a regression, which produces probabilities, to minimize the number of times people are sent home as not having diabetes when they actually do. Do so by calling the predict_proba() method of the estimator:

y_pred_proba = lr.predict_proba(X_test)

This yields an array of probabilities. View the array:

y_pred_proba

array([[ 0.87110309, 0.12889691],
       [ 0.83996356, 0.16003644],
       [ 0.81821721, 0.18178279],
       [ 0.73973464, 0.26026536],
       [ 0.80392034, 0.19607966], ...

In the first row, a probability of about 0.87 is assigned to class 0 and a probability of about 0.13 is assigned to class 1. Note that...
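
To vary the threshold itself, compare the class-1 probability column against a cutoff other than the default 0.5; lowering the cutoff catches more true diabetics at the price of more false alarms (the 0.3 value below is purely illustrative):

# Probability of class 1 (diabetes) for every test example
proba_class1 = y_pred_proba[:, 1]

# The default predict() behaviour corresponds to a 0.5 cutoff
y_pred_default = (proba_class1 >= 0.5).astype(int)

# Lowering the cutoff flags more patients as positive, so fewer diabetics are missed
y_pred_lower = (proba_class1 >= 0.3).astype(int)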

Receiver operating characteristic – ROC analysis

Along the same lines as examining NPV, there are standard measures that are computed from the cells of a confusion matrix.

Getting ready

Sensitivity

Sensitivity, like NPV in the previous section, is a mathematical function of the confusion matrix cells. Sensitivity is the proportion of people with the condition, diabetes in this case, who are correctly labeled by the test as having it:

Mathematically, it is the number of patients correctly labeled as having the condition (TP) divided by the total number of...
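
In other words, sensitivity = TP / (TP + FN). A small sketch computing it from the cells of the confusion matrix shown earlier:

# With labels=[1, 0], the earlier confusion matrix is laid out as
# [[TP, FN],
#  [FP, TN]]
TP, FN = 27, 27
FP, TN = 12, 88

sensitivity = TP / (TP + FN)    # 0.5: only half of the diabetics are caught
print(sensitivity)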

Plotting an ROC curve without context

How to do it...

An ROC curve is a diagnostic tool for any classifier without any context. No context means that we do not yet know which error type (FP or FN) is less desirable. Let us plot it right away using a vector of probabilities, y_pred_proba[:,1]:

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

fpr, tpr, ths = roc_curve(y_test, y_pred_proba[:, 1])
plt.plot(fpr, tpr)

The ROC is a plot of the FPR (false alarms) on the x axis and the TPR (finding everyone with the condition who really has it) on the y axis. Without context, it is a tool for measuring classifier performance.
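
An optional, small addition to the plot above that makes those axes explicit (the dashed diagonal is the chance line a random classifier would follow):

# Label the axes and add the y = x chance line for reference
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')
plt.show()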

Perfect classifier

...

Putting it all together – UCI breast cancer dataset

How to do it...

The dataset is provided thanks to Street, N. (1990), UCI machine learning repository (https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data), Madison, WI: University of Wisconsin, Computer Sciences Department:

  1. After reading the citation/license information, load the dataset from UCI:
import numpy as np
import pandas as pd
data_web_address = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
column_names = ['radius',
'texture',
'perimeter',
...