Chapter 5. Making Decisions Black and White with Logistic Regression
In the last chapter, we used regression to predict values over a continuous range. In this chapter, we will explore the tuning of a regression model that predicts a binary classification. You are probably already pretty familiar with this method, so we'll spend just a little time introducing the aspects that we'll be leveraging.
The most important thing to know about logistic regression is that its form is very different from linear regression. Likewise, interpreting the results is also different and can be quite confusing. A standard N-variable logistic regression model has the following form:
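In standard notation, the model relates the log-odds of the outcome to a linear combination of the predictors:

$$\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N$$

Here, $p$ is the probability that the outcome equals 1, and the left-hand side is the log-odds (or logit) of that probability.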
While in linear regression each beta coefficient represents the change in the response for every unit change in the associated x variable, in logistic regression the betas represent the change in log-odds for every unit increase in the associated x variable. As a result of this difference in the model, the way in which we generate data will need to be changed...
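To make the log-odds interpretation concrete, here is a small worked example (the coefficient value and starting probability are hypothetical, chosen just for illustration):

```python
import math

# Hypothetical coefficient: beta_1 = 0.5 means a one-unit increase in x_1
# adds 0.5 to the log-odds, i.e. multiplies the odds by e^0.5.
beta_1 = 0.5
odds_ratio = math.exp(beta_1)

# Starting from p = 0.5 (odds of 1:1), a one-unit increase in x_1 gives:
old_odds = 0.5 / (1 - 0.5)
new_odds = old_odds * odds_ratio
new_p = new_odds / (1 + new_odds)

print(round(odds_ratio, 3))  # 1.649
print(round(new_p, 3))       # 0.622
```

So a coefficient of 0.5 doesn't add 0.5 to the probability; it multiplies the odds by about 1.65, which here moves the probability from 0.50 to roughly 0.62.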
A critical aspect of test driving our process is being in control. In the last chapter, we fitted a model to a pregenerated set of test data, and tried to guess what the beta coefficients were. In this chapter, we'll start generating a very simple dataset, and then we'll compute the estimates for the coefficients that we'll use. This will help us understand how this all comes together so that we can be sure that we're driving our code in the right direction.
Here is how we can generate some simple data:
We will sample the data from a binomial distribution, because its values stay between 0 and 1.
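A minimal sketch of such a generator (the intercept, slope, and variable names here are my own assumptions, not the book's exact listing): we pick known beta values, turn the resulting log-odds into probabilities with the logistic function, and then draw 0/1 outcomes from a binomial distribution.

```python
import numpy as np

np.random.seed(42)  # keep the run reproducible

# Known "true" coefficients that we'll later try to recover by fitting.
true_intercept = -1.0
true_beta = 2.0

x = np.random.uniform(-3, 3, size=1000)
log_odds = true_intercept + true_beta * x
p = 1.0 / (1.0 + np.exp(-log_odds))   # logistic function: log-odds -> probability
y = np.random.binomial(n=1, p=p)      # 0/1 draws, one per observation

print(y[:10], y.mean())
```

Because we control `true_intercept` and `true_beta`, we can later check how close the fitted coefficients come to the values we baked in.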
So, we know how to create a model that is "adequate", but what does this really mean? How can we differentiate whether one "adequate" model is better than another? A common approach is to compare the ROC curves. This one is generated from the simple model that we just created:
You're probably familiar with ROC curves. They show us what kind of true positive rate we can achieve by allowing a given false positive rate. The basic takeaway is that we want the curve to get as close to the upper-left corner as possible. In case you haven't used these visualizations before, the reason for this is that the further the curve is pulled up and to the left, the fewer false positives we incur for every true positive. It maps very closely to the concept of an error rate.
We have a visualization, which is great, but we can't automatically test it. We need to find some way to quantify this phenomenon. There is a simple, pretty straightforward way: the Area Under the Curve (AUC).
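A minimal sketch of computing AUC directly from its probabilistic interpretation (the function name and the toy scores below are mine, not the book's): AUC is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one.

```python
import numpy as np

def area_under_roc_curve(y_true, scores):
    """AUC as the chance a random positive example outranks a
    random negative one (ties count as half a win)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

# A perfect scorer gets 1.0; random guessing hovers around 0.5.
y = np.array([0, 0, 1, 1])
perfect = np.array([0.1, 0.2, 0.8, 0.9])
print(area_under_roc_curve(y, perfect))  # 1.0
```

Because AUC reduces the whole curve to a single number between 0 and 1, it is exactly the kind of quantity we can assert against in a test.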
Generating a more complex example
Up until now, we've been looking at a very simple set of data. Next, we'll be generating a much more complicated example. To model it, we'll be applying the techniques from the last chapter to build a solid model using TDD.
Unlike last time, let's build the data generation code first, and use it to help us understand our model-building process more deeply. Here is the data generator that we'll use for the remainder of this chapter:
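A sketch of what such a generator might look like (the names `variable_a` through `variable_d` follow the text later in this chapter; the coefficient values are my own assumptions): every variable except `variable_d` influences the outcome, so `variable_d` is pure noise.

```python
import numpy as np

def generate_data(n=10000, seed=7):
    """Generate a four-variable logistic dataset with known betas.
    variable_d gets no coefficient, so it has no effect on the outcome."""
    rng = np.random.default_rng(seed)
    variable_a = rng.normal(0, 1, n)
    variable_b = rng.normal(0, 1, n)
    variable_c = rng.normal(0, 1, n)
    variable_d = rng.normal(0, 1, n)  # pure noise: absent from log_odds below

    log_odds = 0.5 + 1.5 * variable_a - 2.0 * variable_b + 0.75 * variable_c
    p = 1.0 / (1.0 + np.exp(-log_odds))
    y = rng.binomial(1, p)
    return variable_a, variable_b, variable_c, variable_d, y
```

Keeping the true coefficients in one place means the tests can check whether the fitting process recovers them, and whether it correctly learns that `variable_d` contributes nothing.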
To start, we must create the framework for scoring our model in a test. It will look like the following:
The previous code also includes a first stab at a model. Because we generated the data, we know that variable_d is completely unhelpful, but it makes this a bit more of an interesting exploration.
When we run the previous code, the test fails, as expected. I have the test set up to give the full statistical summary, as...
In this chapter, we reviewed logistic regression and different measures of quality. We figured out how to quantify typically qualitative measures of quality, and then we used them to drive our model-building process, test first.
In the next chapter, we'll continue exploring classification by looking at one of the most straightforward techniques we'll learn about: Naïve Bayes classification.