
You're reading from Test Driven Machine Learning

Product type Book

Published in Nov 2015

ISBN-13 9781784399085

Pages 190 pages

Edition 1st Edition

Languages

Python

Concepts

Machine Learning

Table of Contents (16) Chapters

Test-Driven Machine Learning

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

1. Introducing Test-Driven Machine Learning

2. Perceptively Testing a Perceptron

3. Exploring the Unknown with Multi-armed Bandits

4. Predicting Values with Regression

5. Making Decisions Black and White with Logistic Regression

6. You're So Naïve, Bayes

7. Optimizing by Choosing a New Algorithm

8. Exploring scikit-learn Test First

9. Bringing It All Together

Index

Chapter 4. Predicting Values with Regression

In this chapter, we'll cover multiple linear regression and how to approach it from a TDD perspective. Unlike the previous chapters, where we developed the algorithm itself using TDD, here we will use a third-party library for the algorithm and apply TDD to building our model. To do this, we'll need a way to quantify model quality as well as violations of the model's assumptions. We won't have the liberty of checking a data visualization to confirm that our model meets our criteria.

We will also be using the Python packages statsmodels and pandas, so install them before moving forward in the chapter, using the following commands:

> pip install pandas
> pip install statsmodels

To start off, let's refresh ourselves on multiple regression and the key topics we'll need to drive us toward an excellent model.

Refresher on advanced regression

Before we get down to brass tacks on how we will tackle building regression models using TDD, we need to refresh ourselves on some of the finer points. Multiple regression comes with a set of assumptions and several measures of model quality. A good amount of this information can be found in A Second Course in Statistics: Regression Analysis, Mendenhall & Sincich, Pearson (2011).
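As one illustration of what a numeric measure of model quality can look like, here is a minimal sketch that computes R-squared and adjusted R-squared by hand with NumPy. The choice of these two measures, the synthetic data, and the helper names `r_squared` and `adjusted_r_squared` are assumptions made for this example and are not taken from the chapter.

```python
import numpy as np

def r_squared(y, y_hat):
    """Fraction of the variance in y explained by the predictions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Hypothetical data: y depends on x1 only; x2 is pure noise.
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(-10, 10, n)
x2 = rng.uniform(-10, 10, n)
y = 2.0 + 3.0 * x1 + rng.normal(0, 1, n)

# Ordinary least squares fit with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
coefficients, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coefficients

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(y, y_hat, n_predictors=2)
```

Because the adjusted value never exceeds the plain one, it is a natural candidate for a test threshold when extra variables are being added to a model.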

Regression assumptions

When many people are introduced to regression, their main takeaway is that it is how we draw a line through our data to predict new values. To be fair, that's pretty accurate, but there's a fair amount of nuance here that we need to discuss explicitly.

First, let's discuss the standard multiple regression model form. It looks like this:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$

Here y is our dependent variable. Every x variable is an independent variable. y being a dependent variable means that it depends on the values of the independent variables and the error term $\epsilon$. The error term is...
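To make the model form concrete, the following sketch simulates data from a two-variable version of it. The coefficient values, variable ranges, and noise level are all hypothetical, and the error term is assumed to be normally distributed noise centered on zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Two hypothetical independent variables.
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(-5, 5, n)

# Hypothetical true coefficients: intercept, then one slope per variable.
beta_0, beta_1, beta_2 = 4.0, 1.5, -2.0

# The error term: random noise centered on zero (an assumption of this sketch).
error = rng.normal(0, 3, n)

# The dependent variable is the linear combination plus the error term.
y = beta_0 + beta_1 * x1 + beta_2 * x2 + error

# Subtracting the deterministic part from y leaves exactly the error term.
recovered_error = y - (beta_0 + beta_1 * x1 + beta_2 * x2)
```

With real data we never see the true coefficients, so the residuals of a fitted model stand in for the error term when we check assumptions.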

Generating our own data

When exploring machine learning algorithms, it can be quite helpful to generate your own data. This gives you complete control and allows for the most exploration of a new technique you might try. It also lets you build trust that your model is working as planned given your assumptions. You've seen this several times already in this book, so it's nothing new. As we develop a linear regression model, however, it will be even more instructive, since I'm going to work backward through the example.

I will generate data first but show you how I generated the data at the end of the chapter. The goal here is to give you the opportunity to work through building a complex model from a statistical test-first perspective and ultimately show how the generating function was defined and how that affected our work.

The generated data is in the GitHub repo for this book (https://github.com/jcbozonier/Machine-Learning-Test-by-Test) so that you can follow along with the...

Building the foundations of our model

Let's start by pulling the data into Python and transforming it into a form that we can use. To do this, we will need two additional libraries. We will use pandas to read from our generated CSV and statsmodels to run our statistical procedures. Both libraries are powerful and full of features, and we will only touch on a few of them, so feel free to explore them further later.

To start off, let's make a test that will run a simple regression over one of the variables and show us the output. That should give us a good place to start. I'm keeping this in a unit testing structure because I know I want to test this code and just want to explore a bit to know exactly what to test for. You could do this first step in a one-off file, but I'm choosing to start with a test so I can build from it:

import pandas
import statsmodels.formula.api as sm
import nose.tools as nt

def vanilla_model_test():
  df = pandas.read_csv('./generated_data.csv')
  model_fit...
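Separately from the author's test above, here is a minimal sketch of how a single-variable regression can be fitted and asserted on with the statsmodels formula API. It uses synthetic data from a hypothetical helper, `make_hypothetical_data`, in place of the book's CSV, and the thresholds are assumptions chosen for this example.

```python
import numpy as np
import pandas
import statsmodels.formula.api as smf

def make_hypothetical_data(n=200, seed=1):
    """Stand-in for a CSV: one informative column and one noise column."""
    rng = np.random.default_rng(seed)
    df = pandas.DataFrame({
        'ind_var_a': rng.uniform(-100, 100, n),
        'ind_var_b': rng.uniform(-5, 5, n),
    })
    df['dependent_var'] = 10 + 3 * df['ind_var_a'] + rng.normal(0, 5, n)
    return df

def simple_model_sketch_test():
    df = make_hypothetical_data()
    # Regress the dependent variable on a single independent variable.
    model_fit = smf.ols('dependent_var ~ ind_var_a', data=df).fit()
    # The fitted slope should be close to the value used to generate the data.
    assert abs(model_fit.params['ind_var_a'] - 3) < 0.1
    # The overall F-test should reject the "no relationship" hypothesis.
    assert model_fit.f_pvalue < 0.05
    return model_fit

model_fit = simple_model_sketch_test()
```

Calling `print(model_fit.summary())` on the returned object shows the full regression table, which is useful while deciding what a test should assert.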

Cross-validating our model

Now, before we cheat and look at our answer key, let's see how well this solution does at predicting data it hasn't seen. To do this, I wrote the following fairly large test:

def final_model_cross_validation_test():
  df = pandas.read_csv('./generated_data.csv')
  df['predicted_dependent_var'] = 25.6266 \
                                + 2.7083*df['ind_var_a'] \
                                - 1.5527*df['ind_var_b'] \
                                - 0.3917*df['ind_var_c'] \
                                - 0.2006*df['ind_var_e'] \
                                + 5.6450*df['ind_var_b'] * df['ind_var_c']
  df['diff'] = (df['dependent_var'] - df['predicted_dependent_var']).abs()
  print(df['diff'])
  print('===========')
  cv_df = pandas.read_csv('./generated_data_cv.csv')
  cv_df['predicted_dependent_var'] = 25.6266 \
                                + 2.7083*cv_df['ind_var_a'] \
                                - 1.5527*cv_df['ind_var_b'] \
                  ...
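The idea behind this kind of check can be sketched in a compact form: fit on one dataset, then compare the mean absolute error on that data with the error on data the model has not seen. The synthetic data, the single-variable model, and the 1.5x tolerance below are assumptions for illustration and are not the values used in the chapter.

```python
import numpy as np
import pandas

def make_hypothetical_data(n, rng):
    """Generate rows from a known linear relationship plus noise."""
    df = pandas.DataFrame({'ind_var_a': rng.uniform(-100, 100, n)})
    df['dependent_var'] = 5 + 2 * df['ind_var_a'] + rng.normal(0, 10, n)
    return df

def mean_absolute_error(df, intercept, slope):
    """Average absolute gap between observed and predicted values."""
    predicted = intercept + slope * df['ind_var_a']
    return (df['dependent_var'] - predicted).abs().mean()

rng = np.random.default_rng(7)
training_df = make_hypothetical_data(200, rng)
holdout_df = make_hypothetical_data(200, rng)

# Fit a straight line on the training data only.
slope, intercept = np.polyfit(training_df['ind_var_a'],
                              training_df['dependent_var'], 1)

training_error = mean_absolute_error(training_df, intercept, slope)
holdout_error = mean_absolute_error(holdout_df, intercept, slope)

def cross_validation_sketch_test():
    # A model that generalizes should not do much worse on unseen data.
    assert holdout_error < 1.5 * training_error

cross_validation_sketch_test()
```

Asserting on a ratio rather than an absolute error keeps the test meaningful even if the scale of the dependent variable changes.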

Generating data

Now that we've gone through the process of searching for the right model, let's talk about what the model's true parameters were and how they line up with the parameters our regression generated.

This is the code that was used to generate the data:

import numpy as np

variable_a = np.random.uniform(-100, 100, 30)
variable_b = np.random.uniform(-5, 5, 30)
variable_c = np.random.uniform(0, 37, 30)
variable_d = np.random.uniform(121, 213, 30)
variable_e = np.random.uniform(-1000, 100, 30)
variable_f = np.random.uniform(-100, 100, 30)
variable_g = np.random.uniform(-25, 75, 30)
variable_h = np.random.uniform(1, 27, 30)

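# Materialize the zip so the rows can be iterated more than once (needed on Python 3).
independent_variables = list(zip(variable_a, variable_b, variable_c, variable_d, variable_e, variable_f, variable_g, variable_h))
# Only variables a, b, c, and e (indices 0, 1, 2, 4) influence the dependent variable.
dependent_variables = [3*x[0] - 2*x[1] - .25*x[4] + 5.75*x[1]*x[2] + np.random.normal(0, 50) for x in independent_variables]

# Append each dependent value to its row of independent values.
full_dataset = [x[0] + (x[1],) for x in zip(independent_variables, dependent_variables)]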

import csv
with open('generated_data...
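To see how the fitted parameters line up with the true ones, the sketch below places the coefficients from the generating expression next to the coefficients hard-coded in the cross-validation test. It assumes that the CSV columns `ind_var_a`, `ind_var_b`, and so on correspond to `variable_a`, `variable_b`, and so on, which the truncated file-writing code does not confirm.

```python
# Coefficients used in the data-generating code (terms not listed there are 0).
true_coefficients = {
    'intercept': 0.0,
    'ind_var_a': 3.0,
    'ind_var_b': -2.0,
    'ind_var_c': 0.0,
    'ind_var_e': -0.25,
    'ind_var_b * ind_var_c': 5.75,
}

# Coefficients hard-coded in the cross-validation test.
fitted_coefficients = {
    'intercept': 25.6266,
    'ind_var_a': 2.7083,
    'ind_var_b': -1.5527,
    'ind_var_c': -0.3917,
    'ind_var_e': -0.2006,
    'ind_var_b * ind_var_c': 5.6450,
}

# Signed gap between each fitted coefficient and its true value.
differences = {
    term: fitted_coefficients[term] - true_coefficients[term]
    for term in true_coefficients
}

for term, difference in differences.items():
    print('{:<24} true={:>8.4f} fitted={:>8.4f} diff={:>8.4f}'.format(
        term, true_coefficients[term], fitted_coefficients[term], difference))
```

The slopes land close to their true values, while the intercept shows the largest gap, which is plausible given only 30 rows and noise with a standard deviation of 50.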

Summary

In this chapter, we stepped through what it takes to drive building a multiple regression model using the same unit test techniques we've been using to develop code.

In the next chapter, we will continue exploring regression but will move on to logistic regression. Rather than predicting values, we'll use logistic regression to classify data into one group or another.