Chapter 5. Making Decisions Black and White with Logistic Regression
In the last chapter, we used regression to predict values over a continuous range. In this chapter, we will explore the tuning of a regression model that predicts a binary classification. You are probably already pretty familiar with this method, so we'll spend just a little time introducing the aspects that we'll be leveraging.
The most important thing to know about logistic regression is that its form is very different from linear regression. Likewise, interpreting the results is also different and can be quite confusing. A standard N-variable logistic regression model has the following form:
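In standard notation, the model relates the log-odds of the outcome to a linear combination of the predictors:

$$\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N$$

Here, $p$ is the probability that the outcome equals 1, and the left-hand side is the log-odds (or logit) of that probability.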
While in linear regression each beta coefficient represents the change in the response for every unit change in the associated x variable, in logistic regression the betas represent the change in log-odds for every unit increase in the associated x variable. As a result of this difference in the model, the way in which we generate data will need to be changed...
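To make the log-odds interpretation concrete, here is a small worked example (the coefficient value and starting probability are hypothetical, chosen just for illustration):

```python
import math

# Hypothetical coefficient: beta_1 = 0.5 means a one-unit increase in x_1
# adds 0.5 to the log-odds, i.e. multiplies the odds by e^0.5.
beta_1 = 0.5
odds_ratio = math.exp(beta_1)

# Starting from p = 0.5 (odds of 1:1), a one-unit increase in x_1 gives:
old_odds = 0.5 / (1 - 0.5)
new_odds = old_odds * odds_ratio
new_p = new_odds / (1 + new_odds)

print(round(odds_ratio, 3))  # 1.649
print(round(new_p, 3))       # 0.622
```

So a coefficient of 0.5 doesn't add 0.5 to the probability; it multiplies the odds by about 1.65, which here moves the probability from 0.50 to roughly 0.62.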
A critical aspect of test driving our process is being in control. In the last chapter, we fitted a model to a pregenerated set of test data, and tried to guess what the beta coefficients were. In this chapter, we'll start generating a very simple dataset, and then we'll compute the estimates for the coefficients that we'll use. This will help us understand how this all comes together so that we can be sure that we're driving our code in the right direction.
Here is how we can generate some simple data:
We will sample the data from a binomial distribution, because its values stay between 0 and 1.
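A minimal sketch of such a generator (the intercept, slope, and variable names here are my own assumptions, not the book's exact listing): we pick known beta values, turn the resulting log-odds into probabilities with the logistic function, and then draw 0/1 outcomes from a binomial distribution.

```python
import numpy as np

np.random.seed(42)  # keep the run reproducible

# Known "true" coefficients that we'll later try to recover by fitting.
true_intercept = -1.0
true_beta = 2.0

x = np.random.uniform(-3, 3, size=1000)
log_odds = true_intercept + true_beta * x
p = 1.0 / (1.0 + np.exp(-log_odds))   # logistic function: log-odds -> probability
y = np.random.binomial(n=1, p=p)      # 0/1 draws, one per observation

print(y[:10], y.mean())
```

Because we control `true_intercept` and `true_beta`, we can later check how close the fitted coefficients come to the values we baked in.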
So, we know how to create a model that is "adequate", but what does this really mean? How can we differentiate whether one "adequate" model is better than another? A common approach is to compare the ROC curves. This one is generated from the simple model that we just created:
You're probably familiar with ROC curves. They show us what kind of true positive rate we can achieve by allowing a given false positive rate. The basic takeaway is that we want the curve to get as close to the upper-left corner as possible. In case you haven't used these visualizations before, the reason for this is that the further the curve is pulled up and to the left, the fewer false positives we incur for every true positive. It maps very closely to the concept of an error rate.
We have a visualization, which is great, but we can't automatically test it. We need to find some way to quantify this phenomenon. There is a simple, pretty straightforward way: the Area Under the Curve (AUC).
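A minimal sketch of computing AUC directly from its probabilistic interpretation (the function name and the toy scores below are mine, not the book's): AUC is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one.

```python
import numpy as np

def area_under_roc_curve(y_true, scores):
    """AUC as the chance a random positive example outranks a
    random negative one (ties count as half a win)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

# A perfect scorer gets 1.0; random guessing hovers around 0.5.
y = np.array([0, 0, 1, 1])
perfect = np.array([0.1, 0.2, 0.8, 0.9])
print(area_under_roc_curve(y, perfect))  # 1.0
```

Because AUC reduces the whole curve to a single number between 0 and 1, it is exactly the kind of quantity we can assert against in a test.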
Generating a more complex example
Up until now, we've been looking at a very simple set of data. Next, we'll be generating a much more complicated example. To model it, we'll be applying the techniques from the last chapter to build a solid model using TDD.
Unlike last time, let's build the data generation code first, and use it to help us understand our model-building process more deeply. Here is the data generator that we'll use for the remainder of this chapter:
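A sketch of what such a generator might look like (the names `variable_a` through `variable_d` follow the text later in this chapter; the coefficient values are my own assumptions): every variable except `variable_d` influences the outcome, so `variable_d` is pure noise.

```python
import numpy as np

def generate_data(n=10000, seed=7):
    """Generate a four-variable logistic dataset with known betas.
    variable_d gets no coefficient, so it has no effect on the outcome."""
    rng = np.random.default_rng(seed)
    variable_a = rng.normal(0, 1, n)
    variable_b = rng.normal(0, 1, n)
    variable_c = rng.normal(0, 1, n)
    variable_d = rng.normal(0, 1, n)  # pure noise: absent from log_odds below

    log_odds = 0.5 + 1.5 * variable_a - 2.0 * variable_b + 0.75 * variable_c
    p = 1.0 / (1.0 + np.exp(-log_odds))
    y = rng.binomial(1, p)
    return variable_a, variable_b, variable_c, variable_d, y
```

Keeping the true coefficients in one place means the tests can check whether the fitting process recovers them, and whether it correctly learns that `variable_d` contributes nothing.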
To start, we must create the framework for scoring our model in a test. It will look like the following:
The previous code also includes a first stab at a model. Because we generated the data, we know that variable_d is completely unhelpful, but it makes this a bit more of an interesting exploration.
When we run the previous code, the test fails, as expected. I have the test set up to give the full statistical summary, as...
In this chapter, we reviewed logistic regression and different measures of quality. We figured out how to quantify typically qualitative measures of quality, and then we used them to drive our model-building process, test first.
In the next chapter, we'll continue exploring classification by looking at one of the most straightforward techniques we'll learn about: Naïve Bayes classification.