Chapter 6. You're So Naïve, Bayes
We've all seen examples of Naïve Bayes classifiers used to classify text; the applications include spam detection, sentiment analysis, and more. In this chapter, we're going to take a road less traveled: we'll build a Naïve Bayes classifier that can take in continuous inputs and classify them. Specifically, we'll build a Gaussian Naïve Bayes classifier that predicts which state a person is from, based on the person's height, weight, and BMI.
This chapter will work a bit differently from the previous ones. Here, we'll develop an N-class Gaussian Naïve Bayes classifier to fit our use case (the data at hand). In the next chapter, we'll pull in some of this data to train with, and then we'll analyze the quality of our model to see how we did. In the previous chapters, we used generated data so that we could make sure that the classifiers we built were operating according to their assumptions. In this chapter, we'll spend...
Gaussian classification by hand
Since the Gaussian Naïve Bayes classifier is less common, let's discuss it a bit more before diving in. The Gaussian Naïve Bayes algorithm takes in continuous values and assumes that they are all independent and that each variable follows a Gaussian (or Normal) distribution. It may not be obvious how a probability follows from this, so let's look at a concrete example.
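Concretely, for each class we estimate a mean and variance per feature, score an observation with the Normal density N(x; μ, σ²) = exp(−(x − μ)²/2σ²)/√(2πσ²) for each feature, and multiply the per-feature densities together (that product is where the "naïve" independence assumption comes in). A minimal sketch of that calculation follows; the function names here are my own, not the chapter's code:

```python
import math

def gaussian_pdf(x, mean, variance):
    # Normal density N(x; mean, variance), evaluated at x
    return (math.exp(-(x - mean) ** 2 / (2.0 * variance))
            / math.sqrt(2.0 * math.pi * variance))

def class_score(observation, prior, means, variances):
    # Naive Bayes score for one class: the class prior times the
    # product of per-feature densities. The features are assumed
    # to be independent given the class.
    score = prior
    for x, mean, variance in zip(observation, means, variances):
        score *= gaussian_pdf(x, mean, variance)
    return score
```

To classify, we compute this score once per class and pick the class with the largest score; since we only compare scores, we can skip the usual normalizing denominator of Bayes' theorem.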
Let's say that I give you five weights from the female test subjects and five weights from the male test subjects. Next, I give you a weight from a test subject of unknown gender, and have you guess whether it came from a man or a woman. Using a Gaussian classifier, we can approach this problem by first defining an underlying Gaussian model for both the female and the male observations (two models in total). A Gaussian model is specified using a mean and a variance. Let's step through this with some numbers.
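The whole procedure fits in a few lines. The weights below are made-up, illustrative numbers, not the chapter's dataset, but they show every step: fit a mean and variance per class, evaluate both densities at the unknown weight, and take the larger:

```python
import math

# Hypothetical weights in kilograms -- illustrative only.
female_weights = [55.0, 60.0, 58.0, 62.0, 65.0]
male_weights = [75.0, 80.0, 78.0, 85.0, 82.0]

def mean_and_variance(samples):
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, variance

def gaussian_pdf(x, mean, variance):
    return (math.exp(-(x - mean) ** 2 / (2.0 * variance))
            / math.sqrt(2.0 * math.pi * variance))

# Fit one Gaussian model per class.
female_mean, female_variance = mean_and_variance(female_weights)
male_mean, male_variance = mean_and_variance(male_weights)

# Score the unknown subject's weight under each model and pick the
# class with the higher density (equal priors assumed).
unknown_weight = 72.0
female_density = gaussian_pdf(unknown_weight, female_mean, female_variance)
male_density = gaussian_pdf(unknown_weight, male_mean, male_variance)
guess = 'male' if male_density > female_density else 'female'
```

With these numbers the female model is Normal(60, 11.6) and the male model is Normal(80, 11.6), so a weight of 72 kg lands closer to the male mean and the male density wins.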
Let's assume that the following data is provided:
Beginning the development
We start with the standard simplistic test that serves to get the basic wiring in place for our classifier. First, the test:
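The original listing isn't reproduced here, so the following is just one plausible shape for that first wiring test: it only verifies that an untrained classifier can be constructed and declines to classify anything. A small class stub is included so the sketch runs on its own; the name `NaiveBayes` is an assumption, not necessarily the book's:

```python
import unittest

class NaiveBayes(object):
    # Stub so this sketch is self-contained; in the chapter, the
    # real class would live in its own module.
    def classify(self, observation):
        return None

class NaiveBayesTest(unittest.TestCase):
    def test_untrained_classifier_returns_nothing(self):
        classifier = NaiveBayes()
        # Height, weight, BMI -- made-up observation values.
        self.assertIsNone(classifier.classify([180.0, 75.0, 23.1]))
```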
And then the code:
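Again, this is a hedged sketch rather than the book's exact listing: the simplest code that could satisfy such a wiring test is an empty classifier that refuses to classify until it has been trained.

```python
class NaiveBayes(object):
    # The simplest thing that could possibly work: an empty
    # classifier that knows nothing yet. (Names are assumptions,
    # not the book's exact listing.)
    def __init__(self):
        self._classifications = {}

    def classify(self, observation):
        # No training data yet, so there is nothing to say.
        return None
```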
As the next step to approach a solution, let's try the case where we've only observed the data from a single class:
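A sketch of what such a test might look like, assuming a `train`/`classify` API (both names are my assumption); a minimal stand-in class is included so the sketch runs on its own:

```python
import unittest

class NaiveBayes(object):
    # Minimal stand-in so the test sketch is self-contained; the
    # real implementation is developed step by step in the chapter.
    def __init__(self):
        self._classification = None

    def train(self, observation, classification):
        self._classification = classification

    def classify(self, observation):
        return self._classification

class NaiveBayesTest(unittest.TestCase):
    def test_classifies_with_the_only_class_it_has_seen(self):
        classifier = NaiveBayes()
        classifier.train([170.0, 60.0, 20.8], 'a classification')
        # Having seen only one class, the classifier should answer
        # with that class for any observation.
        self.assertEqual('a classification',
                         classifier.classify([182.0, 77.0, 23.2]))
```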
A very simple solution is to just set a single classification that gets set every time we train something:
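That "fake it" step might look like the following (again a sketch under the assumed `train`/`classify` API): remember whatever classification training last handed us, and answer with it unconditionally.

```python
class NaiveBayes(object):
    def __init__(self):
        self._classification = None

    def train(self, observation, classification):
        # Deliberately naive: keep only the most recent
        # classification we were trained on...
        self._classification = classification

    def classify(self, observation):
        # ...and return it no matter what we're asked about.
        # Later tests will force us to generalize this.
        return self._classification
```

This obviously can't survive a second class, which is exactly the point: the next test we write will have to introduce a second classification and drive out a real implementation.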
In this chapter, we built up a Gaussian Naïve Bayes classifier and ran into our first examples of truly necessary refactoring. We also saw how needing to make enormous changes in the code for a single test is sometimes the result of trying to test too many concepts at once, and how backing up and rethinking test design can ultimately lead to a better, more elegantly designed piece of software.
In the next chapter, we'll apply this classifier to real data and see what it looks like to compare how different classifiers perform on the same data.