Test-Driven Machine Learning

By Justin Bozonier

About this book

Machine learning is the process of teaching machines to remember data patterns, using them to predict future outcomes, and offering choices that would appeal to individuals based on their past preferences.

Machine learning is applicable to a lot of what you do every day. As a result, you can’t take forever to deliver your first iteration of software. Learning to build machine learning algorithms within a controlled test framework will speed up your time to deliver, quantify quality expectations with your clients, and enable rapid iteration and collaboration.

This book will show you how to quantifiably test machine learning algorithms. The very different, foundational approach of this book starts every example algorithm with the simplest thing that could possibly work. With this approach, seasoned veterans will find simpler approaches to beginning a machine learning algorithm. You will learn how to iterate on these algorithms to enable rapid delivery and improve performance expectations.

The book begins with an introduction to test driving machine learning and quantifying model quality. From there, you will test a neural network, predict values with regression, and build upon regression techniques with logistic regression. You will discover how to test different approaches to naïve Bayes and compare them quantitatively, along with how to apply OOP (Object-Oriented Programming) and OOP patterns to test-driven code, leveraging scikit-learn.

Finally, you will walk through the development of an algorithm which maximizes the expected value of profit for a marketing campaign by combining one of the classifiers covered with the multiple regression example in the book.

Publication date:
November 2015
Publisher
Packt
Pages
190
ISBN
9781784399085

 

Chapter 1. Introducing Test-Driven Machine Learning

This book will show you how to develop complex software (sometimes rooted in randomness) in small, controlled steps. It will also instruct you in how to begin developing solutions to machine learning problems using test-driven development (from here, this will be written as TDD). Mastering TDD is not something this book will achieve. Instead, this book will help you begin your journey and expose you to guiding principles, which you can use to creatively solve challenges as you encounter them.

We will answer the following three questions in this chapter:

  • What are TDD and BDD (behavior-driven development)?

  • How do we apply these concepts to machine learning, and make inferences and predictions?

  • How does this work in practice?

After gaining answers to these questions, we will be ready to move on to tackling real problems. This book is about applying these concepts to solve machine learning problems. This chapter contains the largest theoretical explanation that we will see in the book, with the remainder of the theory being described by example.

Due to the focus on application, you will learn much more than simply the theory of TDD and BDD. However, there are aspects of these practices that this book will not touch on. To read more about the theory and ideas, search the Internet for articles written by the following:

  • Kent Beck—The father of TDD

  • Dan North—The father of BDD

  • Martin Fowler—The father of refactoring. He has also created a large knowledge base on these topics

  • James Shore—One of the authors of The Art of Agile Development, who has a deep theoretical understanding of TDD, and explains the practical value of it quite well

These concepts are incredibly simple and yet can take a lifetime to master. When applied to machine learning, we must find new ways to control and/or measure the random processes inherent in the algorithm. This will come up in this chapter as well as others. In the next section, we will develop a foundation for TDD and begin to explore its application.

 

Test-driven development


Kent Beck wrote in his seminal book on the topic that TDD consists of only two specific rules, which are as follows:

  • Don't write a line of new code unless you first have a failing automated test

  • Eliminate duplication

This, as he notes fairly quickly, leads us to a mantra, really the mantra of TDD: "Red, Green, Refactor."

If this is a bit abstract, let me restate that TDD is a software development process that enables a programmer to write code that specifies the intended behavior before writing any software to actually implement the behavior. The key value of TDD is that at each step of the way, you have working software as well as an itemized set of specifications.

TDD is a software development process that requires the following:

  • The writing of code to detect the intended behavioral change.

  • A rapid iteration cycle that produces working software after each iteration.

  • Clear definitions of what a bug is. If a test is not failing but a bug is found, it is not a bug. It is a new feature.

Another point that Kent makes is that ultimately, this technique is meant to reduce fear in the development process. Each test is a checkpoint along the way to your goal. If you stray too far from the path and wind up in trouble, you can simply delete any tests that shouldn't apply, and then work your code back to a state where the rest of your tests pass. There's a lot of trial and error inherent in TDD, but the same applies to machine learning.

As a result, this whole process changes how we think about design. The software that you design using TDD will also be modular enough to have different components swapped in and out of your pipeline. We will see more of this in the later chapters of this book.

You might be thinking that just thinking through test cases is equivalent to TDD. If you are like most people, what you write is different from what you might verbally say, and very different from what you think. Writing the intent of our code before we write the code itself applies a pressure to the software design that prevents us from writing "just in case" code. By this I mean the code that we write just because we aren't sure if there will be a problem. Using TDD, we think of a test case, prove that it isn't supported currently, and then fix it. If we can't think of a test case, we don't add code.

TDD can and does operate at many different levels of the software under development. Tests can be written against functions and methods, entire classes, programs, web services, neural networks, random forests, and whole machine learning pipelines. At each level, the tests are written from the perspective of the prospective client. How does this relate to machine learning? Let's take a step back and reframe what I just said.

In the context of machine learning, tests can be written against functions, methods, classes, mathematical implementations, and entire machine learning algorithms. TDD can even be used to explore techniques and methods in a very directed and focused manner, much like you might use a REPL (an interactive shell where you can try out snippets of code) or interactive Python (IPython) sessions.

 

The TDD cycle


The TDD cycle consists of writing a small function in the code that attempts to do something that we haven't programmed yet. These small test methods will have three main sections: the first section is where we set up our objects or test data; the second section is where we invoke the code that we're testing; and the last section is where we validate that what happened is what we thought would happen. You will write all sorts of lazy code to get your tests to pass. If you are doing it right, then someone who is watching you should be appalled at your laziness and tiny steps. After the test goes green, you have an opportunity to refactor your code to your heart's content. In this context, "refactor" refers to changing how your code is written, but not changing how it behaves.

Let's examine more deeply the three steps of TDD: Red, Green, and Refactor.

Red

First, create a failing test. Of course, this implies that you know what failure looks like in order to write the test. At the highest level in machine learning, this might be a baseline test, where the baseline is a "better than random" test. It might even be "predicts random things", or even simpler, "always predicts the same thing". Is this terrible? Perhaps it is to those who are enamored with the elegance and artistic beauty of their code. Is it a good place to start, though? Absolutely. A common issue that I have seen in machine learning is spending so much time up front implementing The One True Algorithm that hardly anything ever gets done. Getting to outperform pure randomness, though, is a useful change that can start making your business money as soon as it's deployed.

Green

After you have established a failing test, you can start working to get it green. If you start with a very high-level test, you may find that it helps to conceptually break that test up into multiple failing tests that are lower-level concerns. I'll dive deeper into this later on in this chapter, but for now, just know that if you want to get your test passing as soon as possible, lie, cheat, and steal to get there. I promise that cheating actually makes your software's test suite that much stronger. Resist the urge to write the software in an ideal fashion. Just slap something together. You will be able to fix the issues in the next step.

Refactor

You got your test to pass through all manner of hackery. Now you get to refactor your code. Note that the term is not to be interpreted loosely. Refactoring specifically means changing your software without affecting its behavior. If you add if clauses, or any other special handling, you are no longer refactoring; you are writing software without tests. One way you will know for sure that you are no longer refactoring is if you've broken previously passing tests. If this happens, back up your changes until the tests pass again. It may not be obvious, but passing tests alone are not enough to know that you haven't changed behavior. Read Refactoring: Improving the Design of Existing Code by Martin Fowler to understand how much care refactoring really deserves. As he illustrates in that book, refactoring code becomes a set of forms and movements, not unlike karate katas.

This is a lot of general theory, but what does a test actually look like? How does this process flow in a real problem?

 

Behavior-driven development


BDD is the addition of business concerns to the technical concerns more typical of TDD. This came about as people became more experienced with TDD. They started noticing some patterns in the challenges that they were facing. One especially influential person, Dan North, proposed some specific language and structure to ease some of these issues. The following are some of the issues he noticed:

  • People had a hard time understanding what they should test next.

  • Deciding what to name a test could be difficult.

  • How much to test in a single test always seemed arbitrary.

Now that we have some context, we can define what exactly BDD is. Simply put, it's about writing our tests in such a way that they tell us the kind of behavior change they effect. A good litmus test might be asking yourself whether the test you are writing would be worth explaining to a business stakeholder. How this solves the previous problems may not be completely obvious, but it may help to illustrate what this looks like in practice. It follows a structure of "Given, When, Then". Committing to this style completely can require specific frameworks or a lot of testing ceremony. As a result, I loosely follow this in my tests, as you will see soon. Here's a concrete example of a test description written in this style: "Given an empty dataset, when the classifier is trained, then it should throw an invalid operation exception".

This sentence probably seems like a small enough unit of work to tackle, but notice that it's also a piece of work that any business user who is familiar with the domain that you're working in, would understand and have an opinion on.
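As a rough sketch of how that description could become a test, the snippet below uses a made-up MajorityClassifier, with Python's built-in ValueError standing in for an "invalid operation exception". Both choices are assumptions for illustration only and are not code from later chapters.

```python
class MajorityClassifier:
  """Hypothetical classifier, used only to illustrate the test style."""
  def __init__(self):
    self._most_common_label = None

  def train(self, labels):
    if len(labels) == 0:
      # Assumption: ValueError stands in for an "invalid operation" error.
      raise ValueError('cannot train on an empty dataset')
    self._most_common_label = max(set(labels), key=labels.count)

def given_an_empty_dataset_when_the_classifier_is_trained_test():
  # given
  classifier = MajorityClassifier()
  empty_dataset = []
  # when
  error = None
  try:
    classifier.train(empty_dataset)
  except ValueError as caught:
    error = caught
  # then
  assert error is not None, 'it should raise an invalid operation error.'
```

The test name carries the "given" and "when", and the assert message carries the "then", so a failure report reads like the sentence a stakeholder would have agreed to.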

You can read more about Dan North's point of view in this article on his website at http://dannorth.net/introducing-bdd/.

BDD adherents tend to use specialized tools to make the language and test result reports as accessible to business stakeholders as possible. In my experience and from my discussions with others, this extra elegance is typically used so little that it doesn't seem worthwhile. This book takes a simplicity-first approach to make it as easy as possible for someone with zero background to get up to speed.

With this in mind, let's work through an example.

 

Our first test


Let's start with an example of what a test looks like in Python. We will be using nosetests throughout this book. The main reason for using it is that, while it is a bit of a pain to install a library, this library in particular will make everything that we do much simpler. The default unit test solution in Python requires a heavier setup. On top of this, by using nose, we can always mix in tests that use the built-in solution when we find that we need the extra features.

First, install it like this:

pip install nose

If you have never used pip before, it is a very simple way to install new Python libraries.

Now, as a hello world style example, let's pretend that we're building a class that will guess a number using the previous guesses to inform it. This is the simplest example to get us writing some code. We will use the TDD cycle that we discussed previously, and write our first test in painstaking detail. After we get through our first test and have something concrete to discuss, we will talk about the anatomy of the test that we wrote.

First, we must write a failing test. The simplest failing test that I can think of is the following:

def given_no_information_when_asked_to_guess_test():
  number_guesser = NumberGuesser()
  result = number_guesser.guess()
  assert result is None, "Then it should provide no result."

The context for the assert is in the test name. Reading the test name and then the assert message should do a pretty good job of describing what is being tested. Notice that in my test, I instantiate a NumberGuesser object. You're not missing any steps; this class doesn't exist yet. This seems roughly like how I'd want to use it, so it's a great place to start. Since the class doesn't exist, wouldn't you expect this test to fail? Let's test this hypothesis.

To run the test, first make sure your test file is saved so that it ends in _tests.py. From the directory with the previous code, just run the following:

nosetests

When I do this, I get the following result:

There's a lot going on here, but the most informative part is near the end. The message is saying that NumberGuesser does not exist yet, which is exactly what I expected since we haven't actually written the code yet. Throughout the book, we'll reduce the detail of the stack traces that we show. For now, we'll keep things detailed to make sure that we're on the same page. At this point, we're in the "red" state of the TDD cycle:

  1. Next, create the following class in a file named NumberGuesser.py:

    class NumberGuesser:
      """Guesses numbers based on the history of your input"""
  2. Import the new class at the top of my test file with a simple import NumberGuesser statement.

  3. When I rerun nosetests, I get the following:

    TypeError: 'module' object is not callable

    Oh whoops! I guess that's not the right way to import the class. This is another very tiny step, but what is important is that we are making forward progress through constant communication with our tests. We are going through this in extreme detail because I can't stress this point enough. I will stop being as deliberate in the following chapter, so bear with me for the time being.

  4. Change the import statement to the following:

    from NumberGuesser import NumberGuesser
  5. Rerun nosetests and you will see the following:

    AttributeError: NumberGuesser instance has no attribute 'guess'
  6. The error message has changed, and is leading to the next thing that needs to be changed. From here, we just implement what we think we need for the test to pass:

    class NumberGuesser:
      """Guesses numbers based on the history of your input"""
      def guess(self):
        return None
  7. On rerunning the nosetests, we'll get the following result:

That's it! Our first successful test! Some of these steps seem so tiny as to not be worthwhile. Indeed, over time, you may decide that you prefer to work at a different level of detail. For the sake of argument, we'll be keeping our steps pretty small, if only to illustrate just how much TDD keeps us on track and guides us on what to do next. We all know how to write code in very large, uncontrolled steps. Learning to code surgically requires intentional practice, and is worth doing explicitly. Let's take a step back and look at what this first round of testing took.

The anatomy of a test

Starting from a higher level, notice how I had a dialog with Python. I just wrote the test and Python complained that the class that I was testing didn't exist. Next, I created the class, but then Python complained that I didn't import it correctly. So, then I imported it correctly, and Python complained that my "guess" method didn't exist. In response, I implemented the method the way that my test expected, and Python stopped complaining.

This is the spirit of TDD. There is a conversation between yourself and your system. You can work in steps as small or as large as you're comfortable with. What I did previously could have been skipped entirely, with the Python class written and imported correctly the first time. The longer you go without "talking" to the system, the more likely you are to stray from the path of getting things working as simply as possible.

Let's zoom in a little deeper and dissect this simple test to see what makes it tick. Here is the same test, but I've commented it, and broken it into sections that you will see recurring in every test that you write:

def given_no_information_when_asked_to_guess_test():
  # given
  number_guesser = NumberGuesser()
  # when
  guessed_number = number_guesser.guess()
  # then
  assert guessed_number is None, 'there should be no guess.'

Given

This section sets up the context for the test. In the previous test, you saw that I didn't provide any prior information to the object. In many of our machine learning tests, this will be the most complex portion of our test. We will be importing certain sets of data, sometimes introducing a few specific issues into the data, and testing that our software handles them in the way that we would expect. When you think about this section of your tests, try to frame it as "Given this scenario…". In our test, we might say "Given no prior information for NumberGuesser…".

When

This should be one of the simplest aspects of our test. Once you've set up the context, there should be a simple action that triggers the behavior that you want to test. When you think about this section of your tests, try to frame it as "When this happens…". In our test we might say "When NumberGuesser guesses a number…".

Then

This section of our test will check on the state of our variables and any returned results, if applicable. Again, this section should also be fairly straightforward, as there should be only a single action that causes a change to your object under test. The reason for this is that if it takes two actions to form a test, then it is very likely that we will just want to combine the two into a single action that we can describe in terms that are meaningful in our domain. A key example may be loading the training data from a file and training a classifier. If we find ourselves doing this a lot, then why not just create a method that loads data from a file for us?

As we progress through this book, you will find examples where helper functions help us determine whether our results have changed in certain ways. Typically, we should view these helper functions as code smells. Remember that our tests are the first applications of our software. Anything that we have to build in addition to our code, to understand the results, is something that we should probably (there are exceptions to every rule) just include in the code we are testing.

"Given, When, Then" is not a strong requirement of TDD, because our previous definition of TDD only consisted of two things (all that the code requires is a failing test first and to eliminate duplication). We will still follow this convention in this book because:

  • Following some conventions throughout the book will make it much more readable.

  • It is the culmination of the thoughts of many people who were beginning to see patterns in how they were using TDD. This is a technique that has changed how I approach testing, so I use it here.

It's a small thing to be passionate about and if it doesn't speak to you, just translate this back into "Arrange, Act, Assert" in your head. At the very least, consider it as well as why these specific, very deliberate words are used.

 

TDD applied to machine learning


At this point, you may be wondering how TDD will be used in machine learning, and whether we use it on regression or classification problems. In every machine learning algorithm there exists a way to quantify the quality of what you're doing. In linear regression it's your adjusted R² value; in classification problems it's an ROC curve (and the area beneath it), a confusion matrix, and more. All of these are testable quantities. Of course, none of these quantities has a built-in way of saying that the algorithm is good enough.

We can get around this by starting our work on every problem by first building up a completely naïve and ignorant algorithm. The scores that we get for this will basically represent plain old random chance. Once we have built an algorithm that can beat our random chance scores, we just start iterating, attempting to beat the next highest score that we achieve. Benchmarking algorithms is an entire field in its own right that can be delved into more deeply.

In this book, we will implement a naïve algorithm to get a random chance score, and we will build up a small test suite that we can then use to pit this model against another. This will allow us to have a conversation with our machine learning models in the same manner as we had with Python earlier.
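A minimal sketch of such a comparison might look like the following. The accuracy helper, the seeded random guesser, the hand-written threshold_model, and the deliberately noisy labels are all made up for illustration; the seed is fixed so that the baseline score is repeatable from run to run.

```python
import random

def accuracy(predict, examples):
  """Fraction of (feature, label) pairs that predict() labels correctly."""
  correct = sum(1 for feature, label in examples if predict(feature) == label)
  return correct / float(len(examples))

def make_random_guesser(labels, seed):
  """Naive baseline: ignores the feature and picks a label at random."""
  generator = random.Random(seed)
  return lambda feature: generator.choice(labels)

def noisy_label(value):
  """Made-up ground truth: a threshold rule with every tenth label flipped."""
  label = 'high' if value >= 50 else 'low'
  if value % 10 == 0:
    label = 'low' if label == 'high' else 'high'
  return label

def threshold_model(feature):
  """Hypothetical candidate model: a hand-written rule on one feature."""
  return 'high' if feature >= 50 else 'low'

def given_a_candidate_model_when_scored_against_the_baseline_test():
  # given
  examples = [(value, noisy_label(value)) for value in range(100)]
  baseline = make_random_guesser(['high', 'low'], seed=42)
  # when
  baseline_score = accuracy(baseline, examples)
  candidate_score = accuracy(threshold_model, examples)
  # then
  assert candidate_score > baseline_score, 'the candidate should beat random chance.'
```

Once a candidate passes, its score can become the next number to beat, which keeps the same test shape through every later iteration.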

For a professional machine learning developer, it's quite likely that the ideal metric to test is a profitability model that compares risk (monetary exposure) to expected value (profit). This can help us keep a balanced view of how much error and what kind of error we can tolerate. In machine learning, we will never have a perfect model, and we can search for the rest of our lives for "the best" model. By finding a way to work your financial assumptions into the model, we will improve our ability to decide between the competing models. We will definitely touch on this topic throughout the book, so it's good to keep it in mind.
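To make the idea concrete, here is a small sketch of a profit-based test. The payoff values, the two sets of confusion-matrix counts, and the expected_profit helper are invented for illustration and do not come from the book's later marketing example.

```python
def expected_profit(confusion_counts, profit_per_outcome):
  """Average profit per prediction, given outcome counts and their payoffs."""
  total = float(sum(confusion_counts.values()))
  return sum(count * profit_per_outcome[outcome]
             for outcome, count in confusion_counts.items()) / total

def given_two_models_when_compared_on_expected_profit_test():
  # given: hypothetical payoffs for a marketing offer, in dollars
  profit_per_outcome = {
    'true_positive': 20.0,   # offer sent, customer buys
    'false_positive': -2.0,  # offer sent, customer ignores it
    'false_negative': 0.0,   # no offer sent, sale missed
    'true_negative': 0.0,    # no offer sent, none wanted
  }
  cautious_model = {'true_positive': 10, 'false_positive': 5,
                    'false_negative': 30, 'true_negative': 55}
  eager_model = {'true_positive': 35, 'false_positive': 40,
                 'false_negative': 5, 'true_negative': 20}
  # when
  cautious_profit = expected_profit(cautious_model, profit_per_outcome)
  eager_profit = expected_profit(eager_model, profit_per_outcome)
  # then
  assert eager_profit > cautious_profit, 'the eager model should earn more per customer.'
```

In this made-up data the eager model is less accurate (55% of predictions correct versus 65%) yet earns more per customer (6.20 versus 1.90), which is the kind of trade-off that a pure accuracy test would hide.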

 

Dealing with randomness


Dealing with randomness in algorithms can be a huge mental block for some people when they try to understand how they might use TDD. TDD is so deterministic, intentional, and controlled that your initial gut reaction to introducing a random process may be to think that it makes TDD impossible. This is a place where TDD actually shines though. Here's how.

Let's pick up where we left off on the simplistic NumberGuesser from earlier. We're going to add a requirement so that it will randomly choose among the numbers that the user has guessed, weighted by how likely each one is.

To get there, I first have the NumberGuesser guess whatever the previous number was revealed to be every time I ask for a guess. The test for this looks like the following:

def given_one_datapoint_when_asked_to_guess_test():
  #given
  number_guesser = NumberGuesser()
  previously_chosen_number = 5
  number_guesser.number_was(previously_chosen_number)
  #when
  guessed_number = number_guesser.guess()
  #then
  assert type(guessed_number) is int, 'the answer should be a number'
  assert guessed_number == previously_chosen_number, 'the answer should be the previously chosen number.'

It's a simple test that ultimately just requires us to set a variable value in our class. The behavior of predicting on the basis of the last previous input can be valuable. It's the simplest prediction that we can start with.

If you run your tests here, you will see them fail. This is what my code looks like after getting this to pass:

class NumberGuesser:
  """Guesses numbers based on the history of your input"""
  def __init__(self):
    # Start with no guess so that guess() returns None before any input.
    self._guessed_number = None
  def number_was(self, guessed_number):
    self._guessed_number = guessed_number
  def guess(self):
    return self._guessed_number

Upon making this test pass, we can review it for any refactoring opportunities. It's still pretty simple, so let's keep going. Next, I will have NumberGuesser randomly choose from all of the numbers that were previously guessed, instead of just the last previous guess. I will start with making sure that the guessed number is the one that I've seen before:

def given_two_datapoints_when_asked_to_guess_test():
  #given
  number_guesser = NumberGuesser()
  previously_chosen_numbers = [1,2,5]
  number_guesser.numbers_were(previously_chosen_numbers)
  #when
  guessed_number = number_guesser.guess()
  #then
  assert guessed_number in previously_chosen_numbers, 'the guess should be one of the previously chosen numbers'

Running this test now will cause a new failure. While thinking about the laziest way of getting this test to work, I realized that I can cheat big time. All I need to do is create my new method, and take the first element in the list:

class NumberGuesser:
  """Guesses numbers based on the history of your input"""
  def __init__(self):
    # Start with no guess so that guess() returns None before any input.
    self._guessed_number = None
  def numbers_were(self, guessed_numbers):
    # The laziest thing that passes: always use the first number.
    self._guessed_number = guessed_numbers[0]
  def number_was(self, guessed_number):
    self._guessed_number = guessed_number
  def guess(self):
    return self._guessed_number

For our purposes, laziness is king. Laziness guards us from the over-engineered solutions, and forces our test suite to become more robust. It does this by making our problem-solving faster, and spurring an uncomfortable feeling that will prompt us to test more edge cases.

So, now I want to assert that I don't always choose the same number. I don't want to force it to always choose a different number, but there should be some mixture. To test this, I will refactor my test, and add a new assertion, as follows:

def given_multiple_datapoints_when_asked_to_guess_many_times_test():
  #given
  number_guesser = NumberGuesser()
  previously_chosen_numbers = [1,2,5]
  number_guesser.numbers_were(previously_chosen_numbers)
  #when
  guessed_numbers = [number_guesser.guess() for i in range(0,100)]
  #then
  for guessed_number in guessed_numbers:
    assert guessed_number in previously_chosen_numbers, 'every guess should be one of the previously chosen numbers'
  assert len(set(guessed_numbers)) > 1, "It shouldn't always guess the same number."

I get the test failure message "It shouldn't always guess the same number.", which is perfect. This test also causes others to fail, so I will work out the simplest thing that I can do to make everything green again, and I will end up here:

import random
class NumberGuesser:
  """Guesses numbers based on the history of your input"""
  def __init__(self):
    self._guessed_numbers = None
  def numbers_were(self, guessed_numbers):
    self._guessed_numbers = guessed_numbers
  def number_was(self, guessed_number):
    self._guessed_numbers = [guessed_number]
  def guess(self):
    # No observations yet, so there is nothing to guess.
    if self._guessed_numbers is None:
      return None
    return random.choice(self._guessed_numbers)

There are probably many ways that one could get this test to pass. I solved it this way because it was the first thing that came to mind, and it feels like it's leading us in a good direction. What refactoring do we want to do here? Each method is a single line except the guess method. The guess method is still pretty simple, so let's keep going.

Now, I notice that whenever I use number_was to enter an observation, the guesser forgets everything else and will only ever guess that last number, which is bad. So, I need another test to catch this. Let's write the new test (this should be our fourth):

def given_a_starting_set_of_observations_followed_by_a_one_off_observation_test():
  #given
  number_guesser = NumberGuesser()
  previously_chosen_numbers = [1,2,5]
  number_guesser.numbers_were(previously_chosen_numbers)
  one_off_observation = 0
  number_guesser.number_was(one_off_observation)
  #when
  guessed_numbers = [number_guesser.guess() for i in range(0,100)]
  #then
  for guessed_number in guessed_numbers:
    assert guessed_number in previously_chosen_numbers + [one_off_observation], 'every guess should be one of the previously chosen numbers'
  assert len(set(guessed_numbers)) > 1, "It shouldn't always guess the same number."

This fails on the last assertion, which is perfect. I will make the test pass using the following code:

import random
class NumberGuesser:
  """Guesses numbers based on the history of your input"""
  def __init__(self):
    self._guessed_numbers = []
  def numbers_were(self, guessed_numbers):
    # Copy the list so that number_was() never mutates the caller's data.
    self._guessed_numbers = list(guessed_numbers)
  def number_was(self, guessed_number):
    self._guessed_numbers.append(guessed_number)
  def guess(self):
    if self._guessed_numbers == []:
      return None
    return random.choice(self._guessed_numbers)

There are other issues here that I can write failing tests for. Notice that if I were to provide a single observation and then a set of observations, the set would overwrite the single observation, yet every assertion that I've listed so far would still succeed. So, I write a new test to ensure that NumberGuesser guesses every number at least once. We can code this up in the following way:

def given_a_one_off_observation_followed_by_a_set_of_observations_test():
  #given
  number_guesser = NumberGuesser()
  previously_chosen_numbers = [1,2]
  one_off_observation = 0
  all_observations = previously_chosen_numbers + [one_off_observation]
  number_guesser.number_was(one_off_observation)
  number_guesser.numbers_were(previously_chosen_numbers)
  #when
  guessed_numbers = [number_guesser.guess() for i in range(0,100)]
  #then
  for guessed_number in guessed_numbers:
    assert guessed_number in all_observations, 'every guess should be one of the previously chosen numbers'
  assert len(set(guessed_numbers)) == len(all_observations), "It should eventually guess every number at least once."

And my final code looks like the following:

import random

class NumberGuesser:
  """Guesses numbers based on the history of your input"""
  def __init__(self):
    self._guessed_numbers = []
  def numbers_were(self, guessed_numbers):
    self._guessed_numbers += guessed_numbers
  def number_was(self, guessed_number):
    self._guessed_numbers.append(guessed_number)
  def guess(self):
    if self._guessed_numbers == []:
      return None
    return random.choice(self._guessed_numbers)

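Technically, there is a chance that this test will fail purely by random chance. With three distinct observations and 100 guesses, the test fails only if at least one number is never guessed, which happens with probability 3 x (2/3)^100 - 3 x (1/3)^100, or roughly 7.4 x 10^-18. Basically, the chance is zero.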
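To sanity-check how unlikely a purely random failure is, here is a minimal sketch that computes the chance of at least one number never being guessed, using inclusion-exclusion. The function name is hypothetical and not part of the book's code. It assumes that every distinct number is equally likely on each guess, which holds here because each observation appears exactly once in the history.

```python
from math import comb

def chance_some_number_is_never_guessed(distinct_numbers, guess_count):
    """Probability that uniform random guessing misses at least one number.

    Inclusion-exclusion over the sets of numbers that could be missed:
    sum over k of (-1)^(k+1) * C(n, k) * ((n - k) / n)^guess_count.
    """
    n = distinct_numbers
    total = 0.0
    for k in range(1, n):
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * ((n - k) / n) ** guess_count
    return total

# Three observations (0, 1 and 2) and 100 guesses, as in the test above.
failure_chance = chance_some_number_is_never_guessed(3, 100)
print(failure_chance)
```

For three numbers and 100 guesses the result is on the order of 10^-18, so the test is safe to keep in an automated suite.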

 

Different approaches to validating the improved models


Model quality validation, of course, depends upon the kinds of models that you're building and their purpose. This book covers a few general types of machine learning problems, and each has different ways of validating model quality.

Classification overview

We'll get to the specifics in just a moment, but let's review the high-level terms. One method for quantifying the quality of a supervised classifier is to use ROC curves. These can be quantified by finding the total area under the curve (AUC), by finding the location of the inflection point, or by simply setting a minimum on the percentage of the data that must be classified correctly.

Another common technique is the confusion matrix. Limits can be set on certain cells of the matrix to help drive testing. It can also be used as a diagnostic tool to help identify issues as they come up.
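As a rough illustration of setting limits on cells of a confusion matrix, the sketch below builds a matrix by hand and asserts on the false positive and false negative counts. The helper, the sample labels, and the limits are all hypothetical. This is a hand-rolled version for clarity, not scikit-learn's implementation.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def confusion_matrix_limits_test():
    #given (hypothetical labels; 1 is the positive class)
    actual    = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    #when
    matrix = confusion_matrix(actual, predicted, labels=[0, 1])
    (true_neg, false_pos), (false_neg, true_pos) = matrix
    #then
    assert false_pos <= 1, "It should rarely flag a negative as positive."
    assert false_neg <= 1, "It should rarely miss a positive."

confusion_matrix_limits_test()
```

When a limit is exceeded, the full matrix shows which kind of mistake the classifier is making, which is where its diagnostic value comes from.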

We will typically use k-fold cross-validation. Cross-validation is a technique where we take our sample dataset and divide it into several separate datasets. We can then use one of these datasets to develop our model, another to validate that the model isn't overfitted, and a third for a final check to see whether everything went well. All of these separate datasets work together to make sure that we develop a generally applicable model, and not just one that predicts our training data but falls apart in production.
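One common way to divide the data is to cut it into k folds and hold each fold out once while training on the rest. The following is a minimal sketch of that rotation with a hypothetical helper name. It assumes the samples are already shuffled and does not attempt to mirror any particular library's API.

```python
def k_fold_splits(samples, k):
    """Yield (training, validation) pairs, holding out each fold once."""
    # Deal the samples into k folds, like dealing cards.
    folds = [samples[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [sample
                    for index, fold in enumerate(folds) if index != held_out
                    for sample in fold]
        yield training, validation

for training, validation in k_fold_splits(list(range(10)), k=5):
    # No sample may appear on both sides of a split.
    assert not set(training) & set(validation)
    print(len(training), len(validation))
```

Each sample is used for validation exactly once, so every part of the data gets a turn at exposing an overfitted model.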

Regression

Linear regression quality is typically quantified with a combination of an adjusted R² value and a check that the residuals of the model don't follow a pattern. How do we check for this in an automated test?

Adjusted R² values are provided by most statistical tools. The adjusted R² is a quick measure of how much of the variation in the data is explained by your model. Checking model assumptions is more difficult. It is much easier to see patterns visually than via discrete, specific tests.
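To make the adjusted R² check concrete, here is a small sketch that computes it from actual and predicted values and asserts a minimum in a test. The function names, the sample data, and the 0.95 threshold are hypothetical. The formula used is the standard one, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    residual_sum = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_sum = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual_sum / total_sum

def adjusted_r_squared(actual, predicted, predictor_count):
    """Penalize R² for the number of predictors in the model."""
    n = len(actual)
    r2 = r_squared(actual, predicted)
    return 1 - (1 - r2) * (n - 1) / (n - predictor_count - 1)

def adjusted_r_squared_threshold_test():
    #given (hypothetical observations and model output)
    actual    = [1.0, 2.1, 2.9, 4.2, 5.0, 5.9]
    predicted = [1.1, 2.0, 3.0, 4.0, 5.0, 6.0]
    #when
    quality = adjusted_r_squared(actual, predicted, predictor_count=1)
    #then
    assert quality > 0.95, "The model should explain most of the variation."

adjusted_r_squared_threshold_test()
```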

So, this is hard, but there are other tests, perhaps even more important ones, that are easier: cross-validation. By selecting strong test datasets with a litany of misbehavior, we can compare R² statistics from development, to testing, to production readiness. If a serious drop occurs at any point, then we can circle back.

Clustering

Clustering is the way in which we create our classification model. From there, we can test it by cross-validating against our data. This can be especially useful with clustering algorithms such as k-means, where the feedback can help us tune the number of clusters to minimize the variation within each cluster. As we move from one cross-validation dataset to another, it's important not to carry over training data from the previous tests, lest we bias our results.

 

Quantifying the classification models


To make sure that we're on the same page, let's start by looking at an example of an ROC curve and the AUC score. The scikit-learn documentation has example code that builds an ROC curve and calculates the AUC, which you can find at http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html.

This ROC curve was built by running a classifier over the famous iris dataset. It shows us the true positive rate (y-axis) that we can get if we allow a given false positive rate (x-axis). For example, if we were comfortable with a 50 percent false positive rate, we would expect to see somewhere around a 90 percent true positive rate. Also, notice that the AUC is 80 percent. Keeping in mind that a perfect classifier would score 100 percent, this seems pretty great. The dashed line in the chart represents a terrible and completely random (read: non-predictive) model. An ideal model would be one that is pulled toward the upper left-hand corner of the chart as much as possible. You can see in this chart that the model is somewhere between the two, which is pretty good. Whether or not that is acceptable depends on the problem that is being solved. How so?
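To show how these quantities could be computed and checked in a test, the sketch below builds ROC points from scored examples and integrates them with the trapezoidal rule. It is written for Python 3, the function names and sample data are hypothetical, and it assumes that all scores are distinct (tied scores would need to be grouped). It is a hand-rolled illustration rather than scikit-learn's implementation.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    # Lower the threshold one example at a time, highest score first.
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under (x, y) points that are sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def auc_meets_minimum_test():
    #given (hypothetical labels and classifier scores)
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    #when
    auc = area_under_curve(roc_points(labels, scores))
    #then
    assert auc >= 0.8, "The classifier should clearly beat random guessing."

auc_meets_minimum_test()
```

A completely random model would hover around an AUC of 0.5, which is the area under the dashed diagonal.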

Well, what if our classifier is attempting to identify the customers who would respond well to an advertisement? Every customer we show it to who doesn't respond well has some chance of never doing business with us again. Let's say (though it's quite extreme) that the cost is so high that we need to eliminate all the false positives. Judging from our previous curve, this would mean that we would only identify 10-15 percent of the true positives that exist. In this example, even that small boost in performance makes us more money, so the model still works quite well for our situation.

Imagine there's a one in 10,000 chance that if we incorrectly show a specific ad to someone, they'll sue us, and that it will cost us $25,000 on average. Now, what does a good model look like? Here's a chart that I've created from the same ROC data as before, but with this new set of parameters:

The maximum profit occurs right around a 1.9 percent false positive rate. As you can see, there is a huge drop-off after that, even though this classifier works pretty well. For the purposes of this chapter, we can leave writing the code for such a thing until later. For now, it's fine to just have this gain chart. We'll get into guiding our process with these kinds of results in future chapters.
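Although the code behind the chart comes later in the book, the arithmetic can be sketched now. The lawsuit probability and cost below come from the text. The revenue per response, the audience sizes, and the ROC points are hypothetical placeholders, so this sketch will not reproduce the 1.9 percent figure.

```python
LAWSUIT_PROBABILITY = 1 / 10000   # from the text
LAWSUIT_COST = 25000              # dollars, from the text
REVENUE_PER_RESPONSE = 1.00       # dollars, hypothetical

def expected_profit(false_pos_rate, true_pos_rate, responders, non_responders):
    """Expected profit at a single point on the ROC curve."""
    cost_per_false_positive = LAWSUIT_PROBABILITY * LAWSUIT_COST
    revenue = true_pos_rate * responders * REVENUE_PER_RESPONSE
    cost = false_pos_rate * non_responders * cost_per_false_positive
    return revenue - cost

def most_profitable_point(points, responders, non_responders):
    """Pick the (false positive rate, true positive rate) pair earning the most."""
    return max(points,
               key=lambda point: expected_profit(point[0], point[1],
                                                 responders, non_responders))

# Hypothetical ROC points and audience sizes.
points = [(0.0, 0.0), (0.02, 0.3), (0.5, 0.9), (1.0, 1.0)]
print(most_profitable_point(points, responders=5000, non_responders=5000))
```

Each false positive carries an expected cost of about $2.50 (one in 10,000 times $25,000), so even a well-performing classifier loses money quickly as the false positive rate rises.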

 

Summary


In this chapter, you were introduced to TDD, as well as BDD. With these concepts introduced, you have a basic foundation with which to approach machine learning. We saw that specifying behavior in the form of sentences makes it easier to read a set of specifications for your software.

Building on that foundation, we started to delve into testing at a higher level. We did this by establishing concepts that we can use to quantify classifiers: the ROC curve and the AUC metric. Now that we've seen that different models can be quantified, it follows that they can be compared.

Putting all of this together, we have everything we need to explore machine learning with a test-driven methodology. In the next chapter, we will build a simple perceptron algorithm with TDD and measure its quality.

About the Author

  • Justin Bozonier

    Justin Bozonier is a data scientist living in Chicago. He is currently a Senior Data Scientist at GrubHub. He has led the development of their custom analytics platform and of their first real-time split test analysis platform, which utilized Bayesian statistics. In addition, he has developed machine learning models for data mining as well as for prototyping product enhancements. Justin's software development expertise has earned him acknowledgements in the books Parallel Programming with Microsoft® .NET and Flow-Based Programming, Second Edition. He has also taught a workshop at PyData titled Simplified Statistics through Simulation.

    His previous work experience includes being an Actuarial Systems Developer at Milliman, Inc., contracting as a Software Development Engineer II at Microsoft, and working as a Sr. Data Analyst and Lead Developer at Cheezburger Network amongst other experience.

