Chapter 8. Exploring scikit-learn Test First
We've explored building machine learning models from scratch, we've learned about tuning a classifier, and we've seen how to use a third-party library, all in a test-driven manner. What we haven't really talked about is building a full system around these techniques. By "full system", we mean something that promotes sustainable and controlled software development. If we build an entire machine learning project without being careful, we can end up tightly coupled to a third-party library. This is a bad thing, because it can put us in a situation where changing the underlying library means rewriting the entire system.
Why is this important? After we create our first successful model, what are our options for further improvement down the road? If we chose a great library, then we'll have many options for tuning our model, and probably many options for other models as well. Do we really want...
Until now, we've left the concept of test-driven design (TDD) out of the discussion. Now that we have some experience applying TDD to concrete problems, it's a great time to discuss some of its less concrete aspects.
Test-driven design holds that the biggest value we get from TDD is the design of the code that results. It's completely fine if this is not immediately obvious. Think about it like this: how does code designed with the help of TDD differ from code that is designed up front and then written? Code designed with the help of TDD is built incrementally, in response to how effectively the current design is solving problems. Think of your tests as the first "user" of your code: if the code is difficult to test, then it is probably also difficult to use.
We saw an example of this when we built our Naïve Bayes classifier. We were heading down a route that was becoming increasingly difficult to test. Instead of charging ahead without testing...
TDD is a way to incrementally design our code, but this doesn't mean that we can stop thinking about it. Getting some idea of how we want to approach the problem can actually be quite useful, as long as we're prepared to leave behind our preconceptions if they don't lead to a testable design.
Let's take a moment to review the Naïve Bayes and Random Forest classifiers. What methods/functions do they have in common? It looks like batch_train and classify have the same signature, and both appear on both classes. Python doesn't have the concept of interfaces the way Java or C# do, but if it did, we might have an interface that looks like the following:
In programming languages such as C# and Java, interfaces are useful for defining the methods that one expects a class to have, but this happens without specifying any implementation details...
Next, let's explore hooking up the classifiers that we developed previously. We'll do it within our test framework, but we won't make it a true test yet. Let's just hook it up and poke at it with a stick to start off.
To do so, we can construct a test that must fail, so that we can see the output of the strategically placed print statements within our test and ClassifierChooser. This test will be more complex, since it will more closely mimic a real-world scenario. Here it is:
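To make the idea concrete, here is a minimal sketch of the shape such a test harness can take. The real ClassifierChooser is developed over the course of the chapter; the constructor argument, the (label, observation) format accepted by batch_train, and the use of training-set accuracy as the selection metric are all assumptions made for this sketch. The two stub classifiers stand in for our real algorithms:

```python
from collections import Counter


class ClassifierChooser:
    """Sketch: trains each candidate classifier and keeps the one with
    the best accuracy on the training data."""

    def __init__(self, classifiers):
        self._classifiers = classifiers
        self.best_classifier = None

    def train(self, observations, labels):
        best_accuracy = -1.0
        for classifier in self._classifiers:
            classifier.batch_train(list(zip(labels, observations)))
            correct = sum(1 for obs, label in zip(observations, labels)
                          if classifier.classify(obs) == label)
            accuracy = correct / len(labels)
            print(classifier, accuracy)  # "poke at it with a stick" output
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                self.best_classifier = classifier

    def classify(self, observation):
        return self.best_classifier.classify(observation)


class AlwaysZero:
    """An extremely simple test classifier: always predicts 0."""

    def batch_train(self, examples):
        pass

    def classify(self, observation):
        return 0


class MajorityLabel:
    """Predicts whichever label was most common during training."""

    def batch_train(self, examples):
        self._label = Counter(label for label, _ in examples).most_common(1)[0][0]

    def classify(self, observation):
        return self._label


observations = [[0], [1], [2], [3]]
labels = [1, 1, 1, 0]
chooser = ClassifierChooser([AlwaysZero(), MajorityLabel()])
chooser.train(observations, labels)
```

With this data, MajorityLabel scores 0.75 against AlwaysZero's 0.25, so the chooser keeps MajorityLabel; the print output lets us watch that decision happen.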
Developing testable documentation
In this part of the chapter, we'll explore different classifier algorithms and learn the ins and outs of each.
Let's start with decision trees. scikit-learn has some great documentation, which you can find at http://scikit-learn.org/stable/. So, let's jump over there and look up an example that shows how to use their decision tree. The following is a test with the details greatly simplified to get to the simplest possible example:
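A minimal version of such a test, in the spirit of the scikit-learn documentation's own DecisionTreeClassifier example, might look like this (the test name and the tiny data set are ours, not the book's):

```python
from sklearn.tree import DecisionTreeClassifier


def test_decision_tree_classifies_simple_data():
    # Train on two points with a simple linear relationship between
    # features and labels, then classify a nearby unseen point.
    classifier = DecisionTreeClassifier()
    classifier.fit([[0, 0], [1, 1]], [0, 1])
    assert classifier.predict([[2, 2]])[0] == 1


test_decision_tree_classifies_simple_data()
```

The point isn't that this is a rigorous test; it's that the documentation's example becomes executable, so a future library upgrade that breaks it will tell us immediately.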
A good place to start with most classification algorithms is to assume that they can accurately classify data with linear relationships. This test passes. We can look for more interesting bits to test as...
We covered a lot of material in this chapter. Once again, we moved incrementally, in small steps, in order to get specific software built. We also leveraged OOP to test our ClassifierChooser in isolation from our complex machine learning algorithms. Beyond this, we created extremely simple test classifiers as a way of decoupling from the more complex algorithms.
We now have the beginnings of a system that can test machine learning algorithms and choose the best one according to some metric. We've also established a pattern for bringing outside algorithms into our project, which includes wrapping the external library in an adapter. This ensures that you bend the third-party library to your needs rather than bending your system around the library, which would make your code brittle.
In the next chapter, we will be bringing all of the concepts that we've covered up to this point together. We'll have a project not unlike...