Chapter 8. Exploring scikit-learn Test First
We've explored building machine learning models from scratch, we've learned about tuning a classifier, and we've seen how to use a third-party library, all in a test-driven manner. What we haven't really talked about is building a full system around these techniques. By "full system", we mean something that promotes sustainable and controlled software development. If we build an entire machine learning project without being careful, we can end up tightly coupled to a third-party library. This is a bad thing, because it can put us in a situation where changing the underlying library means rewriting the entire system.
Why is this important? After we create our first successful model, what are our options for further improvement down the road? If we chose a great library, then we'll have many options for tuning our model, and probably many options for other models as well. Do we really want...
Until now, we've left the concept of test-driven design (TDD) out of the discussion. Now that we have some experience applying TDD to concrete problems, it's a great time to discuss some of its less concrete aspects.
Test-driven design holds that the biggest value we get from TDD is the design of the code that results. It's completely fine if this is not immediately obvious. Think about it like this: how does code designed with the help of TDD differ from code that is designed up front and then written? Code designed with the help of TDD is built incrementally, in response to how effectively the current design is solving problems. Think of your tests as the first "user" of your code: if the code is difficult to test, then it is probably also difficult to use.
We saw an example of this when we built our Naïve Bayes classifier. We were heading down a route that was becoming increasingly difficult to test. Instead of charging ahead without testing...
TDD is a way to incrementally design our code, but this doesn't mean that we can stop thinking about it. Getting some idea of how we want to approach the problem can actually be quite useful, as long as we're prepared to leave behind our preconceptions if they don't lead to a testable design.
Let's take a moment to review the Naïve Bayes and Random Forest classifiers. What methods/functions do they have in common? It looks like batch_train and classify have the same signature, and both appear on both classes. Python doesn't have the concept of interfaces the way Java or C# do, but if it did, we might have an interface that looks like the following:
In programming languages such as C# and Java, interfaces are useful for defining the methods that one expects a class to have, but this happens without specifying any implementation details...
Next, let's explore hooking up the classifiers that we developed previously. We'll do it within our test framework, but we won't make it a true test yet. Let's just hook it up and poke at it with a stick to start off.
To do so, we can construct a test that must fail, so that we can see the output of the strategically placed print statements within our test and ClassifierChooser. This test will be more complex, since it will more closely mimic a real-world scenario. Here it is:
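To make the idea concrete, here is a minimal sketch of the shape such a test harness can take. The real ClassifierChooser is developed over the course of the chapter; the constructor argument, the (label, observation) format accepted by batch_train, and the use of training-set accuracy as the selection metric are all assumptions made for this sketch. The two stub classifiers stand in for our real algorithms:

```python
from collections import Counter


class ClassifierChooser:
    """Sketch: trains each candidate classifier and keeps the one with
    the best accuracy on the training data."""

    def __init__(self, classifiers):
        self._classifiers = classifiers
        self.best_classifier = None

    def train(self, observations, labels):
        best_accuracy = -1.0
        for classifier in self._classifiers:
            classifier.batch_train(list(zip(labels, observations)))
            correct = sum(1 for obs, label in zip(observations, labels)
                          if classifier.classify(obs) == label)
            accuracy = correct / len(labels)
            print(classifier, accuracy)  # "poke at it with a stick" output
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                self.best_classifier = classifier

    def classify(self, observation):
        return self.best_classifier.classify(observation)


class AlwaysZero:
    """An extremely simple test classifier: always predicts 0."""

    def batch_train(self, examples):
        pass

    def classify(self, observation):
        return 0


class MajorityLabel:
    """Predicts whichever label was most common during training."""

    def batch_train(self, examples):
        self._label = Counter(label for label, _ in examples).most_common(1)[0][0]

    def classify(self, observation):
        return self._label


observations = [[0], [1], [2], [3]]
labels = [1, 1, 1, 0]
chooser = ClassifierChooser([AlwaysZero(), MajorityLabel()])
chooser.train(observations, labels)
```

With this data, MajorityLabel scores 0.75 against AlwaysZero's 0.25, so the chooser keeps MajorityLabel; the print output lets us watch that decision happen.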
Developing testable documentation
In this part of the chapter, we'll explore different classifier algorithms and learn the ins and outs of each.
Let's start with decision trees. scikit-learn has some great documentation, which you can find at http://scikit-learn.org/stable/. So, let's jump over there and look up an example that shows how to use their decision tree. The following is a test with the details greatly simplified to get to the simplest possible example:
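A minimal version of such a test, in the spirit of the scikit-learn documentation's own DecisionTreeClassifier example, might look like this (the test name and the tiny data set are ours, not the book's):

```python
from sklearn.tree import DecisionTreeClassifier


def test_decision_tree_classifies_simple_data():
    # Train on two points with a simple linear relationship between
    # features and labels, then classify a nearby unseen point.
    classifier = DecisionTreeClassifier()
    classifier.fit([[0, 0], [1, 1]], [0, 1])
    assert classifier.predict([[2, 2]])[0] == 1


test_decision_tree_classifies_simple_data()
```

The point isn't that this is a rigorous test; it's that the documentation's example becomes executable, so a future library upgrade that breaks it will tell us immediately.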
A good place to start with most classification algorithms is to assume that they can accurately classify data with linear relationships. This test passes. We can look for more interesting bits to test as...
We covered a lot of material in this chapter. Once again, we moved incrementally, in small steps, in order to get specific software built. We also leveraged OOP to test our ClassifierChooser in isolation from our complex machine learning algorithms. Beyond this, we created extremely simple test classifiers as a way of decoupling from the more complex algorithms.
We now have the beginnings of a system that can test machine learning algorithms and choose the best one according to some metric. We've also established a pattern for bringing outside algorithms into our project, which includes wrapping the external library in an adapter. This ensures that you bend the third-party library to your needs rather than bending your system around the library, which would make your code brittle.
In the next chapter, we will be bringing all of the concepts that we've covered up to this point together. We'll have a project not unlike...