Chapter 9. Bringing It All Together
In this chapter, we'll apply what we've learned in previous chapters to solve a marketing problem. We'll use classification and regression techniques to optimize our spending on an ad campaign. On top of this, we'll build upon the previous chapter so that we can improve our models over time. Rather than relying on a single simplistic measure of model quality, we'll dig in and automate some additional metrics to control what we can push to production. The challenge that we'll solve in this chapter uses a technique referred to as uplift modeling.
Here's the idea: let's say you want to launch a new mailing campaign to generate more business for yourself. Sending each letter costs a small amount of money, so ideally, you'd like to send the mailing only to the people from whom you have a reasonable chance of making money (known as persuadables). On top of this, there are some people who will stop using your service if you contact them (known as sleeping dogs).
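Before writing any real models, it can help to see the segments expressed as code. The following is an illustrative sketch only: the function name and the idea of having two per-customer probabilities (one if contacted, one if not) are assumptions for the sake of the example, not part of the chapter's codebase. Customers whose probability is unchanged by contact (the "sure things" and "lost causes") are lumped together here as having no effect.

```python
# Illustrative only: the uplift-modeling segments expressed as a function
# of two hypothetical probabilities per customer.
def uplift_segment(p_buy_if_contacted, p_buy_if_not):
    """Classify a customer by how contacting them changes their behavior."""
    if p_buy_if_contacted > p_buy_if_not:
        return "persuadable"    # contacting them creates business
    if p_buy_if_contacted < p_buy_if_not:
        return "sleeping dog"   # contacting them drives them away
    return "no effect"          # sure things and lost causes alike

print(uplift_segment(0.8, 0.2))  # persuadable
print(uplift_segment(0.2, 0.8))  # sleeping dog
```

The money is made (and saved) by mailing only the first group and deliberately leaving the second alone.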
Starting at the highest level
There's a lot going on here. We can simplify things by thinking only about how to solve our high-level problem, and saving the other solutions for later. Besides, we've already written the regression and classification algorithms; the worst case is that we may have to refactor them to work with the newer code that will use them. To begin with, we want to build a classifier that will identify the persuadables and the sleeping dogs. Using this, we can optimize how we spend ad money to generate new business while annoying as few of our customers as possible.
Here is one solid high-level test:
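The chapter's actual test listing isn't reproduced here, but a hypothetical sketch of such a harness-level test might look like the following. Everything in it, the `recommend_customers` function, the field names, and the cost threshold, is an assumption made for illustration, not the book's code: the point is only that the test pins down the harness's behavior (recommend persuadables, skip sleeping dogs) before any real model exists behind it.

```python
# Hypothetical sketch of a high-level test; recommend_customers and all
# field names are assumptions, not the book's actual code.
def recommend_customers(customers, cost_per_letter):
    """Recommend only customers whose expected profit lift from being
    contacted exceeds the cost of sending them a letter."""
    recommended = []
    for c in customers:
        lift = (c["p_order_if_contacted"] - c["p_order_if_not"]) * c["expected_spend"]
        if lift > cost_per_letter:
            recommended.append(c)
    return recommended

def test_recommends_persuadables_and_skips_sleeping_dogs():
    customers = [
        # A persuadable: contact greatly raises their chance of ordering.
        {"id": 1, "p_order_if_contacted": 0.9, "p_order_if_not": 0.1, "expected_spend": 50.0},
        # A sleeping dog: contact lowers their chance of ordering.
        {"id": 2, "p_order_if_contacted": 0.1, "p_order_if_not": 0.8, "expected_spend": 50.0},
    ]
    recommended = recommend_customers(customers, cost_per_letter=0.75)
    assert [c["id"] for c in recommended] == [1]

test_recommends_persuadables_and_skips_sleeping_dogs()
```

Notice that the test says nothing about which algorithms produce the probabilities; that freedom is exactly what lets us swap models in and out later.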
Now that we have this harness built that can make recommendations about which customers we should advertise to, we need to think about the kinds of algorithms that we want to plug into it. For the probability of a customer placing an order, we can use Logistic Regression or Naïve Bayes. To estimate how much money the customer might spend, we can use (depending on our data) Gaussian Naïve Bayes or Linear Regression.
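The combination can be sketched in a few lines. The data below is randomly generated purely for illustration (the two features and the spend formula are made up), but the shape of the idea is real: a classifier supplies the probability of an order, a regressor supplies the likely order amount, and multiplying the two gives an expected value per customer.

```python
# Sketch: combine a classifier's probability with a regressor's estimate
# to get an expected value per customer. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # two made-up customer features
ordered = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # did they order?
spend = 20 + 5 * X[:, 1] + rng.normal(size=200)        # how much they spent

p_order = LogisticRegression().fit(X, ordered)
# Fit the spend model only on customers who actually ordered.
amount = LinearRegression().fit(X[ordered == 1], spend[ordered == 1])

new_customers = rng.normal(size=(5, 2))
expected_value = (p_order.predict_proba(new_customers)[:, 1]
                  * amount.predict(new_customers))
```

Either half can later be swapped for Naïve Bayes or Gaussian Naïve Bayes without the harness noticing.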
To start off with, let's use Linear Regression and Logistic Regression. The main reason for this choice is to lean on sklearn more; if we do, we won't have to spend time building the algorithms ourselves.
When we begin, it may be helpful to create a test file just to explore sklearn like in the previous chapter. We already have some generated data at https://github.com/jcbozonier/Machine-Learning-Test-by-Test/blob/master/Chapter%209/fake_data.json.
The Logistic Regression model in sklearn is only helpful if we can use it to get at the probability that someone will order...
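A minimal sketch of getting at that probability, using toy data invented for the example: sklearn's `LogisticRegression` exposes a `predict_proba` method that returns class probabilities rather than hard 0/1 labels.

```python
# Minimal sketch: predict_proba gives probabilities, not hard labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: a single feature where larger values mean "ordered" (1).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(np.array([[2.5]]))
# probs[0] is [P(no order), P(order)]; the two entries sum to 1.
```

It's the second column, `probs[:, 1]`, that our harness would feed into its expected-value calculation.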
We've gone from machine learning algorithms like linear regression, to difficult-to-wield tools like random forests, to building our own Gaussian Naïve Bayes tool. We've built a custom platform that helps us identify the models that perform the best, and choose the best one after it has churned through everything.
We've also taken a very pragmatic approach to TDD. There are better things that we could be doing, but these will improve over time. The most important tip to keep in mind is to ask yourself why your tests are getting in the way (when they do), and how you can make your tests run faster whenever they start to run too slowly.
In this chapter, we modeled a somewhat complex dataset in order to optimize the money spent on a given ad campaign. At the beginning of this book, I foreshadowed that we would discuss measuring machine learning in terms of profit; this is a great example of that. By combining multiple techniques, we can create models suited to solving real-world problems. On top of this, we saw a few more ways of working with sklearn that keep your code from being tightly coupled to it.
Moving on from here, you can expect to spend less time manually implementing machine learning algorithms and more time learning to use sklearn's built-in models. We haven't even tapped sklearn's pipeline features, nor its wide array of tunable parameters for the many machine learning algorithms that it supports. Most of sklearn's classification models can provide you with the probability of a given classification. As we saw in this chapter, this can be a powerful tool when combined with...