Learn Python by Building Data Science Applications

You're reading from Learn Python by Building Data Science Applications

Product type: Book
Published in: Aug 2019
Publisher: Packt
ISBN-13: 9781789535365
Pages: 482
Edition: 1st
Authors (2): Philipp Kats, David Katz

Table of Contents (26 chapters)

Preface

Section 1: Getting Started with Python
  Preparing the Workspace
  First Steps in Coding - Variables and Data Types
  Functions
  Data Structures
  Loops and Other Compound Statements
  First Script – Geocoding with Web APIs
  Scraping Data from the Web with Beautiful Soup 4
  Simulation with Classes and Inheritance
  Shell, Git, Conda, and More – at Your Command

Section 2: Hands-On with Data
  Python for Data Applications
  Data Cleaning and Manipulation
  Data Exploration and Visualization
  Training a Machine Learning Model
  Improving Your Model – Pipelines and Experiments

Section 3: Moving to Production
  Packaging and Testing with Poetry and PyTest
  Data Pipelines with Luigi
  Let's Build a Dashboard
  Serving Models with a RESTful API
  Serverless API Using Chalice
  Best Practices and Python Performance

Assessments
Other Books You May Enjoy

Improving Your Model – Pipelines and Experiments

In the previous chapter, we trained a basic machine learning (ML) model. However, most real-world scenarios require models to be accurate, and that means the model and features need to be improved and fine-tuned for a specific task. This process is usually long, iterative, and based on trial and error.

In this chapter, we will see how to improve and validate model quality and keep track of all of the experiments along the way. By the end, we will have a better-performing model and a workflow for tracking experiments and logging metrics and parameters. In particular, we'll learn the following:

  • Understanding cross-validation and overfitting
  • Adding features in order to improve models
  • Wrapping models and transformations into pipelines
  • Version control of our datasets and metrics using the dvc package
...

Technical requirements

Understanding cross-validation

In the previous chapter, we built a model with certain assumptions and settings, measuring its performance with an accuracy metric (the overall ratio of correctly classified labels). To do this, we split our data randomly into training and testing sets. While that approach is fundamental, it has its problems. Most importantly, we may fine-tune our model to gain better performance on that particular test set at the expense of performance on new, unseen data (in other words, the model gets worse overall while its metric on this specific dataset improves). This phenomenon is called overfitting.
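For reference, here is a minimal sketch of that single-split approach, assuming a pandas DataFrame named data, the feature list cols, and a hypothetical 'result' target column (the decision tree classifier and split settings are illustrative, not the book's exact code):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# one random split: convenient, but easy to over-tune against this particular test set
X_train, X_test, y_train, y_test = train_test_split(
    data[cols], data['result'], test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=10, random_state=42)
model.fit(X_train, y_train)

# accuracy on a single, fixed test set - exactly the number we risk overfitting to
print(accuracy_score(y_test, model.predict(X_test)))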

To combat this issue, we'll use a slightly more complex approach: cross-validation. In its basic form, cross-validation splits the data into multiple so-called folds, or data subsections. Usually, each fold has approximately the same size and can be further balanced by the target variable...
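The rest of this discussion is abridged above, but as a rough sketch of the idea, stratified k-fold cross-validation with scikit-learn could look like the following (same assumed data, cols, and 'result' column as before; the number of folds is arbitrary):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=10, random_state=42)

# four folds of roughly equal size, balanced by the target variable
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, data[cols], data['result'], cv=cv, scoring='accuracy')

# one accuracy value per fold; the spread hints at how stable the model really is
print(scores, scores.mean())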

Exploring feature engineering

Now that we have a system to fairly compare models without fear of overfitting, let's think about how we can improve the model itself. One way is to create new features of our own that add more context, for example, the ratio of armies on the two sides or the absolute difference in the number of soldiers; we can't say in advance which will work better. Let's try it out with the help of the following code:

  1. First, we'll create a ratio of soldiers on either side:

     data['infantry_ratio'] = data['allies_infantry'] / data['axis_infantry']
     cols.append('infantry_ratio')

  2. Now, we won't do that for tanks, planes, and so on, as the numbers here are very small and we'll have to deal with division by zero. Instead, we...
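(The remaining steps are abridged above.) Purely as an illustration of the alternative mentioned earlier, an absolute-difference feature could be added in the same way; the infantry_diff name here is hypothetical, not necessarily what the chapter uses:

# absolute difference in infantry numbers - an alternative to the ratio feature above
data['infantry_diff'] = (data['allies_infantry'] - data['axis_infantry']).abs()
cols.append('infantry_diff')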

Optimizing the hyperparameters

There are probably a lot of other features we could add, but let's now shift our attention to the model itself. So far, we have assumed the model's default, static parameters, restricting only its max_depth parameter to an arbitrary value. Now, let's try to fine-tune those parameters. If done properly, this process can add a few percentage points to the model's accuracy, and sometimes even a small gain in a performance metric can be a game-changer.

To do this, we'll use RandomizedSearchCV: another wrapper around the concept of cross-validation, but this time one that also iterates over parameters of the model, trying to find the optimal ones. A simpler approach, GridSearchCV, takes a finite set of values for each parameter, generates all of their combinations, and runs them all iteratively; essentially, a brute-force approach.

Randomized...
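The details of the randomized search are abridged above, but a minimal sketch of how such a search could be wired up looks like the following (the parameter distributions, number of iterations, and fold count are illustrative assumptions, not the book's exact settings):

from scipy import stats
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# sample candidate values from distributions instead of enumerating a full grid
param_distributions = {
    'max_depth': stats.randint(2, 20),
    'min_samples_leaf': stats.randint(1, 10),
    'criterion': ['gini', 'entropy'],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,           # number of random parameter combinations to try
    cv=4,                # cross-validation folds for each combination
    scoring='accuracy',
    random_state=42,
)
search.fit(data[cols], data['result'])

# best combination found and its mean cross-validated accuracy
print(search.best_params_, search.best_score_)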

Tracking your data and metrics with version control

As with all ML projects, there is always room for improvement, especially once we converge on the actual use case. But let's switch gears and talk about the technical side of things.

As you probably noticed, throughout this chapter we had to iterate constantly, adding and removing features from the data and changing the model's settings. And, as we mentioned, only about one-third of the initial experiments made it into this book. That is manageable for a toy dataset like ours, but eventually we could be swamped by different versions and iterations of the model.

In Chapter 9, Shell, Git, Conda, and More – at Your Command, we learned about Git, a system that stores versions of code so that you can safely switch back to a previous version or even keep working on different versions of...

Summary

Over the course of this chapter, we worked iteratively on improving the machine learning model we built in Chapter 13, Training a Machine Learning Model—adding features and tuning it to achieve maximum performance. As the code and iterations get more complex and multiple trial-and-error attempts are required, it is important to keep track of your research. Therefore, we further discussed how to keep track of not only the code but also data and metrics, making sure we can always switch back and reproduce any of the previous versions.

In the next chapter, we'll take another stab at our Wikipedia scraping code, building it into an independent Python library you could share with your friends and colleagues. Throughout the rest of this book, we will focus on different ways of delivering our code as a product to the client—as a standalone package, scheduled...

Questions

  1. What is overfitting?
  2. Why should we use cross-validation?
  3. Why can it be bad if our metrics are improving on the test set?
  4. Which features are useful for improving model performance on cross-validation?
  5. Why do some features decrease the performance of a decision tree on test data or in cross-validation?
  6. What is the difference between the random search and grid search algorithms for parameter optimization?
  7. Why is Git not sufficient for data version control?
  8. What are the alternatives to DVC for data version control and experimentation logging?

Further reading
