You're reading from Data Science Projects with Python - Second Edition

Published: July 2021 · Reading level: Intermediate · Publisher: Packt · ISBN-13: 9781800564480 · Edition: 2nd

Author: Stephen Klosterman

Stephen Klosterman is a Machine Learning Data Scientist with a background in math, environmental science, and ecology. His education includes a Ph.D. in Biology from Harvard University, where he was an assistant teacher of the Data Science course. His professional experience includes work in the environmental, health care, and financial sectors. At work, he likes to research and develop machine learning solutions that create value, and that stakeholders understand. In his spare time, he enjoys running, biking, paddleboarding, and music.

4. The Bias-Variance Trade-Off

Overview

In this chapter, we'll cover the remaining elements of logistic regression, including what happens when you call .fit to train the model and the statistical assumptions you should be aware of when using this modeling technique. You will learn how to use L1 and L2 regularization with logistic regression to prevent overfitting, and how to use cross-validation to choose the regularization strength. After reading this chapter, you will be able to use logistic regression in your work and employ regularization during model fitting to take advantage of the bias-variance trade-off and improve model performance on unseen data.

Introduction

In this chapter, we will fill in the details of logistic regression left over from the previous chapter. In addition to being able to use scikit-learn to fit logistic regression models, you will gain insight into the gradient descent procedure, which is similar to the processes used "under the hood" (invisible to the user) to accomplish model fitting in scikit-learn. Finally, we'll complete our discussion of the logistic regression model by familiarizing ourselves with the formal statistical assumptions of this method.

We begin our exploration of the foundational machine learning concepts of overfitting, underfitting, and the bias-variance trade-off by examining how the logistic regression model can be extended to address the overfitting problem. After reviewing the mathematical details of the regularization methods that are used to alleviate overfitting, you will learn a useful practice for tuning the hyperparameters of regularization...

Estimating the Coefficients and Intercepts of Logistic Regression

In the previous chapter, we learned that the coefficients of a logistic regression model (one for each feature), as well as the intercept, are determined using the training data when the .fit method is called on a logistic regression model in scikit-learn. These numbers are called the parameters of the model, and the process of finding the best values for them is called parameter estimation. Once the parameters are found, the logistic regression model is essentially a finished product: with just these numbers, we can use the model in any environment that supports common mathematical functions.
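To make the "finished product" point concrete, here is a minimal sketch (using illustrative synthetic data, not the book's case study) showing that the fitted coefficients and intercept are all that is needed to reproduce the model's predicted probabilities with ordinary arithmetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data (not the case study dataset)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)

model = LogisticRegression()
model.fit(X, y)  # parameter estimation happens here

# The fitted parameters: one coefficient per feature, plus an intercept
w = model.coef_[0]
b = model.intercept_[0]

# Reproduce predict_proba with ordinary math: the sigmoid of a
# linear combination of the features and the parameters
manual_proba = 1 / (1 + np.exp(-(X @ w + b)))
assert np.allclose(manual_proba, model.predict_proba(X)[:, 1])
```

The assertion passes because the predicted probability is exactly the sigmoid of the linear combination of features, coefficients, and intercept.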

It is clear that the process of parameter estimation is important, since this is how we can make a predictive model from our data. So, how does parameter estimation work? To understand this, the first step is to familiarize ourselves with the concept of a cost function. A cost...
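The paragraph is cut off here, but the core idea can be sketched directly: the cost function measures how poorly the current parameter values fit the training data, and gradient descent repeatedly adjusts the parameters to reduce that cost. The following is a minimal illustration of gradient descent on the log-loss cost for logistic regression (an illustration, not the book's code; the function and variable names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss_cost(w, b, X, y):
    """Cost function: mean negative log-likelihood of the training labels."""
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_by_gradient_descent(X, y, learning_rate=0.1, n_steps=1000):
    """Estimate parameters by repeatedly stepping opposite the cost gradient."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        # For log loss, the gradient works out to X.T @ (p - y) / n for w
        # and mean(p - y) for b
        error = sigmoid(X @ w + b) - y
        w -= learning_rate * (X.T @ error) / len(y)
        b -= learning_rate * error.mean()
    return w, b
```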

Cross-Validation: Choosing the Regularization Parameter

By now, you may suspect that we could use regularization to reduce the overfitting we observed when we tried to model the synthetic data in Exercise 4.02, Generating and Modeling Synthetic Classification Data. The question is, how do we choose the regularization parameter, C? C is an example of a model hyperparameter. Hyperparameters are different from the parameters that are estimated when a model is trained, such as the coefficients and the intercept of a logistic regression. Rather than being estimated by an automated procedure like the parameters are, hyperparameters are input directly by the user as keyword arguments, typically when instantiating the model class. So, how do we know what values to choose?

Hyperparameters are more difficult to estimate than parameters. This is because it is up to the data scientist to determine what the best value is, as opposed to letting an optimization algorithm find it. However...
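Although the text is truncated here, the approach the section builds toward is to try several values of C and compare out-of-sample performance across cross-validation folds. The sketch below is illustrative (the synthetic data and the candidate C values are assumptions, not the book's exercise code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for Exercise 4.02's dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Score each candidate value of C on held-out folds; note that in
# scikit-learn, smaller C means stronger regularization
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, solver='liblinear')
    scores = cross_val_score(model, X, y, cv=4, scoring='roc_auc')
    print(f'C={C}: mean CV ROC AUC = {scores.mean():.3f}')
```

The value of C with the best average held-out score is the one we would expect to generalize best to unseen data.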

Summary

In this chapter, we introduced the final details of logistic regression and continued to learn how to use scikit-learn to fit logistic regression models. We gained more visibility into how model fitting works by learning about the concept of a cost function, which gradient descent minimizes to estimate the parameters.

We also motivated the need for regularization by introducing the concepts of underfitting and overfitting. To reduce overfitting, we saw how to adjust the cost function so that it regularizes the coefficients of a logistic regression model with an L1 or L2 penalty, and we used cross-validation to tune the regularization hyperparameter. To reduce underfitting, we saw how to do some simple feature engineering with interaction features for the case study data, as sketched below.
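As a pointer back to that feature engineering step, interaction features of the kind mentioned above can be generated with scikit-learn's PolynomialFeatures; this is a generic sketch, not the case study code:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical two-feature matrix standing in for the case study data
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# interaction_only=True adds products of distinct features (here, x1*x2)
# without squared terms; include_bias=False drops the constant column
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
X_with_interactions = interactions.fit_transform(X)
print(X_with_interactions)  # columns: x1, x2, x1*x2
```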

We are now familiar with some of the most important concepts in machine learning. We have, so far, only used...
