
You're reading from Machine Learning with PyTorch and Scikit-Learn

Product type: Book
Published in: Feb 2022
Publisher: Packt
ISBN-13: 9781801819312
Edition: 1st Edition
Authors (3):
Sebastian Raschka

Sebastian Raschka is an Assistant Professor of Statistics at the University of Wisconsin-Madison focusing on machine learning and deep learning research. As Lead AI Educator at Grid AI, Sebastian plans to continue following his passion for helping people get into machine learning and artificial intelligence.

Yuxi (Hayden) Liu

Yuxi (Hayden) Liu was a Machine Learning Software Engineer at Google. Drawing on his experience as a machine learning scientist, he has applied his ML expertise across data-driven domains such as computational advertising, cybersecurity, and information retrieval. He is the author of a series of influential machine learning books and an education enthusiast. His debut book, the first edition of Python Machine Learning by Example, was ranked the #1 bestseller on Amazon and has been translated into many different languages.

Vahid Mirjalili

Vahid Mirjalili is a deep learning researcher focusing on computer vision (CV) applications. Vahid received a Ph.D. degree in both Mechanical Engineering and Computer Science from Michigan State University.


Predicting Continuous Target Variables with Regression Analysis

Throughout the previous chapters, you learned a lot about the main concepts behind supervised learning and trained many different models for classification tasks to predict group memberships or categorical variables. In this chapter, we will dive into another subcategory of supervised learning: regression analysis.

Regression models are used to predict target variables on a continuous scale, which makes them attractive for addressing many questions in science. They also have applications in industry, such as understanding relationships between variables, evaluating trends, or making forecasts. One example is predicting the sales of a company in future months.

In this chapter, we will discuss the main concepts of regression models and cover the following topics:

  • Exploring and visualizing datasets
  • Looking at different approaches to implementing linear regression models
  • Training regression...

Introducing linear regression

The goal of linear regression is to model the relationship between one or multiple features and a continuous target variable. In contrast to classification—a different subcategory of supervised learning—regression analysis aims to predict outputs on a continuous scale rather than categorical class labels.

In the following subsections, you will be introduced to the most basic type of linear regression, simple linear regression, and understand how to relate it to the more general case of multiple linear regression (linear regression with multiple features).
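For orientation, the single-feature model introduced next generalizes to m features in the standard way:

$y = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b = \mathbf{w}^{\mathsf{T}} \mathbf{x} + b$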

Simple linear regression

The goal of simple (univariate) linear regression is to model the relationship between a single feature (explanatory variable, x) and a continuous-valued target (response variable, y). The equation of a linear model with one explanatory variable is defined as follows:

$y = w_1 x + b$

Here, the parameter (bias unit), $b$, represents the y-axis intercept and $w_1$ is the weight coefficient...

Exploring the Ames Housing dataset

Before we implement the first linear regression model, we will discuss a new dataset, the Ames Housing dataset, which contains information about individual residential properties in Ames, Iowa, from 2006 to 2010. The dataset was collected by Dean De Cock in 2011, and additional information is available in his accompanying documentation and report.

As with any new dataset, it is always helpful to explore the data through simple visualizations to get a better feel for what we are working with, which is what we will do in the following subsections.

Loading the Ames Housing dataset into a DataFrame

In this section, we will load the Ames Housing dataset...
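As a minimal sketch of how this loading step might look: the URL below points to the tab-separated file published alongside De Cock's paper (an assumed location), and the column subset is an illustrative selection of five features plus the target rather than the book's exact listing.

import pandas as pd

# Illustrative subset of columns: five features plus the sale price target
columns = ['Overall Qual', 'Overall Cond', 'Gr Liv Area',
           'Central Air', 'Total Bsmt SF', 'SalePrice']

# Tab-separated source file accompanying De Cock's 2011 paper (assumed location)
url = 'http://jse.amstat.org/v19n3/decock/AmesHousing.txt'
df = pd.read_csv(url, sep='\t', usecols=columns)

# Encode 'Central Air' as 0/1 integers and drop rows with missing values
df['Central Air'] = df['Central Air'].map({'N': 0, 'Y': 1})
df = df.dropna(axis=0)
print(df.shape)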

Implementing an ordinary least squares linear regression model

At the beginning of this chapter, we mentioned that linear regression can be understood as obtaining the best-fitting straight line through the examples of our training data. However, we have neither defined the term best-fitting nor discussed the different techniques of fitting such a model. In the following subsections, we will fill in the missing pieces of this puzzle using the ordinary least squares (OLS) method (sometimes also called linear least squares) to estimate the parameters of the linear regression line that minimize the sum of the squared vertical distances (residuals or errors) to the training examples.
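Concretely, OLS chooses the parameters that minimize the mean squared error (MSE) loss, written here in the notation of the simple linear regression model above (a standard formulation):

$L(w_1, b) = \frac{1}{2n}\sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$, with $\hat{y}^{(i)} = w_1 x^{(i)} + b$

The factor of 1/2 is a common convention that simplifies the gradient and does not change the minimizing parameters.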

Solving regression for regression parameters with gradient descent

Consider our implementation of the Adaptive Linear Neuron (Adaline) from Chapter 2, Training Simple Machine Learning Algorithms for Classification. You will remember that the artificial neuron uses a linear activation function...
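To make the connection to Adaline concrete, the following is a minimal sketch of a full-batch gradient descent linear regression estimator; the class name, learning rate, and iteration count are illustrative choices, not the book's exact listing. Note that eta=0.1 is reasonable for standardized features; unscaled inputs would typically need a much smaller learning rate.

import numpy as np

class LinearRegressionGD:
    """Linear regression fitted by full-batch gradient descent on the MSE loss."""
    def __init__(self, eta=0.1, n_iter=50, random_state=1):
        self.eta = eta                # learning rate
        self.n_iter = n_iter          # passes over the training set
        self.random_state = random_state

    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        # Initialize weights with small random values; the bias starts at zero
        self.w_ = rgen.normal(loc=0.0, scale=0.01, size=X.shape[1])
        self.b_ = 0.0
        self.losses_ = []
        for _ in range(self.n_iter):
            output = self.net_input(X)
            errors = y - output
            # Gradients of the MSE loss with respect to weights and bias
            self.w_ += self.eta * 2.0 * X.T.dot(errors) / X.shape[0]
            self.b_ += self.eta * 2.0 * errors.mean()
            self.losses_.append((errors ** 2).mean())
        return self

    def net_input(self, X):
        return np.dot(X, self.w_) + self.b_

    def predict(self, X):
        # For regression, the linear activation is the prediction itself
        return self.net_input(X)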

Fitting a robust regression model using RANSAC

Linear regression models can be heavily impacted by the presence of outliers. In certain situations, a very small subset of our data can have a big effect on the estimated model coefficients. Many statistical tests can be used to detect outliers, but these are beyond the scope of this book. However, removing outliers always requires our own judgment as data scientists, as well as our domain knowledge.

As an alternative to throwing out outliers, we will look at a robust method of regression using the RANdom SAmple Consensus (RANSAC) algorithm, which fits a regression model to a subset of the data, the so-called inliers.

We can summarize the iterative RANSAC algorithm as follows (a scikit-learn sketch follows the list):

  1. Select a random number of examples to be inliers and fit the model.
  2. Test all other data points against the fitted model and add those points that fall within a user-given tolerance to the inliers.
  3. Refit the model using all inliers.
  4. Estimate the error of the fitted model against the inliers.
  5. Terminate the algorithm if the performance meets a user-defined threshold or if a fixed number of iterations has been reached; otherwise, go back to step 1.
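As a sketch of how this looks with scikit-learn's RANSACRegressor (the parameter values are illustrative, and X and y are assumed to hold a single explanatory feature and the target, as in a simple regression setting):

>>> from sklearn.linear_model import LinearRegression, RANSACRegressor
>>> ransac = RANSACRegressor(
...     LinearRegression(),
...     max_trials=100,           # maximum number of random-sampling iterations
...     min_samples=0.95,         # fraction of examples drawn as candidate inliers
...     residual_threshold=None,  # default: median absolute deviation of y
...     random_state=123)
>>> ransac.fit(X, y)
>>> inlier_mask = ransac.inlier_mask_   # boolean mask of the final inliers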

Evaluating the performance of linear regression models

In the previous section, you learned how to fit a regression model on training data. However, you discovered in previous chapters that it is crucial to test the model on data that it hasn’t seen during training to obtain a more unbiased estimate of its generalization performance.

As you may remember from Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we want to split our dataset into separate training and test datasets, using the former to fit the model and the latter to evaluate its performance on unseen data as an estimate of the generalization performance. Instead of proceeding with the simple regression model, we will now use all five features in the dataset and train a multiple regression model:

>>> from sklearn.model_selection import train_test_split
>>> target = 'SalePrice'
>>> features = df.columns[df.columns != target]
>>> X = df[features].values
>>> y = df[target].values
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=123)  # 70/30 split (illustrative)
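Building on this split, a minimal evaluation sketch using scikit-learn's metrics (the exact numbers will depend on the data and the split):

>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> y_train_pred = slr.predict(X_train)
>>> y_test_pred = slr.predict(X_test)
>>> print(f'MSE train: {mean_squared_error(y_train, y_train_pred):.2f}')
>>> print(f'MSE test: {mean_squared_error(y_test, y_test_pred):.2f}')
>>> print(f'R^2 train: {r2_score(y_train, y_train_pred):.2f}')
>>> print(f'R^2 test: {r2_score(y_test, y_test_pred):.2f}')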

Using regularized methods for regression

As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-Learn, regularization is one approach to tackling the problem of overfitting by adding additional information to induce a penalty against model complexity, thereby shrinking the parameter values of the model. The most popular approaches to regularized linear regression are the so-called ridge regression, least absolute shrinkage and selection operator (LASSO), and elastic net.

Ridge regression is an L2 penalized model where we simply add the squared sum of the weights to the MSE loss function:

$L(\mathbf{w}, b)_{\text{Ridge}} = \sum_{i=1}^{n}\left(y^{(i)} - \hat{y}^{(i)}\right)^2 + \lambda \lVert \mathbf{w} \rVert_2^2$

Here, the L2 term is defined as follows:

$\lambda \lVert \mathbf{w} \rVert_2^2 = \lambda \sum_{j=1}^{m} w_j^2$

By increasing the value of the hyperparameter $\lambda$, we increase the regularization strength and thereby shrink the weights of our model. Please note that, as mentioned in Chapter 3, the bias unit $b$ is not regularized.

An alternative approach that can lead to sparse models is LASSO. Depending on the regularization strength...
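In scikit-learn, all three regularized variants share the familiar estimator interface; a brief sketch follows (the alpha values are illustrative and would normally be tuned, for example via cross-validation):

>>> from sklearn.linear_model import Ridge, Lasso, ElasticNet
>>> ridge = Ridge(alpha=1.0)   # alpha plays the role of the regularization strength
>>> lasso = Lasso(alpha=1.0)   # L1 penalty; can drive some weights exactly to zero
>>> # l1_ratio balances the L1 and L2 penalties (0.5 weights them equally)
>>> elanet = ElasticNet(alpha=1.0, l1_ratio=0.5)
>>> ridge.fit(X_train, y_train)
>>> y_pred = ridge.predict(X_test)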

Turning a linear regression model into a curve – polynomial regression

In the previous sections, we assumed a linear relationship between explanatory and response variables. One way to account for a violation of the linearity assumption is to use a polynomial regression model by adding polynomial terms:

$y = w_1 x + w_2 x^2 + \dots + w_d x^d + b$

Here, $d$ denotes the degree of the polynomial. Although we can use polynomial regression to model a nonlinear relationship, it is still considered a multiple linear regression model because the model remains linear in the regression coefficients, $\mathbf{w}$. In the following subsections, we will see how we can conveniently add such polynomial terms to an existing dataset and fit a polynomial regression model.

Adding polynomial terms using scikit-learn

We will now learn how to use the PolynomialFeatures transformer class from scikit-learn to add a quadratic term (d = 2) to a simple regression problem with one explanatory variable. Then, we will compare the polynomial to the linear fit by...
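The workflow looks roughly as follows; this is a minimal sketch on made-up toy data (the values are purely illustrative):

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(1, 11, dtype=float)[:, np.newaxis]   # single feature
>>> y = 0.5 * X.ravel()**2 + X.ravel() + 2.0           # quadratic relationship
>>> quadratic = PolynomialFeatures(degree=2)
>>> X_quad = quadratic.fit_transform(X)    # columns: [1, x, x^2]
>>> pr = LinearRegression()
>>> pr.fit(X_quad, y)
>>> print(pr.coef_)   # weights for the bias column, x, and x^2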

Dealing with nonlinear relationships using random forests

In this section, we are going to look at random forest regression, which is conceptually different from the previous regression models in this chapter. A random forest, which is an ensemble of multiple decision trees, can be understood as the sum of piecewise linear functions, in contrast to the global linear and polynomial regression models that we discussed previously. In other words, via the decision tree algorithm, we subdivide the input space into smaller regions that become more manageable.
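As a preview of the implementation side, a random forest regressor in scikit-learn follows the familiar estimator interface; the hyperparameter values below are illustrative choices:

>>> from sklearn.ensemble import RandomForestRegressor
>>> forest = RandomForestRegressor(
...     n_estimators=100,            # number of trees in the ensemble
...     criterion='squared_error',   # split quality measured by MSE reduction
...     random_state=1,
...     n_jobs=-1)                   # use all available CPU cores
>>> forest.fit(X_train, y_train)
>>> y_pred = forest.predict(X_test)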

Decision tree regression

An advantage of the decision tree algorithm is that it works with arbitrary features and does not require any feature transformation when we are dealing with nonlinear data, because decision trees analyze one feature at a time rather than taking weighted combinations into account. (Likewise, normalizing or standardizing features is not required for decision trees.) As mentioned in Chapter 3, A...
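A minimal sketch of fitting a depth-limited regression tree on a single feature (max_depth=3 is an illustrative choice; deeper trees can easily overfit):

>>> from sklearn.tree import DecisionTreeRegressor
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)   # X and y as in the earlier single-feature examples
>>> y_pred = tree.predict(X)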

Summary

At the beginning of this chapter, you learned about simple linear regression analysis to model the relationship between a single explanatory variable and a continuous response variable. We then discussed a useful exploratory data analysis technique for looking at patterns and anomalies in data, which is an important first step in predictive modeling tasks.

We built our first model by implementing linear regression using a gradient-based optimization approach. You then saw how to utilize scikit-learn's linear models for regression and also implement a robust regression technique (RANSAC) as an approach for dealing with outliers. To assess the predictive performance of regression models, we computed the mean squared error (MSE) and the related R² metric. Furthermore, we also discussed a useful graphical approach for diagnosing problems in regression models: the residual plot.

After we explored how regularization can be applied to regression models to reduce...

