Linear Regression in R

In this chapter, we will introduce linear regression, a fundamental statistical approach used to model the relationship between a target variable and one or more explanatory (also called independent) variables. We will cover the basics of linear regression, starting with simple linear regression and then extending the concepts to multiple linear regression. We will learn how to estimate the model coefficients, evaluate the goodness of fit, and test the significance of the coefficients using hypothesis testing. Additionally, we will discuss the assumptions underlying linear regression and explore techniques to address potential issues, such as nonlinearity, interaction effects, multicollinearity, and heteroskedasticity. We will also introduce two widely used regularization techniques: the ridge and Least Absolute Shrinkage and Selection Operator (lasso) penalties.

By the end of this chapter, you will learn the core principles of linear regression...

Introducing linear regression

At the core of linear regression is the concept of fitting a straight line – or more generally, a hyperplane – to the data points. Such fitting aims to minimize the deviation between the observed and predicted values. In simple linear regression, one target variable is regressed on a single predictor, and the goal is to fit a straight line that best captures the relationship between the two variables. In multiple linear regression, there is more than one predictor, and the goal is to fit a hyperplane that best describes the relationship among the variables. Both tasks can be achieved by minimizing a measure of deviation between the predictions and the corresponding targets.

In linear regression, obtaining an optimal model means identifying the best coefficients that define the relationship between the target variable and the input predictors. These coefficients represent the change in the target associated with a single unit change...
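As a quick illustrative sketch (not part of the original text), base R's lm() function fits both model types; the built-in mtcars dataset and the predictors chosen below are assumptions made purely for demonstration:

# Simple linear regression: model fuel efficiency (mpg) on a single predictor (wt)
slr_fit <- lm(mpg ~ wt, data = mtcars)
summary(slr_fit)   # coefficient estimates, standard errors, R-squared, and t-tests

# Multiple linear regression: add horsepower (hp) as a second predictor
mlr_fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(mlr_fit)      # intercept plus one coefficient per predictor

Each coefficient reported by coef(mlr_fit) estimates the change in the target associated with a one-unit change in the corresponding predictor, holding the other predictors fixed.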

Introducing penalized linear regression

Penalized regression models, such as ridge and lasso, are techniques that are used to handle problems such as multicollinearity, reduce overfitting, and even perform variable selection, especially when dealing with high-dimensional data with multiple input features.

Ridge regression (also called L2 regularization) is a method that adds a penalty equivalent to the square of the magnitude of coefficients. We would add this term to the loss function after weighting it by an additional hyperparameter, often denoted as λ, to control the strength of the penalty term.

Lasso regression (L1 regularization), on the other hand, is a method that, similar to ridge regression, adds a penalty for non-zero coefficients, but unlike ridge regression, it can force some coefficients to be exactly equal to zero when the penalty tuning parameter is large enough. The larger the value of the hyperparameter, λ, the greater the amount of shrinkage. The...
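As a hedged sketch of how both penalties are commonly fitted in R (assuming the glmnet package is installed; the data and the fixed λ value below are illustrative rather than taken from the chapter):

library(glmnet)

# Illustrative predictor matrix x and numeric response y built from the built-in mtcars data
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat")])
y <- mtcars$mpg

# In glmnet, alpha = 0 applies the ridge (L2) penalty and alpha = 1 the lasso (L1) penalty;
# lambda controls the strength of the penalty term
ridge_fit <- glmnet(x, y, alpha = 0, lambda = 1)
lasso_fit <- glmnet(x, y, alpha = 1, lambda = 1)

coef(ridge_fit)   # coefficients shrunk toward zero, but typically all non-zero
coef(lasso_fit)   # some coefficients may be exactly zero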

Working with ridge regression

Ridge regression, also referred to as L2 regularization, is a commonly used technique for alleviating overfitting in linear regression models by penalizing the magnitude of the estimated coefficients.

Recall that in a linear regression model fitted by ordinary least squares (OLS), we seek to minimize the sum of the squared differences between the predicted and actual values, which we refer to as the least squares method. The loss function we wish to minimize is the residual sum of squares (RSS):

\text{RSS} = \sum_{i=1}^{n} \left( y_i - \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) \right)^2

Here, y_i is the actual target value, β_0 is the intercept term, the β_j values are the coefficient estimates for each predictor x_ij, and the summations run over all observations and predictors, respectively.

Purely minimizing the RSS can give us an overfitted model, often reflected in the large magnitude of the resulting coefficients. As a remedy, we could apply...
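For reference, the standard ridge objective adds the squared (L2) coefficient penalty, weighted by λ, to the RSS, mirroring the lasso cost function shown in the next section:

L_{\text{ridge}} = \text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2

Larger values of λ shrink the coefficients more aggressively toward zero, while λ = 0 recovers the ordinary least squares solution.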

Working with lasso regression

Lasso regression is another type of regularized linear regression. It is similar to ridge regression but differs in how the magnitude of the coefficients enters the penalty. Specifically, it uses the L1 norm of the coefficients, that is, the sum of their absolute values, as the penalty that is added to the OLS loss function.

The lasso regression cost function can be written as follows:

L_{\text{lasso}} = \text{RSS} + \lambda \sum_{j=1}^{p} \left| \beta_j \right|

The key characteristic of lasso regression is that it can reduce some coefficients exactly to 0, effectively performing variable selection. This is a consequence of the L1 penalty term and is not the case for ridge regression, which can only shrink coefficients close to 0. Therefore, lasso regression is particularly useful when we believe that only a subset of the predictors matters when it comes to predicting the outcome.
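As a brief sketch of this variable-selection behavior (assuming the glmnet package; the dataset and object names below are illustrative), cross-validation is typically used to choose λ before inspecting which coefficients have been driven to zero:

library(glmnet)

# Illustrative data: all mtcars columns as predictors, fuel efficiency (mpg) as the response
x <- model.matrix(mpg ~ . - 1, data = mtcars)   # predictor matrix without an intercept column
y <- mtcars$mpg

set.seed(123)                                   # cross-validation folds are drawn at random
cv_fit <- cv.glmnet(x, y, alpha = 1)            # alpha = 1 selects the lasso (L1) penalty
best_lambda <- cv_fit$lambda.min                # lambda with the smallest cross-validated error

lasso_fit <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(lasso_fit)                                 # predictors with a zero coefficient are effectively dropped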

In addition...

Summary

In this chapter, we covered the nuts and bolts of the linear regression model. We started by introducing the SLR model, which consists of only one input variable and one target variable, and then extended it to the MLR model with two or more predictors. Both models can be assessed using R^2 or, preferably, the adjusted R^2 metric. Next, we discussed specific scenarios, such as working with categorical variables and interaction terms, handling nonlinear terms via transformations, working with the closed-form solution, and dealing with multicollinearity and heteroskedasticity. Lastly, we introduced widely used regularization techniques, namely the ridge and lasso penalties, which are incorporated into the loss function as a penalty term to produce a regularized model and, in the case of lasso regression, a sparse solution.

In the next chapter, we will cover another type of widely used linear model: the logistic regression model.
