Linear Regression

We will cover the following recipes in this chapter:

  • Computing ordinary least squares estimates
  • Reporting results with the sjPlot package
  • Finding correlation between the features
  • Testing hypothesis
  • Testing homoscedasticity
  • Implementing sandwich estimators
  • Variable selection
  • Ridge regression
  • Working with LASSO
  • Leverage, residuals, and influence

Introduction

Linear regression is perhaps the most important tool in statistics. It can be used in a wide array of situations, and it can be extended to handle many cases where the basic formulation does not apply. Conceptually, the idea is to model a dependent variable in terms of a set of independent variables and capture coefficients that relate each independent variable to the dependent one. The usual formula here is as follows (assuming that we have one variable and an intercept):

yi = α + β*xi + ui

Here, the beta (β) and the intercept (α) are the coefficients that we need to find, xi is the independent variable, ui is an unobserved residual, and yi is the target variable. The previous formula can naturally be extended to multiple variables; in that case, we would have multiple β coefficients.

Maybe the most important aspect of linear regression is that we can do very simple yet powerful interpretations...

Computing ordinary least squares estimates

Ordinary least squares estimates are derived from minimizing the sum of the squared residuals. It can be proven that this minimization leads to β̂ = (X'X)⁻¹X'y. It should be noted that we need to compute the inverse of X'X, and that can only be done if its determinant is different from zero. The determinant will be zero if there is a linear dependency between the variables.

It can also be proven that the beta coefficients are distributed according to a Gaussian distribution, with variances equal to the diagonal elements of σ̂²(X'X)⁻¹, where σ̂ is the estimated residual standard error.

How to do it...

In this exercise, we will simulate some data, and compute the estimates using both the lm function and doing the...
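As a sketch of what this involves (not the book's exact code), we can simulate a simple dataset, fit it with lm(), and reproduce the coefficients and standard errors with the matrix formulas above:

```r
# A sketch, not the book's exact code: simulate data, fit with lm(), and
# reproduce the coefficients and standard errors with the matrix formulas.
set.seed(10)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                  # true intercept = 1, true slope = 2

fit <- lm(y ~ x)                           # estimates via lm()
coef(fit)

X        <- cbind(1, x)                    # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat                                   # should match coef(fit)

resid_hat <- y - X %*% beta_hat
sigma2    <- sum(resid_hat^2) / (n - ncol(X))
sqrt(diag(sigma2 * solve(t(X) %*% X)))     # standard errors, as in summary(fit)
```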

Reporting results with the sjPlot package

Exporting our linear regression results for publication is usually a cumbersome task, because there is a lot of important content in them (p-values, coefficients, other fit metrics) and R does not print particularly nice tables.

One option is to export these numbers and create a new table in any text-editing software. But that takes a lot of effort, and it never looks that great.

The sjPlot package can be used for creating publication-grade output such as tables and plots. It is not restricted to linear models; it can also work with a wide array of techniques (such as principal components and clustering).

Getting ready

The sjPlot package needs...
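As a minimal illustration (the data and model here are assumptions for the example, not taken from the recipe), tab_model() and plot_model() from sjPlot turn a fitted lm object into a publication-ready table and a coefficient plot:

```r
# A minimal sketch, assuming sjPlot is installed; the model is illustrative.
library(sjPlot)

fit <- lm(mpg ~ wt + hp, data = mtcars)   # any fitted linear model works

tab_model(fit)     # publication-ready table: coefficients, CIs, p-values, R^2
plot_model(fit)    # forest plot of the estimated coefficients
```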

Finding correlation between the features

In a linear model, the correlation between the features increases the variance of the associated parameters (the parameters related to those variables). The more correlation we have, the worse it is. The situation is even worse when we have perfect (or almost perfect) correlation between a subset of variables: in that case, the algorithm that we use to fit linear models breaks down. The intuition is the following: if we want to model the impact of a discount (yes-no) and the weather (rain-not rain) on the ice cream sales for a restaurant, and we only have promotions on every rainy day, we would have the following design matrix (where Promotion=1 is yes and Weather=1 is rain):

Promotion   Weather
    1          1
    1          1
    0          0
    0          0

This is problematic, because every time one of them is 1, the other is 1 as well. The model cannot identify...
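A small illustration of this (the variable names and data are made up for the example): with perfectly collinear columns, lm() cannot separate the two effects and reports NA for one of them:

```r
# Illustration with made-up data: Promotion and Weather are perfectly collinear,
# so lm() cannot separate their effects and reports NA for one coefficient.
set.seed(1)
promotion <- c(1, 1, 0, 0)
weather   <- promotion                      # identical column: perfect collinearity
sales     <- 10 + 3 * promotion + rnorm(4)

summary(lm(sales ~ promotion + weather))    # the weather coefficient comes back as NA
cor(promotion, weather)                     # correlation is exactly 1
```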

Testing hypothesis

After a model is fitted, we get coefficients for each variable. In general, the relevant test is whether a coefficient is zero or not; if it is zero, the variable can be safely removed from the model. But sometimes we want to do more complex tests, possibly involving several variables, for example, testing whether the combined coefficients of variable1 and variable2 are equal to the coefficient of variable3.

The way this works is that we will define a contrast, and we will then estimate the significance for that contrast. We will do this using the multcomp package, which allows us to test linear hypotheses for lots of models.

Getting ready

In order to run this recipe, you will need to install the multcomp package via the command...
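As a hedged sketch of what such a contrast looks like (the model and variable names are illustrative, not the recipe's), glht() from multcomp accepts a symbolic description of the linear hypothesis:

```r
# A sketch with illustrative variable names: test whether the combined
# coefficients of x1 and x2 equal the coefficient of x3.
library(multcomp)

set.seed(42)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + 1.0 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

# Contrast: beta_x1 + beta_x2 - beta_x3 = 0
summary(glht(fit, linfct = c("x1 + x2 - x3 = 0")))
```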

Testing homoscedasticity

The ordinary least squares algorithm generates estimates that are unbiased (their expected values are equal to the true values), consistent (they converge in probability to the true values as we get more data), and efficient (they have the minimum variance among unbiased estimators, so they are more stable than the estimates produced by other unbiased techniques). Also, the estimates are distributed according to a Gaussian distribution. But all of this occurs only when certain conditions are met, in particular the following ones:

  • The residuals should be homoscedastic (same variance).
  • The residuals should not be correlated; correlated residuals typically arise with temporal data.
  • There is no perfect correlation between variables (or linear combinations of variables).
  • Exogeneity—the regressors are not correlated with the error term.
  • The model is linear and is correctly specified.
  • There should...
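One common way to check the homoscedasticity assumption (not necessarily the approach used later in this chapter) is the Breusch-Pagan test from the lmtest package:

```r
# A sketch using the Breusch-Pagan test from lmtest; the data are simulated so
# that the error variance grows with |x| (heteroscedastic on purpose).
library(lmtest)

set.seed(7)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = 1 + abs(x))
fit <- lm(y ~ x)

bptest(fit)    # a small p-value is evidence against homoscedasticity
```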

Implementing sandwich estimators

We have seen that the residuals should be homoscedastic (the variance should be the same), and when that is not the case, the t-values no longer follow a Student's t-distribution. The relevant question is naturally how we can fix this. The so-called sandwich estimators from the sandwich package allow us to use heteroscedasticity-robust standard errors: they use a different formula for the coefficients' covariance matrix, (X'X)⁻¹X'ΩX(X'X)⁻¹, where Ω is the new element. This matrix is estimated by the sandwich package, and the formula also makes explicit why this is called the sandwich method (the Ω gets sandwiched between two identical expressions). With this correction, we can still use the t-tests as usual. The best thing is that this is easy to implement.

Getting ready

The sandwich and the lmtest packages need to be installed via install.packages().

How to do it...

...
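A minimal sketch of how this looks (not the book's exact code), using vcovHC() from sandwich together with coeftest() from lmtest:

```r
# A minimal sketch, not the book's exact code: robust (sandwich) standard errors
# via vcovHC() from sandwich, plugged into coeftest() from lmtest.
library(sandwich)
library(lmtest)

set.seed(7)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = 1 + abs(x))       # heteroscedastic errors
fit <- lm(y ~ x)

coeftest(fit)                                      # usual (non-robust) t-tests
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))   # heteroscedasticity-robust t-tests
```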

Variable selection

A fundamental question when doing linear regression is how to choose the best subset of the variables that we have available. Every variable that is added to a model changes the standard errors of the other variables already included. Consequently, the p-values also change, and the order in which variables are added is relevant. This happens because, in general, the variables are correlated, causing the coefficients' covariance matrix to change (and hence the standard errors).
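One standard way to automate this search (not necessarily the recipe's exact method) is stepwise selection by AIC with the built-in step() function:

```r
# A sketch of one standard approach, stepwise selection by AIC with step();
# the data and variable names are made up for the example.
set.seed(3)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)      # x3 and x4 are irrelevant

full <- lm(y ~ ., data = d)
step(full, direction = "both")             # typically drops x3 and x4
```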

Ridge regression

When doing linear regression, if we include a variable that is severely correlated with our regressors, we will be inflating the standard errors for those correlated variables. This happens because, if two variables are correlated, the model can't be sure which one it should assign the effect/coefficient to. Ridge regression allows us to model highly correlated regressors by introducing a bias. Our first thought in statistics is to avoid biased coefficients at all costs. But they might not be that bad after all: if the coefficients are biased but have a much smaller variance than our baseline method, we will be in a better situation. Unbiased coefficients with a high variance will change a lot between different model runs (they are unstable), but they will converge in probability to the right place. Biased coefficients with a low variance will be quite stable...
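A minimal sketch with glmnet (the data are simulated here; alpha = 0 selects the ridge penalty):

```r
# A minimal sketch with glmnet: alpha = 0 selects the ridge penalty. The
# regressors are simulated to be highly correlated.
library(glmnet)
library(MASS)

set.seed(5)
n     <- 200
Sigma <- matrix(0.95, 3, 3); diag(Sigma) <- 1       # strong correlation
X     <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
y     <- drop(1 + X %*% c(2, -1, 0.5) + rnorm(n))

ridge <- cv.glmnet(X, y, alpha = 0)                 # cross-validated ridge
coef(ridge, s = "lambda.min")                       # shrunken coefficients
```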

Working with LASSO

In the previous recipe, we saw that ridge regression gives us much more stable coefficients, at the cost of a small bias (the coefficients are compressed to a smaller size than they should be). It is based on the L2 regularization norm, which is essentially the sum of the squared coefficients. In order to do that, we used the glmnet package, which allows us to decide how much ridge/LASSO regularization we want.

Getting ready

Let's install the same packages as in the previous recipe: glmnet, ggplot2, tidyr, and MASS. They can be installed via install.packages().

How to do it...

...
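A minimal sketch (not the book's exact code): with alpha = 1, glmnet applies the L1 (LASSO) penalty, which can set some coefficients exactly to zero:

```r
# A minimal sketch, not the book's exact code: alpha = 1 selects the L1 (LASSO)
# penalty in glmnet, which can set some coefficients exactly to zero.
library(glmnet)

set.seed(9)
n <- 200
X <- matrix(rnorm(n * 5), ncol = 5)
y <- 1 + 2 * X[, 1] - X[, 2] + rnorm(n)      # only the first two columns matter

lasso <- cv.glmnet(X, y, alpha = 1)
coef(lasso, s = "lambda.min")                # irrelevant coefficients shrink to zero
```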

Leverage, residuals, and influence

For each observation used in a model, there are three relevant metrics that help us understand its impact on the estimated coefficients. The first metric is the leverage: the potential of an observation to change the estimated coefficients. The second relevant metric is the residual, which is the difference between the prediction and the observed value. Finally, the third is the influence, which can be thought of as the product of the leverage and the residual. Another way of looking at this is to think of the leverage as the horizontal distance between an observation and the rest of the data, and the residual as the vertical distance between the observation and the regression line. Essentially, we can have four cases, as depicted in the following graphs:

In A, we have an observation with a high residual...
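These three quantities are easy to compute in base R for any fitted model; a minimal sketch (with data and an unusual point made up for the example):

```r
# A minimal sketch with base R diagnostics; the data and the unusual point are
# made up for the example.
set.seed(11)
x <- c(rnorm(49), 6)                 # one point far from the rest (high leverage)
y <- 1 + 2 * x + rnorm(50)
y[50] <- 0                           # ...and far from the line (high residual)
fit <- lm(y ~ x)

tail(hatvalues(fit))                 # leverage
tail(residuals(fit))                 # residuals
tail(cooks.distance(fit))            # influence (combines leverage and residual)
plot(fit, which = 5)                 # residuals vs leverage, with Cook's distance contours
```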
