Linear Regression

We will cover the following recipes in this chapter:

  • Computing ordinary least squares estimates
  • Reporting results with the sjPlot package
  • Finding correlation between the features
  • Testing hypothesis
  • Testing homoscedasticity
  • Implementing sandwich estimators
  • Variable selection
  • Ridge regression
  • Working with LASSO
  • Leverage, residuals, and influence

Introduction

Linear regression is perhaps the most important tool in statistics. It can be used in a wide array of situations, and it can be extended to handle many cases where the basic formulation does not apply. Conceptually, the idea is to model a dependent variable in terms of a set of independent variables and capture coefficients that relate each independent variable to the dependent one. The usual formula here is as follows (assuming that we have one variable and an intercept):

yi = α + β*xi + ui

Here, the beta (β) and the intercept (α) are the coefficients that we need to find, xi is the independent variable, ui is an unobserved residual, and yi is the target variable. The previous formula can naturally be extended to multiple variables; in that case, we would have multiple β coefficients.

Maybe the most important aspect of linear regression is that we can do very simple yet powerful interpretations...

Computing ordinary least squares estimates

Ordinary least squares estimates are derived from minimizing the sum of the squared residuals. It can be proven that this minimization leads to β̂ = (X'X)⁻¹X'y. It should be noted that we need to compute the inverse of X'X, and that can only be done if its determinant is different from zero. The determinant will be zero if there is a linear dependency between the variables.

It can also be proven that the beta coefficients are distributed according to a Gaussian distribution, with variances equal to the diagonal elements of σ̂²(X'X)⁻¹, where σ̂ is the estimated residual standard error.

How to do it...

In this exercise, we will simulate some data, and compute the estimates using both the lm function and doing the...
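As a sketch of what this involves (not the book's exact code), we can simulate a simple dataset, fit it with lm(), and reproduce the coefficients and standard errors with the matrix formulas above:

```r
# A sketch, not the book's exact code: simulate data, fit with lm(), and
# reproduce the coefficients and standard errors with the matrix formulas.
set.seed(10)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                  # true intercept = 1, true slope = 2

fit <- lm(y ~ x)                           # estimates via lm()
coef(fit)

X        <- cbind(1, x)                    # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta_hat                                   # should match coef(fit)

resid_hat <- y - X %*% beta_hat
sigma2    <- sum(resid_hat^2) / (n - ncol(X))
sqrt(diag(sigma2 * solve(t(X) %*% X)))     # standard errors, as in summary(fit)
```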

Reporting results with the sjPlot package

Exporting our linear regression results for publication is usually a cumbersome task, because there is a lot of important content in them (p-values, coefficients, other fit metrics) and R does not print particularly nice tables.

One option is to export these numbers and create a new table in any text-editing software. But that takes a lot of effort, and it never looks that great.

The sjPlot package can be used for creating publication-grade output such as tables and plots. It is not restricted to linear models; it can also work with a wide array of techniques (such as principal components and clustering).

Getting ready

The sjPlot package needs...
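As a minimal illustration (the data and model here are assumptions for the example, not taken from the recipe), tab_model() and plot_model() from sjPlot turn a fitted lm object into a publication-ready table and a coefficient plot:

```r
# A minimal sketch, assuming sjPlot is installed; the model is illustrative.
library(sjPlot)

fit <- lm(mpg ~ wt + hp, data = mtcars)   # any fitted linear model works

tab_model(fit)     # publication-ready table: coefficients, CIs, p-values, R^2
plot_model(fit)    # forest plot of the estimated coefficients
```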

Finding correlation between the features

In a linear model, the correlation between the features increases the variance of the associated parameters (the parameters related to those variables). The more correlation we have, the worse it is. The situation is even worse when we have perfect (or almost perfect) correlation between a subset of variables: in that case, the algorithm that we use to fit linear models breaks down. The intuition is the following: if we want to model the impact of a discount (yes-no) and the weather (rain-not rain) on the ice cream sales for a restaurant, and we only have promotions on every rainy day, we would have the following design matrix (where Promotion=1 is yes and Weather=1 is rain):

Promotion   Weather
    1          1
    1          1
    0          0
    0          0

This is problematic, because every time one of them is 1, the other is 1 as well. The model cannot identify...
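A small illustration of this (the variable names and data are made up for the example): with perfectly collinear columns, lm() cannot separate the two effects and reports NA for one of them:

```r
# Illustration with made-up data: Promotion and Weather are perfectly collinear,
# so lm() cannot separate their effects and reports NA for one coefficient.
set.seed(1)
promotion <- c(1, 1, 0, 0)
weather   <- promotion                      # identical column: perfect collinearity
sales     <- 10 + 3 * promotion + rnorm(4)

summary(lm(sales ~ promotion + weather))    # the weather coefficient comes back as NA
cor(promotion, weather)                     # correlation is exactly 1
```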

Testing hypothesis

After a model is fitted, we get coefficients for each variable. In general, the relevant test is whether a coefficient is zero or not; if it is zero, the variable can be safely removed from the model. But sometimes we want to do more complex tests, possibly involving several variables, for example, testing whether the combined coefficients of variable1 and variable2 are equal to the coefficient of variable3.

The way this works is that we will define a contrast, and we will then estimate the significance for that contrast. We will do this using the multcomp package, which allows us to test linear hypotheses for lots of models.

Getting ready

In order to run this recipe, you will need to install the multcomp package via the command...
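As a hedged sketch of what such a contrast looks like (the model and variable names are illustrative, not the recipe's), glht() from multcomp accepts a symbolic description of the linear hypothesis:

```r
# A sketch with illustrative variable names: test whether the combined
# coefficients of x1 and x2 equal the coefficient of x3.
library(multcomp)

set.seed(42)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + 1.0 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

# Contrast: beta_x1 + beta_x2 - beta_x3 = 0
summary(glht(fit, linfct = c("x1 + x2 - x3 = 0")))
```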

Testing homoscedasticity

The ordinary least squares algorithm generates estimates that are unbiased (their expected values are equal to the true values), consistent (they converge in probability to the true values as we get more data), and efficient (they have the minimum variance among unbiased estimators, so they are more stable than the estimates produced by other unbiased techniques). Also, the estimates are distributed according to a Gaussian distribution. But all of this occurs only when certain conditions are met, in particular the following ones:

  • The residuals should be homoscedastic (same variance).
  • The residuals should not be correlated; correlated residuals typically arise with temporal data.
  • There is no perfect correlation between variables (or linear combinations of variables).
  • Exogeneity—the regressors are not correlated with the error term.
  • The model is linear and is correctly specified.
  • There should...
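One common way to check the homoscedasticity assumption (not necessarily the approach used later in this chapter) is the Breusch-Pagan test from the lmtest package:

```r
# A sketch using the Breusch-Pagan test from lmtest; the data are simulated so
# that the error variance grows with |x| (heteroscedastic on purpose).
library(lmtest)

set.seed(7)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = 1 + abs(x))
fit <- lm(y ~ x)

bptest(fit)    # a small p-value is evidence against homoscedasticity
```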

Implementing sandwich estimators

We have seen that the residuals should be homoscedastic (the variance should be the same), and when that is not the case, the t-values no longer follow a Student's t-distribution. The relevant question is naturally how we can fix this. The so-called sandwich estimators from the sandwich package allow us to use heteroscedasticity-robust standard errors: they use a different formula for the coefficients' covariance matrix, (X'X)⁻¹X'ΩX(X'X)⁻¹, where Ω is the new element. This matrix is estimated by the sandwich package, and the formula also makes explicit why this is called the sandwich method (the Ω gets sandwiched between two identical expressions). With this correction, we can still use the t-tests as usual. The best thing is that this is easy to implement.

Getting ready

The sandwich and the lmtest packages need to be installed via install.packages().

How to do it...

...
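A minimal sketch of how this looks (not the book's exact code), using vcovHC() from sandwich together with coeftest() from lmtest:

```r
# A minimal sketch, not the book's exact code: robust (sandwich) standard errors
# via vcovHC() from sandwich, plugged into coeftest() from lmtest.
library(sandwich)
library(lmtest)

set.seed(7)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = 1 + abs(x))       # heteroscedastic errors
fit <- lm(y ~ x)

coeftest(fit)                                      # usual (non-robust) t-tests
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))   # heteroscedasticity-robust t-tests
```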

Variable selection

A fundamental question when doing linear regression is how to choose the best subset of the variables that we have available. Every variable that is added to a model changes the standard errors of the other variables already included. Consequently, the p-values also change, and the order in which variables are added is relevant. This happens because, in general, the variables are correlated, causing the coefficients' covariance matrix to change (and hence the standard errors).
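One standard way to automate this search (not necessarily the recipe's exact method) is stepwise selection by AIC with the built-in step() function:

```r
# A sketch of one standard approach, stepwise selection by AIC with step();
# the data and variable names are made up for the example.
set.seed(3)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)      # x3 and x4 are irrelevant

full <- lm(y ~ ., data = d)
step(full, direction = "both")             # typically drops x3 and x4
```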

Ridge regression

When doing linear regression, if we include a variable that is severely correlated with our regressors, we will be inflating the standard errors for those correlated variables. This happens because, if two variables are correlated, the model can't be sure which one it should assign the effect/coefficient to. Ridge regression allows us to model highly correlated regressors by introducing a bias. Our first thought in statistics is to avoid biased coefficients at all costs. But they might not be that bad after all: if the coefficients are biased but have a much smaller variance than our baseline method, we will be in a better situation. Unbiased coefficients with a high variance will change a lot between different model runs (they are unstable), but they will converge in probability to the right place. Biased coefficients with a low variance will be quite stable...
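A minimal sketch with glmnet (the data are simulated here; alpha = 0 selects the ridge penalty):

```r
# A minimal sketch with glmnet: alpha = 0 selects the ridge penalty. The
# regressors are simulated to be highly correlated.
library(glmnet)
library(MASS)

set.seed(5)
n     <- 200
Sigma <- matrix(0.95, 3, 3); diag(Sigma) <- 1       # strong correlation
X     <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
y     <- drop(1 + X %*% c(2, -1, 0.5) + rnorm(n))

ridge <- cv.glmnet(X, y, alpha = 0)                 # cross-validated ridge
coef(ridge, s = "lambda.min")                       # shrunken coefficients
```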

Working with LASSO

In the previous recipe, we saw that ridge regression gives us much more stable coefficients, at the cost of a small bias (the coefficients are compressed to a smaller size than they should be). It is based on the L2 regularization norm, which is essentially the sum of the squared coefficients. In order to do that, we used the glmnet package, which allows us to decide how much ridge/LASSO regularization we want.

Getting ready

Let's install the same packages as in the previous recipe: glmnet, ggplot2, tidyr, and MASS. They can be installed via install.packages().

How to do it...

...
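A minimal sketch (not the book's exact code): with alpha = 1, glmnet applies the L1 (LASSO) penalty, which can set some coefficients exactly to zero:

```r
# A minimal sketch, not the book's exact code: alpha = 1 selects the L1 (LASSO)
# penalty in glmnet, which can set some coefficients exactly to zero.
library(glmnet)

set.seed(9)
n <- 200
X <- matrix(rnorm(n * 5), ncol = 5)
y <- 1 + 2 * X[, 1] - X[, 2] + rnorm(n)      # only the first two columns matter

lasso <- cv.glmnet(X, y, alpha = 1)
coef(lasso, s = "lambda.min")                # irrelevant coefficients shrink to zero
```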

Leverage, residuals, and influence

For each observation used in a model, there are three relevant metrics that help us understand its impact on the estimated coefficients. The first metric is the leverage: the potential of an observation to change the estimated coefficients. The second relevant metric is the residual, which is the difference between the prediction and the observed value. Finally, the third is the influence, which can be thought of as the product of the leverage and the residual. Another way of looking at this is to think of the leverage as the horizontal distance between an observation and the rest of the data, and the residual as the vertical distance between the observation and the regression line. Essentially, we can have four cases, as depicted in the following graphs:

In A, we have an observation with a high residual...
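These three quantities are easy to compute in base R for any fitted model; a minimal sketch (with data and an unusual point made up for the example):

```r
# A minimal sketch with base R diagnostics; the data and the unusual point are
# made up for the example.
set.seed(11)
x <- c(rnorm(49), 6)                 # one point far from the rest (high leverage)
y <- 1 + 2 * x + rnorm(50)
y[50] <- 0                           # ...and far from the line (high residual)
fit <- lm(y ~ x)

tail(hatvalues(fit))                 # leverage
tail(residuals(fit))                 # residuals
tail(cooks.distance(fit))            # influence (combines leverage and residual)
plot(fit, which = 5)                 # residuals vs leverage, with Cook's distance contours
```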
