Chapter 2. Linear Regression

We learned from the previous chapter that regression problems involve predicting a numerical output. The simplest, yet most common, type of regression is linear regression. In this chapter, we'll explore why linear regression is so commonly used, what its limitations are, and how it can be extended.

Introduction to linear regression


In linear regression, the output variable is predicted by a linearly weighted combination of input features. Here is an example of a simple linear model:

ŷ = β0 + β1x

The preceding model essentially says that we are estimating one output, denoted by ŷ, and that this is a linear function of a single predictor variable (that is, a feature) denoted by the letter x. The terms involving the Greek letter β are the parameters of the model and are known as regression coefficients. Once we train the model and settle on values for these parameters, we can make a prediction on the output variable for any value of x by a simple substitution in our equation. Another example of a linear model, this time with three features, has the following form:

ŷ = β0 + β1x1 + β2x2 + β3x3

In this equation, just as with the previous one, we can observe that we have one more coefficient than the number of features. This additional coefficient, β0, is known as the intercept...
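
To make prediction by substitution concrete, here is a minimal R sketch; the coefficient values and the input value below are made up for illustration and are not taken from the book's example.

b0 <- 1.5                  # hypothetical intercept
b1 <- 0.8                  # hypothetical coefficient for the single feature x
x_new <- 10                # a new value of the input feature
y_hat <- b0 + b1 * x_new   # prediction by simple substitution into the model
y_hat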

Simple linear regression


Before looking at some real-world data sets, it is very helpful to try to train a model on artificially generated data. In an artificial scenario such as this, we know what the true output function is beforehand, something that as a rule is not the case when it comes to real-world data. The advantage of performing this exercise is that it gives us a good idea of how our model works under the ideal scenario when all of our assumptions are fully satisfied, and it helps visualize what happens when we have a good linear fit. We'll begin by simulating a simple linear regression model. The following R snippet is used to create a data frame with 100 simulated observations of the following linear model with a single input feature:

y = β0 + β1x + ε

Here is the code for the simple linear regression model:

> set.seed(5427395)
> nObs = 100
> x1minrange = 5
> x1maxrange = 25
> x1 = runif(nObs, x1minrange, x1maxrange)
> e = rnorm(nObs, mean = 0, sd = 2.0)
> y = 1.67 * x1 - 2...
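
A minimal sketch of how the simulation might be completed and fitted is shown below; the intercept value of -2.0 is a placeholder rather than the book's actual value, while the slope 1.67 is taken from the visible code. We fit the model with lm() and check that the estimated coefficients land close to the values used to generate the data.

# Placeholder completion: -2.0 stands in for the intercept (not the book's value).
y <- 1.67 * x1 - 2.0 + e
sim_data <- data.frame(y = y, x1 = x1)

# Fit simple linear regression to the simulated observations; the estimated
# intercept and slope should be close to the generating values.
sim_model <- lm(y ~ x1, data = sim_data)
coef(sim_model)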

Multiple linear regression


Whenever we have more than one input feature and want to build a linear regression model, we are in the realm of multiple linear regression. The general equation for a multiple linear regression model with k input features is:

y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Our assumptions about the model and about the error component ε remain the same as with simple linear regression; in addition, because we now have more than one input feature, we assume that the features are independent of each other. Instead of using simulated data to demonstrate multiple linear regression, we will analyze two real-world data sets.
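
Before turning to the real-world data, the following minimal sketch, built on a hypothetical simulated data frame with columns x1, x2, and x3, simply illustrates how the lm() formula syntax extends to several input features.

# Hypothetical data frame with three input features and one output.
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- 2 + 0.5 * df$x1 - 1.2 * df$x2 + 0.3 * df$x3 + rnorm(50, sd = 0.5)

# Additional features are simply added to the right-hand side of the formula.
multi_model <- lm(y ~ x1 + x2 + x3, data = df)
coef(multi_model)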

Predicting CPU performance

Our first real-world data set was presented by the researchers Dennis F. Kibler, David W. Aha, and Marc K. Albert in a 1989 paper titled Instance-based prediction of real-valued attributes, published in the journal Computational Intelligence. The data contain the characteristics of different CPU models, such as the cycle time and the amount of cache memory. When deciding between...

Assessing linear regression models


We'll proceed once again with using the lm() function to fit linear regression models to our data. For both of our data sets, we'll want to use all the input features that remain in our respective data frames. R provides us with a shorthand to write formulas that include all the columns of a data frame as features, excluding the one chosen as the output. This is done using a single period, as the following code snippets show:

> machine_model1 <- lm(PRP ~ ., data = machine_train)
> cars_model1 <- lm(Price ~ ., data = cars_train)

Training a linear regression model may be a one-line affair once we have all our data prepared, but the important work comes straight after, when we study our model in order to determine how well we did. Fortunately, we can instantly obtain some important information about our model using the summary() function. The output of this function for our CPU data set is shown here:

> summary(machine_model1)

Call:
lm(formula...
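
Beyond summary(), a natural check is how the model predicts on data it has not seen. The following is a hedged sketch that computes the mean squared error on the training data and on a held-out test set; a machine_test data frame with the same columns as machine_train is assumed to exist.

# MSE on the training data and on a held-out test set (machine_test is an
# assumed hold-out split with the same columns as machine_train).
train_predictions <- predict(machine_model1, machine_train)
train_mse <- mean((machine_train$PRP - train_predictions)^2)

test_predictions <- predict(machine_model1, machine_test)
test_mse <- mean((machine_test$PRP - test_predictions)^2)

c(train = train_mse, test = test_mse)

A test MSE that is much larger than the training MSE is one sign that the model may be overfitting.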

Problems with linear regression


In this chapter, we've already seen some examples where trying to build a linear regression model might run into problems. One big class of problems that we've talked about relates to our model assumptions of linearity, feature independence, and the homoscedasticity and normality of errors. In particular, we saw methods of diagnosing these problems either via plots, such as the residual plot, or by using functions that identify dependent components. In this section, we'll investigate a few more issues that can arise with linear regression.
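
As a brief sketch of the residual plot mentioned above, the code below plots residuals against fitted values for the CPU model using base R graphics; a visible trend or a funnel shape in this plot would point to non-linearity or non-constant error variance.

# Residuals versus fitted values for the CPU model.
plot(fitted(machine_model1), residuals(machine_model1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# plot(machine_model1) would produce the standard set of lm diagnostic plots.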

Multicollinearity

As part of our preprocessing steps, we were diligent in removing features that were linearly related to each other. In doing this, we were looking for an exact linear relationship; this is an example of perfect collinearity. More generally, collinearity is the property that describes two features that are approximately in a linear relationship. This creates a problem for linear regression as we are trying to assign separate...
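
One common way to quantify approximate collinearity is the variance inflation factor. As a hedged sketch, the vif() function from the car package can be applied to a fitted lm object such as our CPU model.

# Variance inflation factors: values well above 1 (rules of thumb often use
# 5 or 10) suggest a feature is close to a linear combination of the others.
library(car)
vif(machine_model1)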

Feature selection


Our CPU model only came with six features. Often, we encounter real-world data sets that have a very large number of features arising from a diverse array of measurements. Alternatively, we may have to come up with a large number of features when we aren't really sure what features will be important in influencing our output variable. Moreover, we may have categorical variables with many possible levels from which we are forced to create a large number of new indicator variables, as we saw in Chapter 1, Gearing Up for Predictive Modeling. When our scenario involves a large number of features, we often find that our output only depends on a subset of these. Given k input features, there are 2^k distinct subsets that we can form, so for even a moderate number of features, the space of subsets is too large for us to fully explore by fitting a model on each subset.

Tip

One easy way to understand why there are 2^k possible feature subsets is this: we can assign a unique identifying binary string of length k to every subset, where the ith digit is 1 if the ith feature is included in the subset and 0 otherwise. As there are 2^k such binary strings, there are 2^k possible subsets.
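
Because exhaustively fitting all 2^k subsets is usually infeasible, greedy search procedures are commonly used instead. As a hedged sketch, base R's step() function performs stepwise selection guided by the AIC, starting here from the full CPU model.

# Backward stepwise selection: at each step, drop the variable whose removal
# most improves the AIC, stopping when no removal helps.
machine_model_step <- step(machine_model1, direction = "backward", trace = 0)
formula(machine_model_step)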

Regularization


Variable selection is an important process, as it tries to make models simpler to interpret, easier to train, and free of spurious associations by eliminating variables unrelated to the output. This is one possible approach to dealing with the problem of overfitting. In general, we don't expect a model to completely fit our training data; in fact, the problem of overfitting often means that it may be detrimental to our predictive model's accuracy on unseen data if we fit our training data too well. In this section on regularization, we'll study an alternative to reducing the number of variables in order to deal with overfitting. Regularization is essentially a process of introducing an intentional bias or constraint in our training procedure that prevents our coefficients from taking large values. As this is a process that tries to shrink the coefficients, the methods we'll look at are also known as shrinkage methods.
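
As a hedged sketch of shrinkage in practice, the glmnet package fits penalized linear models over a grid of values for the regularization strength lambda, with alpha = 0 giving ridge regression (discussed next) and alpha = 1 giving the lasso; the code below assumes the machine_train data frame is available and contains only numeric columns.

# Ridge regression on the CPU data; glmnet expects a numeric predictor matrix
# and a response vector rather than a formula interface.
library(glmnet)
x_train <- as.matrix(machine_train[, setdiff(names(machine_train), "PRP")])
y_train <- machine_train$PRP

# alpha = 0 selects the ridge penalty; lambda is chosen by cross-validation.
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)
coef(ridge_cv, s = "lambda.min")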

Ridge regression

When the number of parameters is very large...

Summary


In this chapter, we studied linear regression, a method that allows us to fit a linear model in a supervised learning setting where we have a number of input features and a single numeric output. Simple linear regression is the name given to the scenario where we have only one input feature, and multiple linear regression describes the case where we have multiple input features. Linear regression is very commonly used as a first approach to solving a regression problem. It assumes that the output is a linear weighted combination of the input features in the presence of an irreducible error component that is normally distributed and has zero mean and constant variance. The model also assumes that the features are independent. The performance of linear regression can be assessed by a number of different metrics, from the more standard MSE to others, such as the R² statistic. We explored several model diagnostics and significance tests designed to detect problems from violated assumptions...
