Mastering Predictive Analytics with R
Chapter 3. Logistic Regression

For regression tasks, where the goal is to predict a numerical output such as price or temperature, we've seen that linear regression can be a good starting point. It is simple to train and easy to interpret, even though, as a model, it makes strict assumptions about the data and the underlying target function. Before studying more advanced techniques to tackle regression problems, we'll introduce logistic regression. Despite its somewhat misleading name, this is actually our first model for performing classification. As we learned in Chapter 1, Gearing Up for Predictive Modeling, in classification problems our output is qualitative and thus comprises a finite set of values, which we call classes. We'll begin with the binary classification scenario, where we are trying to distinguish between two classes, which we'll arbitrarily label as 0 and 1; later on, we'll extend this to distinguishing between multiple classes.

Classifying with linear regression


Even though we know classification problems involve qualitative outputs, it seems natural to ask whether we could use our existing knowledge of linear regression and apply it to the classification setting. We could do this by training a linear regression model to predict a value in the interval [0,1], remembering that we've chosen to label our two classes as 0 and 1. Then, we could apply a threshold to the output of our model in such a way that if the model outputs a value below 0.5, we would predict class 0; otherwise, we would predict class 1. The following graph demonstrates this concept for a simple linear regression with a single input feature X1 and for a binary classification problem. Our output variable y is either 0 or 1, so all the data lies on two horizontal lines. The solid line shows the output of the model, and the dashed line shows the decision boundary, which arises when we put a threshold on the model's predicted output at the value 0.5...
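This setup is easy to try out in code. Here is a minimal R sketch (not the book's code) that fits a linear regression to simulated 0/1 labels and thresholds its output at 0.5:

> set.seed(1)
> x1 <- c(rnorm(50, mean = 0), rnorm(50, mean = 3))  # two separated groups
> y <- rep(c(0, 1), each = 50)                       # class labels 0 and 1
> linear_model <- lm(y ~ x1)
> predictions <- as.numeric(predict(linear_model) > 0.5)
> mean(predictions == y)  # training accuracy of the thresholded model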

Introduction to logistic regression


In logistic regression, the input features are combined linearly, just as in linear regression; however, the result is then fed as an input to the logistic function. This function applies a nonlinear transformation to its input and ensures that the range of the output, which is interpreted as the probability of the input belonging to class 1, lies in the interval [0, 1]. The form of the logistic function is as follows:

$$f(x) = \frac{1}{1 + e^{-x}}$$
Here is a plot of the logistic function:
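One way to generate such a plot, as a minimal R sketch using base graphics:

> logistic <- function(x) 1 / (1 + exp(-x))
> x <- seq(-6, 6, length.out = 200)
> plot(x, logistic(x), type = "l", ylab = "f(x)")
> abline(h = 0.5, lty = 2)  # the function passes through 0.5 at x = 0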

When x = 0, the logistic function takes the value 0.5. As x tends to +∞, the exponential in the denominator vanishes and the function approaches the value 1. As x tends to -∞, the exponential, and hence the denominator, tends to infinity, so the function approaches the value 0. Thus, our output is guaranteed to lie in the interval [0, 1], which is necessary for it to be a probability.

Generalized linear models

Logistic regression belongs to a class of models known as generalized linear models (GLMs...
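To make the connection concrete, the link function used by logistic regression is the logit, which, as the chapter summary notes, relates the output to a linear combination of the input features by equating it with the log-odds of class 1:

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

Inverting this relationship recovers the logistic function from the previous section, applied to the linear combination of the features.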

Predicting heart disease


We'll put logistic regression for the binary classification task to the test with a real-world data set from the UCI Machine Learning Repository. This time, we will be working with the Statlog (Heart) data set, which we will refer to as the heart data set henceforth for brevity. The data set can be downloaded from the UCI Machine Learning Repository's website at http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29. The data contain 270 observations for patients with potential heart problems. Of these, 120 patients were shown to have heart problems, so the split between the two classes is fairly even. The task is to predict whether a patient has heart disease based on their profile and a series of medical tests. First, we'll load the data into a data frame and rename the columns according to the website:

> heart <- read.table("heart.dat", quote = "\"")
> names(heart) <- c("AGE", "SEX", "CHESTPAIN", "RESTBP", "CHOL",
    "SUGAR", "ECG", "MAXHR", "ANGINA", "DEP", "EXERCISE", "FLUOR",
    "THAL", "OUTPUT")   # names for the 13 features and the class column
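The sections that follow refer to a training frame named heart_train and a fitted logistic regression model. Here is a minimal sketch of the intervening steps; the seed, the split size, the model name heart_model, and the recoding of the original 1/2 class labels to 0/1 are all assumptions for illustration, not necessarily the book's choices:

> # Categorical inputs become factors so that glm() will dummy-code
> # them (producing binary features such as THAL6 and THAL7 below)
> heart$CHESTPAIN <- factor(heart$CHESTPAIN)
> heart$ECG <- factor(heart$ECG)
> heart$THAL <- factor(heart$THAL)
> heart$OUTPUT <- heart$OUTPUT - 1    # assumed recoding of labels to 0/1
> set.seed(1)                         # assumed seed
> train_indices <- sample(1:nrow(heart), 230)   # assumed split size
> heart_train <- heart[train_indices, ]
> heart_test <- heart[-train_indices, ]
> heart_model <- glm(OUTPUT ~ ., data = heart_train,
    family = binomial("logit"))
> summary(heart_model)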

Assessing logistic regression models


The summary of the logistic regression model produced with the glm() function has a similar format to that of the linear regression model produced with the lm() function. It shows us that for our categorical variables, we have one fewer binary feature than the number of levels in the original variable; for example, the three-valued THAL input feature produced two binary variables, labeled THAL6 and THAL7. We'll begin by looking at the regression coefficients that are predicted with our model. These are presented with their corresponding z-statistic. This is analogous to the t-statistic that we saw in linear regression, and again, the higher the absolute value of the z-statistic, the more likely it is that this particular feature is significantly related to our output variable. The p-values next to the z-statistic express this notion as a probability and are annotated with stars and dots, as they were in linear regression, indicating the smallest...
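The dummy coding just described is easy to inspect directly. For a three-level factor, R's default treatment contrasts produce two binary columns, with the first level serving as the baseline:

> contrasts(factor(c(3, 6, 7)))
  6 7
3 0 0
6 1 0
7 0 1

This is why the THAL variable, with levels 3, 6, and 7, appears in the model as the two binary features THAL6 and THAL7.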

Regularization with the lasso


In the previous chapter on linear regression, we used the glmnet package to perform regularization with ridge regression and the lasso. As we've seen, it might be a good idea to remove some of our features, so we'll try applying the lasso to our data set and assess the results. First, we'll train a series of regularized models with glmnet(), and then we'll use cv.glmnet() to estimate a suitable value for λ. Finally, we'll examine the coefficients of our regularized model using this λ:

> library(glmnet)
> heart_train_mat <- model.matrix(OUTPUT ~ ., heart_train)[,-1]
> lambdas <- 10 ^ seq(8, -4, length = 250)
> heart_models_lasso <- glmnet(heart_train_mat, heart_train$OUTPUT,
    alpha = 1, lambda = lambdas, family = "binomial")
> lasso.cv <- cv.glmnet(heart_train_mat, heart_train$OUTPUT,
    alpha = 1, lambda = lambdas, family = "binomial")
> lambda_lasso <- lasso.cv$lambda.min
> lambda_lasso
[1] 0.01057052

> predict(heart_models_lasso, type = "coefficients", s = lambda_lasso)

Classification metrics


Although we looked at the test set accuracy for our model, we know from Chapter 1, Gearing Up for Predictive Modeling, that the binary confusion matrix can be used to compute a number of other useful performance metrics for our data, such as precision, recall, and the F measure.
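To obtain the class predictions themselves, we threshold the model's predicted probabilities at 0.5. A minimal sketch, assuming the fitted model is named heart_model as in the earlier sketch:

> train_probabilities <- predict(heart_model, heart_train,
    type = "response")
> train_class_predictions <- as.numeric(train_probabilities > 0.5)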

We'll compute these for our training set now:

> (confusion_matrix <- table(predicted = train_class_predictions, actual = heart_train$OUTPUT))
         actual
predicted   0   1
        0 118  16
        1  10  86
> (precision <- confusion_matrix[2, 2] / sum(confusion_matrix[2,]))
[1] 0.8958333
> (recall <- confusion_matrix[2, 2] / sum(confusion_matrix[,2]))
[1] 0.8431373
> (f <- 2 * precision * recall / (precision + recall))
[1] 0.8686869 

Here, we used the trick of bracketing our assignment statements to simultaneously assign the result of an expression to a variable and print out the value assigned. Now, recall is the ratio of correctly identified instances of class 1, divided...

Extensions of the binary logistic classifier


So far, the focus of this chapter has been on the binary classification task, where we have two classes. We'll now turn to the problem of multiclass prediction. In Chapter 1, Gearing Up for Predictive Modeling, we studied the iris data set, where the goal is to distinguish between three different species of iris based on features that describe the external appearance of iris flower samples. Before presenting additional examples of multiclass problems, we'll state an important caveat: several other methods for classification that we will study in this book, such as neural networks and decision trees, are both more natural and more commonly used than logistic regression for classification problems involving more than two classes. With that in mind, we'll turn to multinomial logistic regression, our first extension of the binary logistic classifier.

Multinomial logistic regression

Suppose our target variable comprises K classes...
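As a quick illustration of what such a model looks like in practice (this is not the book's code), a multinomial logistic regression can be fit in R with the multinom() function from the nnet package, using the iris data set mentioned above:

> library(nnet)
> multi_model <- multinom(Species ~ ., data = iris)
> table(predicted = predict(multi_model), actual = iris$Species)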

Summary


Logistic regression is the prototypical method for solving classification problems, just as linear regression was the prototypical example of a model to solve regression problems. In this chapter, we demonstrated why logistic regression offers a better way of approaching classification problems compared to linear regression with a threshold, by showing that the least squares criterion is not the most appropriate criterion to use when trying to separate two classes. We presented the notion of likelihood and its maximization as the basis for training a model. This is a very important concept that features time and again in various machine learning contexts. Logistic regression is an example of a generalized linear model. This is a model that relates the output variable to a linear combination of input features via a link function, which we saw was the logit function in this case. For the binary classification problem, we used R's glm() function to perform logistic regression on a real...
