
Hands-On Ensemble Learning with R

By Prabhanjan Narayanachar Tattar
About this book
Ensemble techniques are used for combining two or more similar or dissimilar machine learning algorithms to create a stronger model. Such a model delivers superior prediction power and can give your datasets a boost in accuracy. Hands-On Ensemble Learning with R begins with the important statistical resampling methods. You will then walk through the central trilogy of ensemble techniques – bagging, random forest, and boosting – and learn how they can be used to provide greater accuracy on large datasets using popular R packages. You will learn how to combine model predictions using different machine learning algorithms to build ensemble models. In addition to this, you will explore how to improve the performance of your ensemble models. By the end of this book, you will have learned how machine learning algorithms can be combined to reduce common problems and build simple, efficient ensemble models with the help of real-world examples.
Publication date: July 2018
Publisher: Packt
Pages: 376
ISBN: 9781788624145

 

Chapter 1. Introduction to Ensemble Techniques

Ensemble techniques are methods for aggregating the outputs of multiple models; they have evolved over the past decade and a half in the areas of statistics and machine learning, and they form the central theme of this book. Any user of statistical models and machine learning tools will be familiar with the problem of building a model and the vital decision of choosing among potential candidate models. A model's accuracy is certainly not the only relevant criterion; we are also concerned with its complexity, as well as whether or not the overall model makes practical sense.

Common modeling problems include the decision of which model to choose, and various methodologies exist to aid this task. In statistics, we resort to measures such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), and on other fronts, the p-values associated with the variables in the fitted model help with the decision. This process is generally known as model selection. The ridge penalty, lasso, and other regularization methods also help with this task. For machine learning models such as neural networks, decision trees, and so on, k-fold cross-validation is useful: the model is built using one part of the data, referred to as the training data, and its accuracy is then assessed on the unseen part, the validation data. If the model is sensitive to its complexity, the exercise could be futile.
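
As a small, hypothetical illustration of information-criterion-based model selection (using R's built-in mtcars data, which is not one of this book's datasets), two candidate linear models can be compared with the AIC and BIC functions:

data(mtcars)
fit_small <- lm(mpg ~ wt, data = mtcars)              # simpler candidate model
fit_large <- lm(mpg ~ wt + hp + disp, data = mtcars)  # richer candidate model
AIC(fit_small, fit_large)   # lower AIC indicates the preferred model
BIC(fit_small, fit_large)   # BIC penalizes model complexity more heavily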

The process of obtaining the best model means that we create a host of other models, which are themselves nearly as efficient as the best model. Moreover, while the best model accurately covers the majority of samples, other models might be accurate in the regions of the variable space where the best model is inaccurate. Consequently, the final shortlisted model has few advantages over the runner-up, and the next models in line are not so poor as to merit outright rejection. This makes it necessary to find a way of taking most of the results already obtained from the models and combining them in a meaningful way. The search for a method of putting together various models is the main objective of ensemble learning. Alternatively, one can say that ensemble learning transforms competing models into collaborating models. In fact, ensemble techniques are not restricted to supervised modeling, as they also extend to unsupervised learning problems; we will demonstrate an example that justifies the need for this.

The implementation of ensemble methods would have been impossible without the advent of modern computational power. Statistical methods anticipated techniques that require immense computation; methods such as permutation tests and the jackknife are early evidence of what computational power makes possible. We will undertake an exercise to learn these later in the chapter, and we will revisit them later on in the book.
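
As a foretaste of what such resampling computations look like (a hypothetical sketch, not the exercise from later in the chapter), the jackknife estimate of the standard error of a sample mean takes only a few lines of R:

# Jackknife standard error of the mean for an illustrative numeric sample
set.seed(123)
x <- rnorm(30, mean = 10, sd = 2)
n <- length(x)
theta_jack <- sapply(1:n, function(i) mean(x[-i]))   # leave-one-out means
jack_se <- sqrt(((n - 1)/n) * sum((theta_jack - mean(theta_jack))^2))
jack_se   # compare with the usual sd(x)/sqrt(n)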

From a machine learning perspective, supervised and unsupervised learning are the two main types of learning technique. Supervised learning is the arm of machine learning in which a certain variable is known, and the purpose is to understand this variable through various other variables; here, we have a target variable. Since learning takes place with respect to the output variable, supervised learning is sometimes referred to as learning with a teacher. Not all target variables are alike, and they often fall under one of the following four types. If the goal is to classify observations into one of k classes (for example, Yes/No, or Satisfied/Dissatisfied), then we have a classification problem. Such a variable is referred to as a categorical variable in statistics. It is also possible that the variable of interest is a continuous variable, which is numeric from a software perspective. This may include a car's mileage per liter, a person's income, or a person's age. For such scenarios, the purpose of the machine learning problem is to learn the variable in terms of other associated variables, and then predict it for unknown cases in which only the values of the associated variables are available. We will broadly refer to this class of problem as a regression problem.

In clinical trials, the time to an event is often of interest. When an illness is diagnosed, we would ask whether the proposed drug is an improvement on the existing one. While the variable in question here is the length of time between diagnosis and death, clinical trial data poses several other problems: the analysis cannot wait until all the patients have died, and some of the patients may have moved away from the study, making it no longer possible to know their status. Consequently, we have censored data; for some of the study observations, complete information is not available. Survival analysis largely deals with such problems, and we will undertake the problem of creating ensemble models for this setting too.

With classification, regression, and survival data, it may be assumed that the instances/observations are independent of each other. This is a very reasonable assumption, in that there is valid reason to believe that patients will respond to a drug independently of other patients, a customer will churn or repay a loan independently of other customers, and so forth. In yet another important class of problems, this assumption is not met, and we are left with observations that depend on each other, as with time series data. An example of time series data is the daily closing value of a company's stock. Clearly, the performance of a company's stock cannot be independent from one day to the next, and thus we need to factor in dependency.

In many practical problems, the goal is to understand patterns or find groups of observations, and we don't have a specific variable of interest with regard to which the algorithm needs to be trained. Finding groups or clusters is referred to as unsupervised learning, or learning without a teacher. Two main practical problems that arise in finding clusters are that (i) it is generally not known in advance how many clusters there are in the population, and (ii) different choices of initial cluster centers lead to different solutions. Thus, we need a solution that is free from, or at least indifferent to, initialization and takes the positives of each useful solution into consideration. This will lead us toward unsupervised ensemble techniques.

The search for the best models, supervised or unsupervised, is often hindered by the presence of outliers. A single outlier is known to heavily influence the overall fit of linear models, and it can significantly impact even nonlinear models. Outlier detection is a challenge in itself, and a huge body of statistical methods helps in identifying outliers; a host of machine learning methods also help. Of course, ensembles will help here too, and we will develop R programs that solve the problem of identifying outliers. This approach will be referred to as outlier ensembles.

At the outset, it is important that the reader becomes familiar with the datasets used in this book. All major datasets will be introduced in the first section. We begin the chapter with a brief introduction to the core statistical/machine learning models and put them into action immediately afterward. It will quickly become apparent that no single class of model performs better than all the others across datasets. If any such solution existed, we wouldn't need ensemble techniques.

In this chapter, we will cover:

  • Datasets: The core datasets that will be used throughout the book

  • Statistical/machine learning models: Important classification models will be explained here

  • The right model dilemma: The absence of a dominating model

  • An ensemble purview: The need for ensembles

  • Complementary statistical tests: Important statistical tests that will be useful for model comparisons will be discussed here

The following R packages will be required for this chapter:

  • ACSWR

  • caret

  • e1071

  • factoextra

  • mlbench

  • NeuralNetTools

  • perm

  • pROC

  • RSADBE

  • rpart

  • survival

  • nnet

 

Datasets


Data is undoubtedly the most important component of machine learning; without data, we would have no common purpose. In most cases, the purpose for which the data is collected defines the problem itself. Since the variables might be of several types, the way the data is stored and organized is also very important.

Lee and Elder (1997) considered a series of datasets and introduced the need for ensemble models. We will begin by looking at the details of the datasets considered in their paper, and we will then refer to other important datasets later on in the book.

Hypothyroid

The hypothyroid dataset Hypothyroid.csv is available in the book's code bundle, located at /…/Chapter01/Data. While we have 26 variables in the dataset, we will only be using seven of them. Here, the number of observations is n = 3163. The dataset was downloaded from http://archive.ics.uci.edu/ml/datasets/thyroid+disease, where the filename is hypothyroid.data (http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data). After some tweaks, such as relabeling certain values, the CSV file is made available in the book's code bundle. The purpose of the study is to classify a patient as having a thyroid problem or not, based on the information provided by the other variables. There are multiple variants of the dataset, and the reader can delve into the details at the following web page: http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/HELLO. Here, the column representing the variable of interest is named Hypothyroid, and it shows that we have 151 patients with thyroid problems; the remaining 3012 tested negative. Clearly, this dataset is an example of unbalanced data, which means that one of the two classes heavily outnumbers the other: for each thyroid case, we have about 20 negative cases. Such problems need to be handled differently, and we need to get into the subtleties of the algorithms to build meaningful models. The additional variables or covariates that we will use while building the predictive models are Age, Gender, TSH, T3, TT4, T4U, and FTI. The data is first imported into an R session and subset according to the variables of interest as follows:

> HT <- read.csv("../Data/Hypothyroid.csv",header = TRUE,stringsAsFactors = F)
> HT$Hypothyroid <- as.factor(HT$Hypothyroid)
> HT2 <- HT[,c("Hypothyroid","Age","Gender","TSH","T3","TT4","T4U","FTI")]

The first line of code imports the data from the Hypothyroid.csv file using the read.csv function, the second converts the Hypothyroid column into a factor, and the third retains only the variables of interest. The resulting dataset has a lot of missing values, as seen here:

> sapply(HT2,function(x) sum(is.na(x)))
Hypothyroid         Age      Gender         TSH          T3         TT4 
          0         446          73         468         695         249 
        T4U         FTI 
        248         247 

Consequently, we remove all the rows that have a missing value, and then split the data into training and testing datasets. We will also create a formula for the classification problem:

> HT2 <- na.omit(HT2)
> set.seed(12345)
> Train_Test <- sample(c("Train","Test"),nrow(HT2),replace=TRUE, prob=c(0.7,0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> HT2_Train <- HT2[Train_Test=="Train",]
> HT2_TestX <- within(HT2[Train_Test=="Test",],rm(Hypothyroid))
> HT2_TestY <- HT2[Train_Test=="Test",c("Hypothyroid")]
> HT2_Formula <- as.formula("Hypothyroid~.")

The set.seed function ensures that the results are reproducible each time we run the program. After removing the missing observations with the na.omit function, we split the hypothyroid data into training and testing parts: the former is used to build the model, and the latter is used to validate it on data that was not used during model building. Quinlan – the inventor of the popular tree algorithm C4.5 – used this dataset extensively.

Waveform

This dataset is an example of a simulation study. Here, we have twenty-one variables as input or independent variables, and a class variable referred to as classes. The data is generated using the mlbench.waveform function from the mlbench R package. For more details, refer to the following link: ftp://ftp.ics.uci.edu/pub/machine-learning-databases. We will simulate 5,000 observations for this dataset. As mentioned earlier, the set.seed function guarantees reproducibility. Since we are solving binary classification problems, we will reduce the three classes generated by the waveform function to two, and then partition the data into training and testing parts for model building and testing purposes:

> library(mlbench)
> set.seed(123)
> Waveform <- mlbench.waveform(5000)
> table(Waveform$classes)
   1    2    3 
1687 1718 1595 
> Waveform$classes <- ifelse(Waveform$classes!=3,1,2)
> Waveform_DF <- data.frame(cbind(Waveform$x,Waveform$classes)) # Data Frame
> names(Waveform_DF) <- c(paste0("X",".",1:21),"Classes")
> Waveform_DF$Classes <- as.factor(Waveform_DF$Classes)
> table(Waveform_DF$Classes)
   1    2 
3405 1595 

The R function mlbench.waveform creates a new object of the mlbench class. Since it consists of two sub-parts in x and classes, we will convert it into data.frame following some further manipulations. The cbind function binds the two objects x (a matrix) and classes (a numeric vector) into a single matrix. The data.frame function converts the matrix object into a data frame, which is the class desired for the rest of the program.

After partitioning the data, we will create the required formula for the waveform dataset:

> set.seed(12345)
> Train_Test <- sample(c("Train","Test"),nrow(Waveform_DF),replace = TRUE,
+ prob = c(0.7,0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> Waveform_DF_Train <- Waveform_DF[Train_Test=="Train",]
> Waveform_DF_TestX <- within(Waveform_DF[Train_Test=="Test",],rm(Classes))
> Waveform_DF_TestY <- Waveform_DF[Train_Test=="Test","Classes"]
> Waveform_DF_Formula <- as.formula("Classes~.")

German Credit

Loans are not always repaid in full, and there are defaulters. In this case, it becomes important for the bank to identify potential defaulters based on the available information. Here, we adapt the GC dataset from the RSADBE package to properly reflect the labels of the factor variable. The transformed dataset is available as GC2.RData in the data folder. The GC dataset itself is mainly an adaptation of the version available at https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data). Here, we have 1,000 observations, and 20 covariate/independent variables such as the status of existing checking account, duration, and so forth. The final status of whether the loan was completely paid or not is available in the good_bad column. We will partition the data into training and testing parts, and create the formula too:

> library(RSADBE)
> load("../Data/GC2.RData")
> table(GC2$good_bad)
 bad good 
 300  700 
> set.seed(12345)
> Train_Test <- sample(c("Train","Test"),nrow(GC2),replace = TRUE,prob=c(0.7,0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> GC2_Train <- GC2[Train_Test=="Train",]
> GC2_TestX <- within(GC2[Train_Test=="Test",],rm(good_bad))
> GC2_TestY <- GC2[Train_Test=="Test","good_bad"]
> GC2_Formula <- as.formula("good_bad~.")

Iris

Iris is probably the most famous classification dataset. The great statistician Sir R. A. Fisher popularized the dataset, which he used for classifying the three types of iris plants based on length and width measurements of their petals and sepals. Fisher used this dataset to pioneer the invention of the statistical classifier linear discriminant analysis. Since there are three species of iris, we converted this into a binary classification problem, separated the dataset, and created a formula as seen here:

> data("iris")
> ir2 <- iris
> ir2$Species <- ifelse(ir2$Species=="setosa","S","NS")
> ir2$Species <- as.factor(ir2$Species)
> set.seed(12345)
> Train_Test <- sample(c("Train","Test"),nrow(ir2),replace = TRUE,prob=c(0.7,0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> ir2_Train <- ir2[Train_Test=="Train",]
> ir2_TestX <- within(ir2[Train_Test=="Test",],rm(Species))
> ir2_TestY <- ir2[Train_Test=="Test","Species"]
> ir2_Formula <- as.formula("Species~.")

Pima Indians Diabetes

Diabetes is a health hazard that is mostly incurable, and patients who are diagnosed with it have to adjust their lifestyles to manage the condition. Based on variables such as pregnant, glucose, pressure, triceps, insulin, mass, pedigree, and age, the problem here is to classify a person as diabetic or not. Here, we have 768 observations. This dataset is drawn from the mlbench package:

> data("PimaIndiansDiabetes")
> set.seed(12345)
> Train_Test <- sample(c("Train","Test"),nrow(PimaIndiansDiabetes),replace = TRUE,
+ prob = c(0.7,0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> PimaIndiansDiabetes_Train <- PimaIndiansDiabetes[Train_Test=="Train",]
> PimaIndiansDiabetes_TestX <- within(PimaIndiansDiabetes[Train_Test=="Test",],
+                                     rm(diabetes))
> PimaIndiansDiabetes_TestY <- PimaIndiansDiabetes[Train_Test=="Test","diabetes"]
> PID_Formula <- as.formula("diabetes~.")

The five datasets described up to this point are classification problems. We look at one example each for regression, time series, survival, clustering, and outlier detection problems.

US Crime

A study of the crime rate per million of the population across the 47 different states of the US is undertaken here, and an attempt is made to find its dependency on 13 variables. These include age distribution, an indicator of southern states, the average number of schooling years, and so on. As with the earlier datasets, we also partition this one, using the following chunk of R code:

> library(ACSWR)
Warning message:
package 'ACSWR' was built under R version 3.4.1 
> data(usc)
> str(usc)
'data.frame':	47 obs. of  14 variables:
 $ R  : num  79.1 163.5 57.8 196.9 123.4 ...
 $ Age: int  151 143 142 136 141 121 127 131 157 140 ...
 $ S  : int  1 0 1 0 0 0 1 1 1 0 ...
 $ Ed : int  91 113 89 121 121 110 111 109 90 118 ...
 $ Ex0: int  58 103 45 149 109 118 82 115 65 71 ...
 $ Ex1: int  56 95 44 141 101 115 79 109 62 68 ...
 $ LF : int  510 583 533 577 591 547 519 542 553 632 ...
 $ M  : int  950 1012 969 994 985 964 982 969 955 1029 ...
 $ N  : int  33 13 18 157 18 25 4 50 39 7 ...
 $ NW : int  301 102 219 80 30 44 139 179 286 15 ...
 $ U1 : int  108 96 94 102 91 84 97 79 81 100 ...
 $ U2 : int  41 36 33 39 20 29 38 35 28 24 ...
 $ W  : int  394 557 318 673 578 689 620 472 421 526 ...
 $ X  : int  261 194 250 167 174 126 168 206 239 174 ...
> set.seed(12345)
> Train_Test <- sample(c("Train","Test"),nrow(usc),replace = TRUE,prob=c(0.7,0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> usc_Train <- usc[Train_Test=="Train",]
> usc_TestX <- within(usc[Train_Test=="Test",],rm(R))
> usc_TestY <- usc[Train_Test=="Test","R"]
> usc_Formula <- as.formula("R~.")

In each example discussed in this section thus far, we had a reason to believe that the observations are independent of each other. This assumption simply means that the regressands and regressors of one observation have no relationship with other observations' regressands and regressors. This is a simple and reasonable assumption. We have another class of observations/datasets where such an assumption is not practical. For example, the maximum temperature on a given day is not completely independent of the previous day's temperature. If that were the case, we could have a scorchingly hot day, followed by a wintry one, followed by another hot day, which in turn is followed by a very heavy rainy day. However, weather does not behave in this way; on successive days, the weather depends on previous days. In the next example, we consider the number of overseas visitors to New Zealand.

Overseas visitors

The New Zealand overseas visitors dataset is dealt with in detail in Chapter 10 of Tattar, et al. (2017). Here, the number of overseas visitors is captured on a monthly basis from January 1977 to December 1995, giving us visitor data for 228 months. The osvisit.dat file is available at multiple web links, including https://www.stat.auckland.ac.nz/~ihaka/courses/726-/osvisit.dat and https://github.com/AtefOuni/ts/blob/master/Data/osvisit.dat. It is also available in the book's code bundle. We will import the data into R, convert it into a time series object, and visualize it:

> osvisit <- read.csv("../Data/osvisit.dat", header= FALSE)
> osv <- ts(osvisit$V1, start = 1977, frequency = 12)
> class(osv)
[1] "ts"
> plot.ts(osv)

Figure 1: New Zealand overseas visitors

Here, the dataset is not partitioned! Time series data can't be arbitrarily partitioned into training and testing parts. The reason is quite simple: if we have five observations in time-sequential order, y1, y2, y3, y4, y5, and we believe that the order of impact is y1→y2→y3→y4→y5, then an arbitrary subset such as y1, y2, y5 will behave differently; it won't carry the same information as three consecutive observations. Consequently, the partitioning of a time series has to preserve the dependency structure: we keep the most recent part of the series as the test data. For the five-observation example, we would keep y1, y2, y3 as the training data and hold out y4, y5 as the test data. The partitioning is simple, and we will cover this in Chapter 11, Ensembling Time Series Models.
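
As a hedged sketch of such a time-ordered split (the exact partition used in Chapter 11 may differ), the window function can hold back the most recent years of the osv series created above as test data:

# Keep the last two years (24 months) of the monthly series as the test part
osv_train <- window(osv, end = c(1993, 12))
osv_test  <- window(osv, start = c(1994, 1))
c(length(osv_train), length(osv_test))   # 204 training and 24 test observations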

Live testing experiments rarely yield complete observations. In reliability analysis, as well as survival analysis/clinical trials, the units/patients are observed up to a predefined time and a note is made regarding whether a specific event occurs, which is usually failure or death. A considerable fraction of observations would not have failed by the pre-decided time, and the analysis cannot wait for all units to fail. A reason to curtail the study might be that the time by which all units would have failed would be very large, and it would be expensive to continue the study until such a time. Consequently, we are left with incomplete observations; we only know that the lifetime of the units lasts for at least the predefined time before the study was called off, and the event of interest may occur sometime in the future. Consequently, some observations are censored and the data is referred to as censored data. Special statistical methods are required for the analysis of such datasets. We will give an example of these types of datasets next, and analyze them later, in Chapter 10, Ensembling Survival Models.

Primary Biliary Cirrhosis

The pbc dataset from the survival package is a benchmark dataset in the domain of clinical trials. Mayo Clinic collected the data, which is concerned with primary biliary cirrhosis (PBC) of the liver. The study was conducted between 1974 and 1984. More details can be found by running library(survival), followed by ?pbc, on the R terminal. Here, the main time to the event of interest is the number of days between registration and either death, transplantation, or study analysis in July 1986, and this is captured in the time variable. As is typical in a survival study, the event times might be censored, and the indicator is in the status column. The time to event needs to be understood, factoring in variables such as trt, age, sex, ascites, hepato, spiders, edema, bili, chol, albumin, copper, alk.phos, ast, trig, platelet, protime, and stage.
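
A first look at the dataset can be obtained with the following lines; the Kaplan-Meier fit by treatment arm is only an illustrative sketch, and the full ensemble analysis is deferred to Chapter 10:

library(survival)
data(pbc)
str(pbc[, c("time", "status", "trt", "age", "sex", "bili", "albumin")])
# status == 2 denotes death; 0 and 1 denote censoring and transplantation
pbc_km <- survfit(Surv(time, status == 2) ~ trt, data = pbc)
plot(pbc_km, lty = 1:2, xlab = "Days since registration",
     ylab = "Survival probability")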

The eight datasets discussed up until this point have a target variable, or a regressand/dependent variable, and are examples of the supervised learning problem. On the other hand, there are practical cases in which we simply attempt to understand the data and find useful patterns and groups/clusters in it. Of course, it is important to note that the purpose of clustering is to find homogeneous groups of observations and give them sensible labels. For instance, if we are trying to group cars based on characteristics such as length, width, horsepower, engine cubic capacity, and so on, we may find groups that might be labeled as hatch, sedan, and saloon classes, while another clustering solution might result in labels of basic, premium, and sports variant groups. The two main problems posed in clustering are the choice of the number of groups and the formation of robust clusters. We consider a simple dataset from the factoextra R package.

Multishapes

The multishapes dataset from the factoextra package consists of three variables: x, y, and shape. It consists of different shapes, with each shape forming a cluster: two concentric circular shapes, two parallel rectangles/beds, and one cluster of points at the bottom-right. Outliers are also scattered across the plot. A few lines of R code give a useful display:

> library(factoextra)
> data("multishapes")
> names(multishapes)
[1] "x"     "y"     "shape"
> table(multishapes$shape)
  1   2   3   4   5   6 
400 400 100 100  50  50 
> plot(multishapes[,1],multishapes[,2],col=multishapes[,3])

Figure 2: Finding shapes or groups

This dataset includes a column named shape, as it is a hypothetical dataset. In true clustering problems, we will have neither a cluster group indicator nor the visualization luxury of only two variables. Later in this book, we will see how ensemble clustering techniques help overcome the problems of deciding the number of clusters and the consistency of cluster membership.

Although it doesn't happen that often, frustration can arise when fine-tuning different parameters, fitting different models, and other tricks all fail to produce a useful working model. The culprit is often an outlier. A single outlier is known to wreak havoc on an otherwise potentially useful model, and its detection is of paramount importance. Until now, parametric and nonparametric outlier detection has been a matter of deep expertise, and in complex scenarios the identification can be an insurmountable task. A consensus on an observation being an outlier can be achieved using the ensemble outlier framework. To illustrate this, the board stiffness dataset will be considered; we will see how an outlier is pinned down in the conclusion of this book.

Board Stiffness

The board stiffness dataset is available in the ACSWR package as the data.frame stiff. The dataset consists of four measures of stiffness for 30 boards. The first measure of stiffness is obtained by sending a shock wave down the board, the second is obtained by vibrating the board, and the remaining two are obtained from static tests. A quick method for identifying outliers in a multivariate dataset is the Mahalanobis distance function: the further an observation is from the center, the more likely it is to be an outlier:

> data(stiff)
> sort(mahalanobis(stiff,colMeans(stiff),cov(stiff)),decreasing = TRUE)
 [1] 16.8474070168 12.2647549939  9.8980384087  7.6166439053
 [5]  6.2837628235  5.4770195915  5.2076098038  5.0557446013
 [9]  4.9883497928  4.5767867224  3.9900602512  3.5018290410
[13]  3.3979804418  2.9951752177  2.6959023813  2.5838186338
[17]  2.5385575365  2.3816049840  2.2191408683  1.9307771418
[21]  1.4876569689  1.4649908273  1.3980776252  1.3632123553
[25]  1.0792484215  0.7962095966  0.7665399704  0.6000128595
[29]  0.4635158597  0.1295713581
 

Statistical/machine learning models


The previous section introduced a host of problems through real datasets, and we will now discuss some standard model variants that are useful for dealing with such problems. First, we set up the required mathematical framework.

Suppose that we have n independent pairs of observations, $(Y_i, x_i), i = 1, 2, \dots, n$, where $Y_i$ denotes the random variable of interest, also known as the dependent variable, regressand, endogenous variable, and so on. $x_i$ is the associated vector of explanatory variables, or independent/exogenous variables. The explanatory vector will consist of k elements, that is, $x_i = (x_{i1}, x_{i2}, \dots, x_{ik})$. The data realized is of the form $(y_i, x_i), i = 1, 2, \dots, n$, where $y_i$ is the realized value (data) of the random variable $Y_i$. A convention adopted throughout the book is that $x_{i1} = 1$, and this will take care of the intercept term. We assume that the observations are from the true distribution F, which is not completely known. The general regression model, including the classification model as well as the regression model, is specified by:

$$Y_i = f(x_i, \beta) + \epsilon_i$$

Here, f is an unknown function, and $\beta$ is the regression parameter, which captures the influence of $x_i$ on $Y_i$. The error $\epsilon_i$ is the associated unobservable error term. Diverse methods can be applied to model the relationship between the Ys and the xes. The statistical regression model focuses on the complete specification of the error distribution $\epsilon$, and in general the functional form would be linear, as in $f(x_i, \beta) = g(x_i'\beta)$. The function $g$ is the link function in the class of generalized linear models. Nonparametric and semiparametric regression models are more flexible, as we don't place a restriction on the error's probability distribution. This flexibility comes with a price, though: we need a much higher number of observations to make a valid inference, although that number is unspecified and is often subjective.

The machine learning paradigm includes some black box methods, and we have a healthy overlap between this paradigm and non- and semi-parametric models. The reader is also cautioned that black box does not mean unscientific in any sense. The methods have a firm mathematical foundation and are reproducible every time. Next, we quickly review some of the most important statistical and machine learning models, and illustrate them through the datasets discussed earlier.

Logistic regression model

The logistic regression model is a binary classification model, and it is a member of the exponential family, belonging to the class of generalized linear models. Now, let $Y \in \{0, 1\}$ denote the binary label:

$$Y = \begin{cases} 1, & \text{if the event of interest occurs,} \\ 0, & \text{otherwise.} \end{cases}$$

Using the information contained in the explanatory vector $x$, we are trying to build a model that will help in this task. The logistic regression model is the following:

$$\pi(x) = P(Y = 1 \mid x) = \frac{e^{\beta' x}}{1 + e^{\beta' x}}$$

Here, $\beta$ is the vector of regression coefficients. Note that the logit function, $\log\left(\pi(x)/(1 - \pi(x))\right) = \beta' x$, is linear in the regression coefficients, and hence the name for the model is the logistic regression model. A logistic regression model can be equivalently written as follows:

$$Y = \pi(x) + \epsilon$$

Here, $\epsilon$ is the binary error term that follows a Bernoulli distribution. For more information, refer to Chapter 17 of Tattar, et al. (2016). The estimation of the parameters of the logistic regression model requires the iterative reweighted least squares (IRLS) algorithm, and we will use the glm R function to get this task done. We will use the Hypothyroid dataset in this section. In the previous section, the training and test datasets and formulas were already created, and we will carry on from that point.

Logistic regression for hypothyroid classification

For the hypothyroid dataset, we had HT2_Train as the training dataset. The test dataset is split as the covariate matrix in HT2_TestX and the outputs of the test dataset in HT2_TestY, while the formula for the logistic regression model is available in HT2_Formula. First, the logistic regression model is fitted to the training dataset using the glm function and the fitted model is christened LR_fit, and then we inspect it for model fit summaries using summary(LR_fit). The fitted model is then applied to the covariate data in the test part using the predict function to create LR_Predict. The predicted probabilities are then labeled in LR_Predict_Bin, and these labels are compared with the actual testY_numeric and overall accuracy is obtained:

> ntr <- nrow(HT2_Train) # Training size
> nte <- nrow(HT2_TestX) # Test size
> p <- ncol(HT2_TestX)
> testY_numeric <- as.numeric(HT2_TestY)
> LR_fit <- glm(HT2_Formula,data=HT2_Train,family = binomial())
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> summary(LR_fit)
Call:
glm(formula = HT2_Formula, family = binomial(), data = HT2_Train)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.6390   0.0076   0.0409   0.1068   3.5127  
Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.302025   2.365804  -3.509 0.000449 ***
Age         -0.024422   0.012145  -2.011 0.044334 *  
GenderMALE  -0.195656   0.464353  -0.421 0.673498    
TSH         -0.008457   0.007530  -1.123 0.261384    
T3           0.480986   0.347525   1.384 0.166348    
TT4         -0.089122   0.028401  -3.138 0.001701 ** 
T4U          3.932253   1.801588   2.183 0.029061 *  
FTI          0.197196   0.035123   5.614 1.97e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 609.00  on 1363  degrees of freedom
Residual deviance: 181.42  on 1356  degrees of freedom
AIC: 197.42
Number of Fisher Scoring iterations: 9
> LR_Predict <- predict(LR_fit,newdata=HT2_TestX,type="response")
> LR_Predict_Bin <- ifelse(LR_Predict>0.5,2,1)
> LR_Accuracy <- sum(LR_Predict_Bin==testY_numeric)/nte
> LR_Accuracy
[1] 0.9732704

It can be seen from the summary of the fitted GLM (the output following the line summary(LR_fit)) that we have four significant variables: Age, TT4, T4U, and FTI. Using the predict function, we apply the fitted model to the unseen test cases in HT2_TestX, compare the predictions with the actuals, and find the accuracy to be 97.33%. Consequently, logistic regression is easily deployed in the R software.
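
Accuracy at a 0.5 cutoff is only one summary of performance. Since the pROC package is already on this chapter's list of required packages, a hedged sketch of computing the area under the ROC curve for the fitted logistic regression follows; it assumes the objects LR_Predict and testY_numeric created above are still in the workspace:

library(pROC)
# ROC curve and AUC for the predicted probabilities on the test data
LR_roc <- roc(response = testY_numeric, predictor = LR_Predict)
auc(LR_roc)
plot(LR_roc, main = "ROC curve for the hypothyroid logistic regression")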

Neural networks

Logistic regression might appear restricted, as it only allows a linear impact of the covariates through the link function. The linearity assumption might not hold, and in most practical cases, we don't have enough information to specify the functional form of the nonlinear relationship. Thus, all we know is that there is most likely an unknown nonlinear relationship. Neural networks are the nonlinear generalization of logistic regression, and they involve two important components: hidden neurons and the learning rate. We will review the structure of neural networks first.

In a neural network, the input variables are considered the first layer of neurons and the output the final and concluding layer of neurons. The structure of a neural network model can be visualized using the R package NeuralNetTools. Suppose that we have three input variables and two hidden layers, and each contains two hidden neurons. Here, we have a neural network with four layers. The next code segment gives a visualization of a neural network's structure with three input variables, two hidden neurons in two hidden layers, and one output variable:

> library(NeuralNetTools) 
> plotnet(rep(0,17),struct=c(3,2,2,1))
> title("A Neural Network with Two Hidden Layers")

We find the R package NeuralNetTools very useful in visualizing the structure of a neural network. Neural networks built using the core R package nnet can also be visualized using the NeuralNetTools::plotnet function. The plotnet function sets up a neural network whose structure consists of three neurons in the first layer, two neurons in each of the second and third layers, and one in the final output layer, through the struct option. The weights along the arcs are set at zero in rep(0,17):

Figure 3: Structure of a neural network

In the previous diagram, we have four layers of the neural network. The first layer consists of B1 (the bias), I1 (X1), I2 (X2), and I3 (X3). The second layer consists of three neurons: B2 (the bias of the first hidden layer), H1, and H2. Note that the bias B2 does not receive any input from the first layer. Next, each hidden neuron receives an overall input from each of the neurons of the previous layer, which are B1, X1, X2, and X3 here. However, H1 and H2 of the first hidden layer will receive different aggregated inputs from B1, X1, X2, and X3, since appropriate weights act on each of the arcs of the network, and it is these weights that form the parameters of the neural network. That is, the input arriving at H1 (of the first hidden layer) would be of the form

$$w_{01} B_1 + w_{11} X_1 + w_{21} X_2 + w_{31} X_3,$$

and the effective arrival is through a transfer function. A transfer function might be an identity function, a sigmoidal function, and so on. Similarly, the input arriving at the second neuron of the first hidden layer is

$$w_{02} B_1 + w_{12} X_1 + w_{22} X_2 + w_{32} X_3.$$

By extension, B2, H1, and H2 (of the first hidden layer) will be the input for the second hidden layer, and B3, H1, and H2 (of the second hidden layer) will be the input for the final output. At each stage of the neural network, we have weights. The weights need to be determined in such a manner that the difference between the predicted output O1 and the true Y1 is as small as possible. Note that logistic regression is a particular case of the neural network, as can be seen by removing all the hidden layers so that the input layer feeds directly into the output neuron. The neural network will now be fitted for the hypothyroid problem.

Neural network for hypothyroid classification

We use the nnet function from the package of the same name to set up the neural network for the hypothyroid classification problem. The formula, training, and test datasets continue as before. The accuracy calculation follows along similar lines to the segment in logistic regression. The fitted neural network is visualized using the plotnet graphical function from the NeuralNetTools package:

> set.seed(12345)
> NN_fit <- nnet(HT2_Formula,data = HT2_Train,size=p,trace=FALSE)
> NN_Predict <- predict(NN_fit,newdata=HT2_TestX,type="class")
> NN_Accuracy <- sum(NN_Predict==HT2_TestY)/nte
> NN_Accuracy
[1] 0.9827044025
> plotnet(NN_fit)
> title("Neural Network for Hypothyroid Classification")

Here, the accuracy is 98.27%, which is an improvement on the logistic regression model. The visual display of the fitted model is given in the following diagram. We have fixed the seed for the random initialization of the neural network parameters at 12345, using set.seed(12345), so that the results are reproducible at the reader's end. This is an interesting case for ensemble modeling. Different initial seeds – which the reader can toy around with – will lead to different accuracies; sometimes you will get an accuracy lower than any of the models considered in this section, and at other times you will get the highest accuracy. Since the seeds are arbitrary, the question of a good seed or a bad seed does not arise, but this leads to the important question of which solution is useful. In this case, a model giving you a higher accuracy does not necessarily mean much:

Figure 4: Neural network for the hypothyroid classification
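
The following sketch illustrates this seed sensitivity by refitting the same network under a few arbitrarily chosen seeds; it reuses the objects created above, and the accuracies obtained on your machine may differ:

library(nnet)
# Refit the same neural network under different arbitrary seeds
for (seed in c(1, 123, 12345, 54321)) {
  set.seed(seed)
  NN_temp <- nnet(HT2_Formula, data = HT2_Train, size = p, trace = FALSE)
  NN_temp_pred <- predict(NN_temp, newdata = HT2_TestX, type = "class")
  cat("Seed:", seed, "Accuracy:", round(sum(NN_temp_pred == HT2_TestY)/nte, 4), "\n")
}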

Naïve Bayes classifier

The naïve Bayes classifier is a simple implementation based on the Bayes formula. It is based on straightforward empirical and conditional probabilities, as evidenced in the actual data. Beyond the assumption that the covariates are conditionally independent of one another given the class label, we don't have any restrictions in using this model.
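
In symbols (a standard statement of the rule rather than one taken from this book), the classifier assigns an observation with covariate values $x_1, \dots, x_k$ to the class c that maximizes the posterior probability under the conditional independence assumption:

$$\hat{Y} = \arg\max_{c} \; P(Y = c) \prod_{j=1}^{k} P(x_j \mid Y = c)$$

The probabilities on the right-hand side are estimated from the training data by simple empirical frequencies (or by a density estimate for continuous covariates).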

Naïve Bayes for hypothyroid classification

A naïve Bayes classifier is fit using the naiveBayes function from the e1071 R package. The prediction and accuracy assessment is carried out using two functions, predict and sum:

> NB_fit <- naiveBayes(HT2_Formula,data=HT2_Train)
> NB_predict <- predict(NB_fit,newdata=HT2_TestX)
Warning message:
In data.matrix(newdata) : NAs introduced by coercion
> NB_Accuracy <- sum(NB_predict==HT2_TestY)/nte
> NB_Accuracy
[1] 0.9732704403

The accuracy of the naïve Bayes classifier is 97.33%, which is the same as the logistic regression model and less than the one provided by the neural network. We remark here that it is only a coincidence that the accuracy of this method and logistic regression is the same.

Decision tree

Decision trees were developed mainly by Breiman and Quinlan, and they have evolved considerably since the 1980s. If the dependent variable is continuous, the decision tree is a regression tree, and if it is a categorical variable, it is a classification tree. Of course, we can have a survival tree as well. Decision trees will be the main beneficiary of the ensemble techniques, as will be seen throughout the book.

Consider the regression tree given in the following diagram. We can see that there are three input variables, X1, X2, and X3, and the output variable is Y. Strictly speaking, a decision tree need not display all the variables used to build it. In this tree structure, the decision tree is conventionally displayed upside down, and we have four terminal nodes. If the condition at the root node is satisfied, we move to the right side of the tree and conclude that the average Y value is 40. If the condition is not satisfied, we move to the left and check the next condition. If this second condition is not satisfied, we move to the left side of the tree and conclude that the average Y value is 100. If it is satisfied, we move to the right side, and then, depending on the level taken by a categorical variable, the average Y value would be 250, or 10 otherwise. This decision tree can also be captured in the form of an equation, as shown after the following figure:

Figure 5: Regression tree
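
The split values belong to Figure 5, so what follows is a generic, hedged form of that equation rather than an exact reproduction: a regression tree with four terminal nodes can be written as a sum of region indicators,

$$\hat{Y} = \sum_{m=1}^{4} c_m \, I\{(X_1, X_2, X_3) \in R_m\},$$

where $R_1, \dots, R_4$ are the disjoint regions of the variable space defined by the splits and $c_1, \dots, c_4$ are the terminal node averages (40, 100, 250, and 10 in the figure).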

The statistician Terry Therneau developed the rpart R package.

Decision tree for hypothyroid classification

Using the rpart function from the rpart package, we build a classification tree for the same formula and partitioned data as before. The constructed tree can be visualized using the plot function, and the variable names are embossed on the tree with the text function; the fitted classification tree is shown in Figure 6, Classification tree for Hypothyroid.

Prediction and accuracy assessment are carried out in a similar way to before:

> CT_fit <- rpart(HT2_Formula,data=HT2_Train)
> plot(CT_fit,uniform=TRUE)
> text(CT_fit)
> CT_predict <- predict(CT_fit,newdata=HT2_TestX,type="class")
> CT_Accuracy <- sum(CT_predict==HT2_TestY)/nte
> CT_Accuracy
[1] 0.9874213836

Figure 6: Classification tree for Hypothyroid

Consequently, the classification tree gives an accuracy of 98.74%, which is the best of the four models considered thus far. Next, we will consider the final model, support vector machines.

Support vector machines

Support vector machines, popularly abbreviated as SVM, are an important class of machine learning technique. Theoretically, an SVM can handle an infinite number of features/covariates and build the appropriate classification or regression SVM.
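
For completeness, the soft-margin formulation below is a standard statement of the SVM optimization problem, not notation taken from this book. With labels $y_i \in \{-1, +1\}$, the classifier solves

$$\min_{w, b, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\left(w'\phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$

where $\phi$ is the (possibly infinite-dimensional) feature map induced by the chosen kernel and C is the cost parameter (the cost argument of the svm function).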

SVM for hypothyroid classification

The svm function from the e1071 package will be useful for building an SVM classifier on the Hypothyroid dataset. Following the usual practice, we have the following output in the R session:

> SVM_fit <- svm(HT2_Formula,data=HT2_Train)
> SVM_predict <- predict(SVM_fit,newdata=HT2_TestX,type="class")
> SVM_Accuracy <- sum(SVM_predict==HT2_TestY)/nte
> SVM_Accuracy
[1] 0.9842767296

The SVM technique gives us an accuracy of 98.43%, which is the second best of the models set up thus far.

In the next section, we will run each of the five classification models for the Waveform, German Credit, Iris, and Pima Indians Diabetes problem datasets.

 

The right model dilemma!


In the previous section, we ran five classification models for the Hypothyroid dataset. Here, the task is to repeat the exercise for four other datasets. It would be a very laborious task to change the code in the appropriate places and repeat the exercise four times over. Thus, to circumvent this problem, we will create a new function referred to as Multiple_Model_Fit. This function will take four arguments: formula, train, testX, and testY. The four arguments have already been set up for each of the five datasets. The function is then set up in a way that generalizes the steps of the previous section for each of the five models.

The function proceeds to create a matrix whose first column consists of the model name, while the second column consists of the accuracy. This matrix is returned as the output of this function:

> Multiple_Model_Fit <- function(formula,train,testX,testY){
+   ntr <- nrow(train) # Training size
+   nte <- nrow(testX) # Test size
+   p <- ncol(testX)
+   testY_numeric <- as.numeric(testY)
+   
+   # Neural Network
+   set.seed(12345)
+   NN_fit <- nnet(formula,data = train,size=p,trace=FALSE)
+   NN_Predict <- predict(NN_fit,newdata=testX,type="class")
+   NN_Accuracy <- sum(NN_Predict==testY)/nte
+   
+   # Logistic Regression
+   LR_fit <- glm(formula,data=train,family = binomial())
+   LR_Predict <- predict(LR_fit,newdata=testX,type="response")
+   LR_Predict_Bin <- ifelse(LR_Predict>0.5,2,1)
+   LR_Accuracy <- sum(LR_Predict_Bin==testY_numeric)/nte
+   
+   # Naive Bayes
+   NB_fit <- naiveBayes(formula,data=train)
+   NB_predict <- predict(NB_fit,newdata=testX)
+   NB_Accuracy <- sum(NB_predict==testY)/nte
+   
+   # Decision Tree
+   CT_fit <- rpart(formula,data=train)
+   CT_predict <- predict(CT_fit,newdata=testX,type="class")
+   CT_Accuracy <- sum(CT_predict==testY)/nte
+   
+   # Support Vector Machine
+   svm_fit <- svm(formula,data=train)
+   svm_predict <- predict(svm_fit,newdata=testX,type="class")
+   svm_Accuracy <- sum(svm_predict==testY)/nte
+   
+   Accu_Mat <- matrix(nrow=5,ncol=2)
+   Accu_Mat[,1] <- c("Neural Network","Logistic Regression","Naive Bayes",
+                 "Decision Tree","Support Vector Machine")
+   Accu_Mat[,2] <- round(c(NN_Accuracy,LR_Accuracy,NB_Accuracy,
+                     CT_Accuracy,svm_Accuracy),4)
+   return(Accu_Mat)
+   
+ }

Multiple_Model_Fit is now applied to the Hypothyroid dataset, and the results can be seen to be in agreement with the previous section:

> Multiple_Model_Fit(formula=HT2_Formula,train=HT2_Train,
+                    testX=HT2_TestX,
+                    testY=HT2_TestY)
     [,1]                     [,2]    
[1,] "Neural Network"         "0.989" 
[2,] "Logistic Regression"    "0.9733"
[3,] "Naive Bayes"            "0.9733"
[4,] "Decision Tree"          "0.9874"
[5,] "Support Vector Machine" "0.9843"

The Multiple_Model_Fit function is then applied to the other four classification datasets:

> Multiple_Model_Fit(formula=Waveform_DF_Formula,train=Waveform_DF_Train,
+                    testX=Waveform_DF_TestX,
+                    testY=Waveform_DF_TestY)
     [,1]                     [,2]    
[1,] "Neural Network"         "0.884" 
[2,] "Logistic Regression"    "0.8873"
[3,] "Naive Bayes"            "0.8601"
[4,] "Decision Tree"          "0.8435"
[5,] "Support Vector Machine" "0.9171"
> Multiple_Model_Fit(formula=GC2_Formula,train=GC2_Train,
+                    testX=GC2_TestX,
+                    testY =GC2_TestY )
     [,1]                     [,2]    
[1,] "Neural Network"         "0.7252"
[2,] "Logistic Regression"    "0.7572"
[3,] "Naive Bayes"            "0.8083"
[4,] "Decision Tree"          "0.7061"
[5,] "Support Vector Machine" "0.754" 
> Multiple_Model_Fit(formula=ir2_Formula,train=ir2_Train,
+                    testX=ir2_TestX,
+                    testY=ir2_TestY)
     [,1]                     [,2]
[1,] "Neural Network"         "1" 
[2,] "Logistic Regression"    "1" 
[3,] "Naive Bayes"            "1" 
[4,] "Decision Tree"          "1" 
[5,] "Support Vector Machine" "1"  
> Multiple_Model_Fit(formula=PID_Formula,train=PimaIndiansDiabetes_Train,
+                    testX=PimaIndiansDiabetes_TestX,
+                    testY=PimaIndiansDiabetes_TestY)
     [,1]                     [,2]    
[1,] "Neural Network"         "0.6732"
[2,] "Logistic Regression"    "0.751" 
[3,] "Naive Bayes"            "0.7821"
[4,] "Decision Tree"          "0.7588"
[5,] "Support Vector Machine" "0.7665"

The results for each of the datasets are summarized in the following table:

Table 1: Accuracy of five models for five datasets

Model                    Hypothyroid  Waveform  German Credit  Iris  Pima Indian Diabetes
Neural Network           0.9890       0.8840    0.7252         1     0.6732
Logistic Regression      0.9733       0.8873    0.7572         1     0.7510
Naive Bayes              0.9733       0.8601    0.8083         1     0.7821
Decision Tree            0.9874       0.8435    0.7061         1     0.7588
Support Vector Machine   0.9843       0.9171    0.7540         1     0.7665

The iris dataset is a straightforward and simplistic problem, and therefore each of the five models gives us 100% accuracy on the test data. This dataset will not be pursued any further.

For each dataset, the highest accuracy identifies the winning model, and the second highest identifies the runner-up.

Here is the modeling dilemma. The naïve Bayes method turns out to be the best for the German Credit and Pima Indian Diabetes datasets, the neural network gives the highest accuracy for the Hypothyroid dataset, and SVM gives the best results for Waveform. The runner-up place is secured twice by logistic regression, once by the decision tree, and once by SVM. Moreover, depending on the initial seed and perhaps the number of hidden neurons, the neural network's ranking can shift, and we also have to consider whether the results would turn out differently for different partitions of the data.

It is in such practical scenarios that we would prefer to have a single approach that ensures reasonably good performance. With the Hypothyroid dataset, the accuracy for each of the models is 97% or higher, and one might not go wrong with any of them. However, for the German Credit and Pima Indian Diabetes problems, the maximum accuracies are only about 81% and 78%, respectively. It would therefore be better if we could make good use of all the models and build a single unified model with increased accuracy.

 

An ensemble purview


The caret R package is central to ensemble machine learning methods. It provides a large framework within which different statistical and machine learning models can be put together to create an ensemble. With the version of the package installed on the author's laptop, caret provides access to the following models:

> library(caret)
> names(getModelInfo())
  [1] "ada"                 "AdaBag"              "AdaBoost.M1" 
  [4] "adaboost"            "amdai"               "ANFIS" 
  [7] "avNNet"              "awnb"                "awtan"        
     
[229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"
[232] "vglmCumulative"      "widekernelpls"       "WM" 
[235] "wsrf"                "xgbLinear"           "xgbTree" 
[238] "xyf"               

Depending on your requirements, you can choose any combination of these 238 models. The authors of the package keep updating this list. Note that not all of these models are implemented within the caret package itself; caret is a platform that provides a unified interface to them and facilitates ensembling these methods. Consequently, if you choose a model such as ANFIS, which is implemented in the frbs R package, and that package is not installed on your machine, then caret will display a message on the terminal as indicated in the following snippet:

Figure 7: Caret providing a message to install the required R package

You need to key in the number 1 and continue. The required package will then be installed and loaded, and the program will continue. It is good to know the host of options available for ensemble methods. A brief illustration of stacking analytical models is provided here, and the details will unfold later in the book.
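Before that, if you would rather check a method's dependencies up front than wait for caret's prompt, the standard caret helpers getModelInfo and modelLookup can be used; the following small sketch uses ANFIS purely as an example:

> # the underlying package a caret method needs (for ANFIS this should be "frbs")
> getModelInfo("ANFIS", regex = FALSE)[[1]]$library
> # the tuning parameters the method exposes
> modelLookup("ANFIS")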

For the Hypothyroid dataset, the five models achieved an average accuracy of about 98%. The Waveform dataset saw an average accuracy of approximately 88%, while the average for the German credit data is about 75%. We will try to improve on the German credit results. The improvement will be attempted using three models: naïve Bayes, logistic regression, and the classification tree. First, we need to partition the data into three parts: train, test, and stack:

> load("../Data/GC2.RData")
> set.seed(12345)
> Train_Test_Stack <- sample(c("Train","Test","Stack"),nrow(GC2),replace = TRUE,prob = c(0.5,0.25,0.25))
> GC2_Train <- GC2[Train_Test_Stack=="Train",]
> GC2_Test <- GC2[Train_Test_Stack=="Test",]
> GC2_Stack <- GC2[Train_Test_Stack=="Stack",]

The dependent and independent variables are stored next in character vectors for programming convenience:

> # set label name and Exhogenous
> Endogenous <- 'good_bad'
> Exhogenous <- names(GC2_Train)[names(GC2_Train) != Endogenous]

The models will be built on the training data first, and performance will be assessed using the area under the ROC curve (AUC). The control parameters are set up first, and the three models, naïve Bayes, classification tree, and logistic regression, are then created using the training dataset:

> # Creating a caret control object for the number of 
> # cross-validations to be performed
> myControl <- trainControl(method='cv', number=3, returnResamp='none')
> # train all the ensemble models with GC2_Train
> model_NB <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                    method='naive_bayes', trControl=myControl)
> model_rpart <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                      method='rpart', trControl=myControl)
> model_glm <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                        method='glm', trControl=myControl)
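Before moving on to the predictions, it can be worth glancing at the cross-validation summaries that caret stores in each fitted object; this is an optional check rather than part of the stacking recipe:

> # each train object keeps its aggregated cross-validation performance
> # (accuracy and kappa per tuning-parameter value) in the results slot
> model_NB$results
> model_rpart$results
> model_glm$results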

Predictions for the test and stack blocks are carried out next. We append the predicted probabilities to the test and stack data frames as new columns:

> # get predictions for each ensemble model for two last datasets
> # and add them back to themselves
> GC2_Test$NB_PROB <- predict(object=model_NB, GC2_Test[,Exhogenous],
+                              type="prob")[,1]
> GC2_Test$rf_PROB <- predict(object=model_rpart, GC2_Test[,Exhogenous],
+                             type="prob")[,1]
> GC2_Test$glm_PROB <- predict(object=model_glm, GC2_Test[,Exhogenous],
+                                  type="prob")[,1]
> GC2_Stack$NB_PROB <- predict(object=model_NB, GC2_Stack[,Exhogenous],
+                               type="prob")[,1]
> GC2_Stack$rf_PROB <- predict(object=model_rpart, GC2_Stack[,Exhogenous],
+                              type="prob")[,1]
> GC2_Stack$glm_PROB <- predict(object=model_glm, GC2_Stack[,Exhogenous],
+                                   type="prob")[,1]

The ROC curve is an important measure for model assessment. The higher the area under the ROC curve, the better the model. Note that these values, like those of any other measure, will not be the same as those of the models fitted earlier, since the data partition has changed:

> # see how each individual model performed on its own
> # (the roc function used below comes from the pROC package)
> library(pROC)
> AUC_NB <- roc(GC2_Test[,Endogenous], GC2_Test$NB_PROB )
> AUC_NB$auc
Area under the curve: 0.7543
> AUC_rf <- roc(GC2_Test[,Endogenous], GC2_Test$rf_PROB )
> AUC_rf$auc
Area under the curve: 0.6777
> AUC_glm <- roc(GC2_Test[,Endogenous], GC2_Test$glm_PROB )
> AUC_glm$auc
Area under the curve: 0.7446

For the test dataset, we can see that the areas under the curve for the naïve Bayes, classification tree, and logistic regression models are 0.7543, 0.6777, and 0.7446, respectively. If putting the predicted values together in some fashion leads to an increase in accuracy, then the purpose of the ensemble technique has been accomplished. We therefore treat the predicted probabilities from the three models, already appended to the stacked data frame, as additional input variables alongside the original covariates. We then build a naïve Bayes model on the stacked data frame; this is an arbitrary choice, and you can try any other model, not necessarily restricted to one of these three. The AUC of the stacked model can then be computed on the test data:

> # Stacking it together
> Exhogenous2 <- names(GC2_Stack)[names(GC2_Stack) != Endogenous]
> Stack_Model <- train(GC2_Stack[,Exhogenous2], GC2_Stack[,Endogenous], 
+                      method='naive_bayes', trControl=myControl)
> Stack_Prediction <- predict(object=Stack_Model,GC2_Test[,Exhogenous2],type="prob")[,1]
> Stack_AUC <- roc(GC2_Test[,Endogenous],Stack_Prediction)
> Stack_AUC$auc
Area under the curve: 0.7631

The AUC for the stacked model, at 0.7631, is higher than that of any of the individual models, which is an improvement.
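Note that Exhogenous2 contains the original covariates as well as the three probability columns. A small variant, sketched below and not part of the original code, stacks on the three predicted probabilities alone; comparing its AUC with the 0.7631 obtained above is left as an exercise:

> # stacking on the three predicted-probability columns only
> Prob_Only <- c("NB_PROB","rf_PROB","glm_PROB")
> Stack_Model_2 <- train(GC2_Stack[,Prob_Only], GC2_Stack[,Endogenous],
+                        method='naive_bayes', trControl=myControl)
> Stack_Prediction_2 <- predict(object=Stack_Model_2, GC2_Test[,Prob_Only],
+                               type="prob")[,1]
> roc(GC2_Test[,Endogenous], Stack_Prediction_2)$auc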

A host of questions should arise for the critical thinker. Why should this technique work? Will it lead to improvements in all possible cases? If so, will simply adding new model predictions lead to further improvements? If not, how does one pick the base models so that we can be reasonably assured of improvement? What are the restrictions on the choice of models? We will provide answers to most of these questions throughout this book. In the next section, we will quickly look at some useful statistical tests that will aid the assessment of model performance.

 

Complementary statistical tests


A fitted model is often selected over another plausible one because its accuracy seems higher, or because the area under the ROC curve (AUC) is greater. However, it is not appropriate to base such conclusions on the raw numbers alone; it is important to establish whether the differences are significant from the point of view of statistical inference. In the analytical world, it is pivotal that we make use of statistical tests, whenever they are available, to validate claims and hypotheses. One reason for using statistical tests is that probability can be highly counterintuitive, and what appears to be true on the surface might not hold up on closer inspection, once chance variation is accounted for. For instance, if a fair coin is tossed 100 times, it is imprudent to expect the number of heads to be exactly 50; hence, if the coin shows 45 heads, we must allow for the chance variation that can push the count below 50. Caution must be exercised whenever we are dealing with uncertain data. A few examples are in order here. Two variables might appear to be independent of each other, with a sample correlation close to zero, and yet a correlation test might conclude that the correlation is significantly different from zero. Since we will be sampling and resampling a lot in this text, we will look at the related tests.
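The coin-tossing and correlation examples mentioned above can be checked directly in R; the following small sketch uses simulated data for the correlation part, so the exact numbers will vary:

> # 45 heads in 100 tosses: the exact binomial test does not reject fairness
> binom.test(45, 100, p = 0.5)$p.value
> # a sample correlation near zero can still be declared significantly
> # non-zero by the correlation test when the sample size is large
> set.seed(123)
> u <- rnorm(1000); v <- 0.1*u + rnorm(1000)
> cor(u, v)
> cor.test(u, v)$p.value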

Permutation test

Suppose that we have two processes, A and B, whose variances are assumed to be equal, although their common value is unknown. Three independent observations from process A yield 18, 20, and 22, while three independent observations from process B yield 24, 26, and 28. Under the assumption that the yield follows a normal distribution, we would like to test whether the means of processes A and B are the same. This is a suitable case for the t-test, since the number of observations is small. An application of the t.test function shows that the two means are different from each other, and this intuitively appears to be the case.

Now, the assumption under the null hypothesis is that the means are equal, while the variance is unknown and assumed to be the same for the two processes. Consequently, we have a genuine reason to believe that the observations from process A might well have occurred in process B, and vice versa. We can therefore swap an observation in process B with one in process A and recompute the t-test. The process can be repeated over all possible rearrangements of the two samples. In general, if we have m samples from population 1 and n samples from population 2, we can form $\binom{m+n}{m}$ different reallocations, and as many tests; for the present example with m = n = 3, this gives $\binom{6}{3} = 20$ possibilities. An overall test can be based on such permutation samples, and such tests are called permutation tests.

For the process A and B observations, we will first apply the t-test and then the permutation test. The t.test function is available in the core stats package, and the permutation test permTS is taken from the perm package:

> library(perm)
> x <- c(18,20,22); y <- c(24,26,28)
> t.test(x,y,var.equal = TRUE)
Two Sample t-test
data:  x and y
t = -3.6742346, df = 4, p-value = 0.02131164
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.533915871  -1.466084129
sample estimates:
mean of x mean of y 
       20        26 

The small p-value of 0.0213 suggests that the means of processes A and B are not equal. We now apply the permutation test permTS from the perm package:

> permTS(x,y)
Exact Permutation Test (network algorithm)
data:  x and y
p-value = 0.1
alternative hypothesis: true mean x - mean y is not equal to 0
sample estimates:
mean x - mean y 
             -6 

The p-value is now 0.1, which means that the permutation test does not reject the hypothesis that the means of the two processes are equal at the usual 5% level. Does this mean that the permutation test will always lead to this conclusion, contradicting the t-test? The answer is given in the next code segment:

> x2 <- c(16,18,20,22); y2 <- c(24,26,28,30)
> t.test(x2,y2,var.equal = TRUE)
Two Sample t-test
data:  x2 and y2
t = -4.3817805, df = 6, p-value = 0.004659215
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -12.46742939  -3.53257061
sample estimates:
mean of x mean of y 
       19        27 
> permTS(x2,y2)
Exact Permutation Test (network algorithm)
data:  x2 and y2
p-value = 0.02857143
alternative hypothesis: true mean x2 - mean y2 is not equal to 0
sample estimates:
mean x2 - mean y2 
               -8 
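With four observations in each group, the permutation p-value of 0.02857 falls below 0.05, so the permutation test now agrees with the t-test that the means differ. To see what permTS computes, the following minimal sketch (an illustration, not part of the perm package) enumerates by hand all possible reallocations of the original three-observation samples; it reproduces the p-value of 0.1 reported above:

> # enumerate all choose(6,3) = 20 ways of assigning the pooled observations to A
> pooled <- c(x, y)
> obs_diff <- mean(x) - mean(y)
> splits <- combn(length(pooled), length(x))
> perm_diffs <- apply(splits, 2, function(idx) mean(pooled[idx]) - mean(pooled[-idx]))
> # two-sided permutation p-value: proportion of splits at least as extreme
> mean(abs(perm_diffs) >= abs(obs_diff))
[1] 0.1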

Chi-square and McNemar test

We had five models for the Hypothyroid dataset. We calculated the accuracy of each and were satisfied with the numbers. Let's now look at the number of errors that each fitted model makes. We have 636 observations in the test partition, and 42 of them test positive for the hypothyroid condition. Note that if we simply marked every patient as negative, we would obtain an accuracy of 1 - 42/636 = 0.934, or about 93.4%. Using the table function, we pit the actual values against the predicted values to see how often each fitted model goes wrong. Identifying a hypothyroid case as hypothyroid, or a negative case as negative, is a correct prediction, while marking a hypothyroid case as negative, or vice versa, is an error. For each model, we look at the misclassification errors:

> table(LR_Predict_Bin,testY_numeric)
              testY_numeric
LR_Predict_Bin   1   2
             1  32   7
             2  10 587
> table(NN_Predict,HT2_TestY)
             HT2_TestY
NN_Predict    hypothyroid negative
  hypothyroid          41       22
  negative              1      572
> table(NB_predict,HT2_TestY)
             HT2_TestY
NB_predict    hypothyroid negative
  hypothyroid          33        8
  negative              9      586
> table(CT_predict,HT2_TestY)
             HT2_TestY
CT_predict    hypothyroid negative
  hypothyroid          38        4
  negative              4      590
> table(SVM_predict,HT2_TestY)
             HT2_TestY
SVM_predict   hypothyroid negative
  hypothyroid          34        2
  negative              8      592

From the misclassification tables, we can see that the neural network identifies 41 of the 42 hypothyroid cases correctly, but it also incorrectly flags 22 negative cases as hypothyroid. The question that arises is whether the correct predictions of the fitted models occur merely by chance, or whether they genuinely reflect the truth. To test this, we frame the hypotheses in terms of whether the actual and predicted values are independent of, or dependent on, each other. Technically, the null hypothesis is that the predictions are independent of the actuals, and if a model explains the truth, the null hypothesis must be rejected; we would then conclude that the fitted model's predictions depend on the truth. We deploy two tests here: the chi-square test and the McNemar test:

> chisq.test(table(LR_Predict_Bin,testY_numeric))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(LR_Predict_Bin, testY_numeric)
X-squared = 370.53501, df = 1, p-value < 0.00000000000000022204
> chisq.test(table(NN_Predict,HT2_TestY))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(NN_Predict, HT2_TestY)
X-squared = 377.22569, df = 1, p-value < 0.00000000000000022204
> chisq.test(table(NB_predict,HT2_TestY))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(NB_predict, HT2_TestY)
X-squared = 375.18659, df = 1, p-value < 0.00000000000000022204
> chisq.test(table(CT_predict,HT2_TestY))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(CT_predict, HT2_TestY)
X-squared = 498.44791, df = 1, p-value < 0.00000000000000022204
> chisq.test(table(SVM_predict,HT2_TestY))
Pearson's Chi-squared test with Yates' continuity correction
data:  table(SVM_predict, HT2_TestY)
X-squared = 462.41803, df = 1, p-value < 0.00000000000000022204
> mcnemar.test(table(LR_Predict_Bin,testY_numeric))
McNemar's Chi-squared test with continuity correction
data:  table(LR_Predict_Bin, testY_numeric)
McNemar's chi-squared = 0.23529412, df = 1, p-value = 0.6276258
> mcnemar.test(table(NN_Predict,HT2_TestY))
McNemar's Chi-squared test with continuity correction
data:  table(NN_Predict, HT2_TestY)
McNemar's chi-squared = 17.391304, df = 1, p-value = 0.00003042146
> mcnemar.test(table(NB_predict,HT2_TestY))
McNemar's Chi-squared test with continuity correction
data:  table(NB_predict, HT2_TestY)
McNemar's chi-squared = 0, df = 1, p-value = 1
> mcnemar.test(table(CT_predict,HT2_TestY))
McNemar's Chi-squared test
data:  table(CT_predict, HT2_TestY)
McNemar's chi-squared = 0, df = 1, p-value = 1
> mcnemar.test(table(SVM_predict,HT2_TestY))
McNemar's Chi-squared test with continuity correction
data:  table(SVM_predict, HT2_TestY)
McNemar's chi-squared = 2.5, df = 1, p-value = 0.1138463

The chi-square tests clearly show that the predictions of each fitted model are not down to chance; the predictions of the hypothyroid as well as the negative cases are related to the actual outcomes. The final important measure in classification problems is the ROC curve, which is considered in the next section. The interpretation of, and conclusions from, the McNemar test are left to the reader.
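As an aid to that interpretation, note that McNemar's statistic depends only on the discordant, off-diagonal cells of the table. The following minimal sketch, using the off-diagonal counts 22 and 1 of the neural network's table above, reproduces the reported chi-squared value of 17.391304:

> # McNemar's continuity-corrected statistic: (|n12 - n21| - 1)^2/(n12 + n21)
> n12 <- 22; n21 <- 1
> (abs(n12 - n21) - 1)^2/(n12 + n21)
[1] 17.3913
> # the corresponding p-value from the chi-squared distribution with 1 df
> pchisq((abs(n12 - n21) - 1)^2/(n12 + n21), df = 1, lower.tail = FALSE)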

ROC test

The ROC curve is an important improvement over single summaries of model performance such as the false positive and true positive rates. For a detailed explanation, refer to Chapter 9 of Tattar et al. (2017). The ROC curve plots the true positive rate against the false positive rate, and we measure the area under the curve (AUC) for the fitted model.

The main goal of the ROC test is the following. Suppose that Model 1 gives an AUC of 0.89 and Model 2 gives 0.91. Using the simple AUC criterion, we would conclude outright that Model 2 is better than Model 1. However, an important question is whether 0.91 is significantly higher than 0.89. The roc.test function, from the pROC R package, provides the answer. For the neural network and classification tree, the following R segment gives the required answer:

> library(pROC)
> HT_NN_Prob <- predict(NN_fit,newdata=HT2_TestX,type="raw")
> HT_NN_roc <- roc(HT2_TestY,c(HT_NN_Prob))
> HT_NN_roc$auc
Area under the curve: 0.9723826
> HT_CT_Prob <- predict(CT_fit,newdata=HT2_TestX,type="prob")[,2]
> HT_CT_roc <- roc(HT2_TestY,HT_CT_Prob)
> HT_CT_roc$auc
Area under the curve: 0.9598765
> roc.test(HT_NN_roc,HT_CT_roc)
	DeLong's test for two correlated ROC curves
data:  HT_NN_roc and HT_CT_roc
Z = 0.72452214, p-value = 0.4687452
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
 AUC of roc1  AUC of roc2 
0.9723825557 0.9598765432 

Since the p-value is very large, we conclude that the AUCs of the two models are not significantly different.
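For readers who want to go beyond the AUC comparison, pROC's coords function can report individual operating points on the fitted curves. The following is a small optional sketch, not part of the original analysis:

> # sensitivity and specificity at the threshold maximizing Youden's index
> coords(HT_NN_roc, "best", ret = c("threshold","sensitivity","specificity"))
> coords(HT_CT_roc, "best", ret = c("threshold","sensitivity","specificity"))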

Statistical tests are vital and we recommend that they be used whenever suitable. The concepts highlighted in this chapter will be drawn on in more detail in the rest of the book.

 

Summary


The chapter began with an introduction to some of the most important datasets that will be used in the rest of the book. The datasets covered a range of analytical problems including classification, regression, time series, survival, clustering, and a dataset in which identifying an outlier is important. Important families of classification models were then introduced in the statistical/machine learning models section. Following the introduction of a variety of models, we immediately saw the shortcoming, in that we don't have a model for all seasons. Model performance varies from dataset to dataset. Depending on the initialization, the performance of certain models (such as neural networks) is affected. Consequently, there is a need to find a way to ensure that the models can be improved upon in most scenarios.

This paves the way for the ensemble method, which forms the title of this book. We will elaborate on this method in the rest of the book. This chapter closed with quick statistical tests that will help in carrying out model comparisons. Resampling forms the core of ensemble methods, and we will look at the important jackknife and bootstrap methods in the next chapter.

About the Author
  • Prabhanjan Narayanachar Tattar

    Prabhanjan Narayanachar Tattar is a lead statistician and manager at the Global Data Insights & Analytics division of Ford Motor Company, Chennai. He received the IBS(IR)-GK Shukla Young Biometrician Award (2005) and Dr. U.S. Nair Award for Young Statistician (2007). He held SRF of CSIR-UGC during his PhD. He has authored books such as Statistical Application Development with R and Python, 2nd Edition, Packt; Practical Data Science Cookbook, 2nd Edition, Packt; and A Course in Statistics with R, Wiley. He has created many R packages.
