Solutions
You will find the solutions to the exercises below.
Chapter 1 – Setting GNU R for Predictive Modeling
Chapter 1 already contains the exercises and solutions.
Chapter 2 – Visualizing and Manipulating Data Using R
One way to solve this is to first create a vector with a size of 1001
(line 1). The first value is the money that the player comes with (assigned on line 2). On line 3, we assign a copy of the isRed
attribute to the wins
vector. On lines 5 to 7, we compute the amount of money at each trial by adding 1 to the money on the next turn if the drawn number is red, and remove 1 if the number is not red. On line 8, we set the graphic parameters to contain a single plot (using argument mfrow
) and to be of the required dimensions (10 by 4 inches) using the pin
argument. Finally, we generate the plot on line 9:
1 money = rep(0,1001) 2 money[1] = 1000 3 wins = Data$isRed 4 for (i in 1:nrow(Data)) { 5 if (wins[i] == 1) money[i+1] = money[i] + 1 6 else money[i+1] = money[i] - 1 7 } 8 par (mfrow = c(1,1), pin=c(5,2)) 9 plot(money, type = "l", xlab = "Trial number")
Run the code and have a look at your screen to know more about the outcomes of the player's game!
Chapter 3 – Data Visualization with Lattice
We plot these elements as follows:
library(lattice) xyplot(Petal.Length ~ Petal.Width | Species, data = iris, panel = function(x, y, ...) { panel.xyplot(x,y) panel.abline(lm(y~x)) } )
Chapter 4 – Cluster Analysis
The following code performs all the operations:
library(NbClust) NbClust(iris[1:4],distance="euclidean", method= "kmeans") NbClust(iris[1:4],distance="manhattan", method= "kmeans") NbClust(iris[1:4],distance="maximum", method= "kmeans")
The outputs all show that a two-cluster solution is the best according to a majority rule. This is surprising because we know there are three classes in the dataset, and kmeans()
is very accurate at classifying the cases. But apparently, this is not the best way to classify the data. Without prior knowledge, we could have concluded that there are observations for only two species of flowers in the dataset.
Chapter 5 – Agglomerative Clustering Using hclust()
We first create a vector with all the possible distances allowed by the dist()
function; we then create a list, called d
, of one element—the first distance matrix (with the Euclidean distance). We then add the other distance matrices to the list. Please note that the binary distance was included here for demonstration purposes, but the binary distance is of no practical use with quantitative data:
1 dist.types = c("euclidean", "maximum", "manhattan", "canberra", 2 "binary", "minkowski") 3 d = list(dist(iris[1:4], method = dist.types[1])) 4 for (i in 2:length(dist.types)) d = c(d,list(dist(iris[1:4], 5 method = dist.types[i])))
Next, we create a vector with all the possible agglomerating methods allowed by hclust()
. We then perform the analysis six times for each of these methods (once per distance matrix), saving the result of each iteration in a matrix of lists:
1 agglo.types = c("ward.D", "ward.D2", "single", "complete", 2 "average", "mcquitty", "median", "centroid") 3 M = array(list(), 8) 4 for (i in 1:length(agglo.types)) { 5 OneDList = list(hclust(d[[1]], method = agglo.types[i])) 6 for (j in 2:length(dist.types)) OneDList = 7 c(OneDList, list(hclust(d[[j]], method = agglo.types[i]))) 8 M[[i]] = OneDList 9 }
The 48 hclust
models stored in the M
matrix could be used for selecting the best models, for instance, by examining the plotted models. We can plot results with complete linkage and Euclidean distance (graph not included) like this:
plot(M[[4]][[1]])
The reader can notice that the results using binary distance (for example, with complete linkage as follows) are meaningless:
plot(M[[4]][[5]])
Chapter 6 – Dimensionality Reduction with Principal Component Analysis
Here are the solutions. Before anything, let's make sure the psych
package is loaded:
library(psych)
The display of the number of missing values can be done for all items in one line of code:
summary(bfi[,1:25])
We can see that there are between 9 and 36 missing values for each of the items.
We assign the cases without missing values to a new object called
Dat
, retaining only thebfi
measures:Dat = na.omit(bfi[,1:25])
We can know how many cases of the original dataset contain missing values using the following line of code. The answer is
364
:dim(bfi)[1] - dim(Dat)[1]
We examine the results of Bartlett's test of sphericity as follows:
cortest.normal(Dat)
The results show that the data set is significantly different from an identity matrix. We can examine the KMO index using the following code:
KMO(Dat)
The results show that the overall MSA value is large (0.85), as are most individual item values (lowest at 0.76). We deduce from these tests that performing PCA on this data is warranted.
We run the PCA using the following line of code:
pcas = princomp(Dat)
We plot the eigenvalues (the squared standard deviations) like this:
plot(pcas$sd^2)
We find that a five-factor solution seems optimal, which is great as we have five theoretical factors of personality (the questionnaire is designed that way)
We rerun the analysis with five factors and include the scores as follows:
pcas2 = principal(Dat,nfactors=5, rotate="varimax", scores=T)
We add the factorial scores to the dataset as follows:
Dat = cbind(Dat, pcas2$scores)
The output of the following line of code shows that the cumulative proportion of variance is 54 percent (under
Cumulative var
, last column).pcas2
The component loadings also appear on the output (under Standardized loadings). We can see that the component called RC1 is mostly related to items with letter E (extraversion). RC2 loads higher on items with letter N (neuroticism), RC3 with items with letter C (conscience), RC4 with items with letter O (openness). Finally, RC5 has a higher loading with items with letter A (agreeability).
Chapter 7 – Exploring Association Rules with Apriori
You will find the solution to this exercise below.
Here we will start from the beginning, by first obtaining a dataset without an attribute in column 4 (race):
library(vcdExtra) ICU_norace = ICU[-4]
We then load the arules package and generate the transaction list:
library(arules)
We then prepare the dataset:
ICU_norace$age = cut(ICU_norace$age, breaks = 4) ICU_norace$systolic = cut(ICU_norace$systolic, breaks = 4) ICU_norace$hrtrate = cut(ICU_norace$hrtrate, breaks = 4) ICU_norace_tr = as(ICU_norace, "transactions")
Then, we generate the rules:
rulesLowpco = apriori(ICU_norace_tr, parameter = list(confidence = 0.8, support=.1, minlen = 2), appearance = list(lhs = c("pco=<=45"), default="rhs"))
We create a data frame from the rules as follows:
rulesLowpco.df = as(rulesLowpco,"data.frame")
We create the vector of the p value for Fisher's exact test as follows:
IM = interestMeasure(rulesLowpco, "fishersExactTest", ICU_norace_tr)
We then append it to the data frame:
rulesLowpco.df$IM= round(IM, digits = 2)
Finally, we plot the relationship between lift and the p value. We can see that the higher the lift, the lower the p value:
plot (rulesLowpco.df$lift, rulesLowpco.df$IM, xlab = "lift", ylab = "Significance")
Chapter 8 – Probability Distributions, Covariance, and Correlation
The following code will compute the probabilities of each outcome of interest:
1 p = 18/37 2 N = 100 3 n = 1 4 v40_49 = rep(1,10) 5 v51_60 = v40_49 6 i=1 7 for (n in 40:49) { 8 v40_49[i] = choose(N, n) * (p^n) * (1 - p)^(N-n) 9 i = i + 1 10 } 11 i=1 12 for (n in 51:60) { 13 v51_60[i] = choose(N, n) * (p^n) * (1 - p)^(N-n) 14 i = i + 1 15 }
We can display the probability of obtaining between 40 and 49 red numbers drawn by summing the individual probabilities. The result is approximately 53.5 percent:
sum(v40_49)
We can do the same for the probability of obtaining between 51 and 60 red numbers. The result is approximately 34.7 percent:
sum(v51_60)
So it would be better to bet on black, right ? Well, this wouldn't make a difference. Because of the presence of the 0, which isn't black nor red, the probabilities of winning a several bets is strongly lower than the probabilities of losing, as the risk of losing each trial is higher than the chances of winning (19/37 versus 18/37).
The correlation between petal width and petal length can be computed as follows:
cor.test(iris$Petal.Width,iris$Petal.Length)
The correlation between these attributes is around 0.963, and is significant at
p < .001
.
Chapter 9 – Linear Regression
We examine the effect of work-family conflict (attribute
WFC
) on work satisfaction (WorkSat
) in the first model calledmodel01
:model01 = lm(WorkSat ~ WFC, data = nurses)
We create a second model called
model02
, in which you includeWFC
and exhaustion (Exhaus
) as predictors ofWorkSat
:model02 = lm(WorkSat ~ WFC + Exhaus, data = nurses)
We can check the relationship between
WFC
andWorksat
in both models using the following code. We notice that the effect ofWFC
on work satisfaction is significant inmodel01
, but not anymore inmodel02
, that is when exhaustion is included in the model.summary(model01) summary(model02)
Here is the code for computing
model03
withWFC
included as a predictor of exhaustion:model03 = lm(Exhaus ~ WFC, data = nurses)
The summary of
model03
shows thatWFC
is a significant predictor ofExhaus
:summary(model03)
We therefore perform a Sobel test for the mediation of the relationship between
WFC
andWorkSat
byExhaus
as follows:library(bda) mediation.test(nurses$Exhaus, nurses$WFC, nurses$WorkSat)
We can see that the value for the Sobel test is significant, which attests to the presence of the mediation
Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes
The following code will generate the kappas
values for numbers of neighbors ranging from 2 to 10 (the first value is for two neighbors, the second for three):
1 library(psych); library(class) 2 kappas = rep(0,9) 3 for (i in 1:9) { 4 set.seed(2222) 5 pred.train = knn(train = TRAIN[,2:13], test = TRAIN[,2:13], 6 cl = TRAIN$season, k = i+1) 7 tab = table(pred.train,TRAIN[,14]) 8 kappas[i] = cohen.kappa(tab)[1] 9 }
Using the following code, we notice that the two clusters' solution has the highest kappa:
which.max(kappas)
We therefore use this solution to predict our data in the testing set and compute the kappa value, which is only 0.19 (that's pretty bad!):
pred.test = knn(train = TRAIN[,2:13], test = TEST[,2:13], cl = TRAIN$season, k = 2) cohen.kappa(table(pred.test,TEST[,14]))[1]
Chapter 11 – Classification Trees
The C4.5 model is generated as follows:
library(RWeka) IRIS.C45 = J48(Species~ . , data= IRIStrain, control= Weka_control(U=FALSE))
We compute the predictions as follows:
C45.preds = predict(IRIS.C45, IRIStest)
The CART model is generated as follows:
library(rpart) IRIS.CART = rpart(Species ~. , data= IRIStrain)
The following line of code computes a data frame with the predicted probabilities:
CART.probs = as.data.frame(predict(IRIS.CART, IRIStest))
We now obtain the column number in which the predicted probabilities are the highest for each case:
CART.preds = apply(CART.probs, 1, which.max)
Finally, we assign the name of that column to each observation:
for (i in 1:length(CART.preds)) { col= as.numeric(CART.preds[i]) CART.preds[i] = names(CART.probs[col]) }
Let's obtain the confusion matrix for both classifications:
CONF.C45 = table(IRIStest$Species, C45.preds) CONF.CART = table(IRIStest$Species, CART.preds)
We create a function that computes the accuracy of a confusion matrix for us:
acc = function(table) { sum(diag(dim(table)[1])*table) / sum(CONF.C45) }
We can see that, for this dataset, CART has a little advantage in accuracy (0.95 versus 0.92):
acc(CONF.C45) acc(CONF.CART)
Chapter 12 – Multilevel Analyses
We obtain the graph for the relationship between Exhaust
and WorkSat
as follows:
library(lattice) attach(NursesML) xyplot(WorkSat~Exhaust | as.factor(hosp), panel = function(x,y) { panel.xyplot(x,y) panel.lmline(x,y) })
We generate the second graph as follows:
xyplot(WorkSat~Depers | as.factor(hosp), panel = function(x,y) { panel.xyplot(x,y) panel.lmline(x,y) })
We obtain the answer to questions 2 and 3 using the following line of code. The first value is the intercept, the second value is the coefficient for the predicted value (the increase on the actual work satisfaction for an increase of 1 in the predicted value):
coeffs(modelPred)
Chapter 13 – Text Analytics with R
This is easily done as follows. For the original corpus type the following code:
DTM = as.matrix(DocumentTermMatrix(acq)) FR = colSums(DTM) FR[FR>100]
For the preprocessed corpus type the following code:
acq2 = preprocess(acq) DTM2 = as.matrix(DocumentTermMatrix(acq2)) FR2 = colSums(DTM2) FR2[FR2>100]
We obtain the graph as follows:
barplot(sort(FR2[FR2>30]), names.arg = rownames(sort(FR2[FR2>30])))