
# Supervised learning for classification

Like clustering, classification also categorizes data instances, but in this case, the categories are known in advance and are termed class labels. Classification thus aims at identifying the category that a new data point belongs to, using a dataset where the class labels are known to find the pattern. It is an instance of supervised learning, where the learning algorithm takes a known set of input data with the corresponding responses, called class labels, and builds a predictive model that generates reasonable predictions of the class labels for unknown data. To illustrate, let's imagine that we have gene expression data from cancer patients as well as healthy individuals. The gene expression pattern in these samples can define whether the patient has cancer or not. In this case, if we have a set of samples for which we know the type of tumor, the data can be used to learn a model that can identify the type of tumor; in simple terms, it is a predictive function used to determine the tumor type. Later, this model can be applied to predict the type of tumor in unknown cases.

There are some do's and don'ts to keep in mind while learning a classifier. You need to make sure that you have enough data to learn the model; learning from a small dataset will not allow the model to capture the pattern in an unbiased manner, and you will end up with inaccurate classification. Furthermore, the preprocessing steps (such as normalization) for the training and test data should be the same. Another important thing to take care of is to keep the training and test data distinct; learning on the entire data and then using a part of this data for testing leads to a phenomenon called overfitting. It is always recommended that you take a look at the data manually and understand the question that you need to answer via your classifier.
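The point about keeping the training and test data distinct can be sketched in a few lines of base R (using the built-in iris data here purely for illustration, not the tumor dataset used later):

```r
set.seed(42)                        # fix the seed so the split is reproducible
data(iris)                          # 150 labelled data points, 3 classes

# sample roughly two thirds of the row indices for training
train_idx <- sample(1:nrow(iris), 100)

train <- iris[train_idx, ]          # training set: the sampled rows
test  <- iris[-train_idx, ]         # test set: all remaining rows

nrow(train) + nrow(test)                  # together they cover all 150 rows
any(rownames(train) %in% rownames(test))  # FALSE: no overlap between the sets
```

Evaluating the model only on rows it has never seen is what guards against overfitting.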

There are several methods of classification. In this recipe, we will talk about some of these methods. We will discuss **linear discriminant analysis (LDA), decision tree (DT)**, and **support vector machine (SVM)**.

## Getting ready

To perform the classification task, we need two preparations: first, a dataset with known class labels (the training set), and second, the data that the classifier has to be tested on (the test set). Besides this, we will use some R packages, which will be discussed when required. As a dataset, we will use expression values of approximately 2,300 genes from tumor cells. The data has 83 data points belonging to four different types of tumors, which will serve as our class labels. We will use 60 of the data points for training and the remaining 23 for testing. To find out more about the dataset, refer to the *Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks* article by Khan and others (http://research.nhgri.nih.gov/microarray/Supplement/). The set has been precompiled in a format that is readily usable in R and is available on the book's web page (code files) under the name *cancer.rda*.

## How to do it…

To classify data points based on their features, perform the following steps:

- First, load the MASS library as it has some of the classification functions:

    **> library(MASS)**

- Now, you need your data to learn and test the classifiers. Load the data from the code files available on the book's web page (*cancer.rda*) as follows:

    **> load("path/to/code/directory/cancer.rda") # located in the code file directory for the chapter, assign the path accordingly**

- Randomly sample 60 data points for the training set and keep the remaining 23 for the test set as follows; ensure that these two datasets do not overlap and are not biased towards any specific tumor type (random sampling):

    **> train_row <- sample(1:nrow(mldata), 60) # sample indexes for the training set**
    **> train <- mldata[train_row,] # use sampled indexes to extract training data**
    **> test <- mldata[-train_row,] # test set is selected by taking all the other data points**

- For the training data, retain the class labels, which are in the tumor column here, and remove this information from the test data. However, store it for comparison purposes:

    **> testClass <- test$tumor**
    **> test$tumor <- NULL**

- Now, try the linear discriminant analysis classifier, as follows, to get the classifier model:

    **> myLD <- lda(tumor ~ ., train) # might issue a warning**

- Test this classifier to predict the labels on your test set, as follows:

    **> testRes_lda <- predict(myLD, test)**

- To check the number of correct and incorrect predictions, simply compare the predicted classes with the *testClass* object, which was created in step 4, as follows:

    **> sum(testRes_lda$class == testClass) # correct predictions**
    **[1] 19**
    **> sum(testRes_lda$class != testClass) # incorrect predictions**
    **[1] 4**

- Now, try another simple classifier called DT. For this, you need the rpart package:

    **> library(rpart)**

- Create the decision tree based on your training data, as follows:

    **> myDT <- rpart(tumor ~ ., data = train, control = rpart.control(minsplit = 10))**

- Plot your tree by typing the following commands, as shown in the next diagram:

    **> plot(myDT)**
    **> text(myDT, use.n = T)**

    The following screenshot shows the cutoff for each feature (represented by the branches) used to differentiate between the classes:

    The tree for DT-based learning

- Now, test the decision tree classifier on your test data using the following prediction function:

    **> testRes_dt <- predict(myDT, newdata = test)**

- Take a look at the class that each data instance is placed in by the trained classifier, as follows (1 if predicted in the class, else 0):

    **> classes <- round(testRes_dt)**
    **> head(classes)**
    **   BL EW NB RM**
    **4   0  0  0  1**
    **10  0  0  0  1**
    **15  1  0  0  0**
    **16  0  0  1  0**
    **18  0  1  0  0**
    **21  0  1  0  0**

- Finally, you'll work with SVMs. To be able to use them, you need another R package named e1071:

    **> library(e1071)**

- Create the SVM classifier from the training data as follows:

    **> mySVM <- svm(tumor ~ ., data = train)**

- Then, use the learned model (the mySVM object) to predict for the test data. You will see the predicted labels for each instance as follows:

    **> testRes_svm <- predict(mySVM, test)**
    **> testRes_svm**

## How it works…

We started our recipe by loading the input data on tumors. The supervised learning methods shown in this recipe use two datasets: the training set and the test set. The training set carries the class label information. In the first part of most of these methods, the training set is used to identify a pattern and model it so as to find a distinction between the classes. This model is then applied to the test set, which does not carry class label data, to predict the class labels. To obtain the training and test sets, we first randomly sample 60 indexes out of the entire data and use the remaining 23 data points for testing purposes.

Each of the supervised learning methods explained in this recipe follows a different principle. LDA attempts to model the difference between classes based on a linear combination of their features. This combination function forms the model based on the training set and is used to predict the classes in the test set. The LDA model trained on 60 samples is thus used to predict for the remaining 23 cases.
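A minimal, self-contained version of this train/predict cycle can be run on the built-in iris data (an assumption made here so that the example works without the tumor dataset; the numbers are purely illustrative):

```r
library(MASS)                           # lda() ships with R in the MASS package

set.seed(1)
train_row <- sample(1:nrow(iris), 100)  # 100 points for training
train <- iris[train_row, ]
test  <- iris[-train_row, ]

# model the class as a linear combination of the four features
myLD <- lda(Species ~ ., data = train)

pred <- predict(myLD, test)$class       # predicted labels for the 50 test rows
mean(pred == test$Species)              # fraction of correct predictions
```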

DT is, however, a different method. It recursively partitions the data to build a tree whose branches encode a set of rules that distinguish one class from the other. The tree learned on a training set is then applied to predict classes in test sets or other similar datasets.
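The same idea can be tried stand-alone with the rpart package (again on iris, as an assumption, since the tumor data is not bundled here):

```r
library(rpart)                        # recursive partitioning, ships with R

set.seed(2)
idx <- sample(1:nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# grow a classification tree; each branch is a cut-off on one feature
myDT <- rpart(Species ~ ., data = train,
              control = rpart.control(minsplit = 10))

pred <- predict(myDT, newdata = test, type = "class")
mean(pred == test$Species)            # fraction of correct predictions
```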

SVM is a relatively complex classification technique. It aims to create one or more hyperplanes in the feature space such that the data points become separable along these planes. The hyperplanes are fitted on a training set and are then used to assign classes to new data points. In general, LDA distinguishes the data with a linear combination of features, whereas SVM separates it with hyperplanes in a high-dimensional feature space. In this recipe, we used the svm functionality from the *e1071* package, which has many other utilities for learning.

We can compare the results obtained by the models we used in this recipe (they can be computed using the provided code on the book's web page).
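With predicted and true label vectors in hand, such a comparison boils down to counting agreements; here is a sketch with invented labels (the real vectors come from the steps above):

```r
# invented true labels and predictions from two classifiers on 6 test points
truth    <- c("BL", "EW", "NB", "RM", "EW", "NB")
pred_lda <- c("BL", "EW", "NB", "RM", "NB", "NB")
pred_svm <- c("BL", "EW", "NB", "RM", "EW", "NB")

# accuracy: fraction of test points whose predicted label matches the truth
accuracy <- function(pred, truth) sum(pred == truth) / length(truth)

accuracy(pred_lda, truth)   # 5 of 6 correct
accuracy(pred_svm, truth)   # 6 of 6 correct
```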

## There's more...

One of the most popular classifier tools in the machine learning community is WEKA. It is a Java-based tool that implements many libraries to perform classification tasks using DT, LDA, Random Forest, and so on. R supports an interface to WEKA through a library named RWeka, which is available on the CRAN repository at http://cran.r-project.org/web/packages/RWeka/.

RWeka relies on a separate package, RWekajars, to provide the Java libraries that implement the different classifiers.

## See also

- The Elements of Statistical Learning book by Hastie, Tibshirani, and Friedman at http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf, which provides more information on LDA, DT, and SVM

# Bootstrapping in machine learning

Learning from subsamples is one way to test the robustness of a machine learning algorithm and to improve its accuracy. It is a simple approach for accuracy estimation that also provides the bias or variance of the estimator. One such approach is bootstrapping: the practice of learning and estimating properties from subsamples of the learning data, iteratively improving the performance. This section explains bootstrapping with respect to R programming.
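Before applying the idea to classifiers, the core of bootstrapping can be sketched with base R alone: resample the data, recompute a statistic on each resample, and summarize the spread of the estimates (the data here is simulated for illustration):

```r
set.seed(7)
x <- rnorm(100, mean = 5)      # some observed measurements

# recompute the mean on 1000 resamples drawn with replacement
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

mean(boot_means)               # bootstrap estimate of the mean
sd(boot_means)                 # bootstrap estimate of its standard error
```

Note that the recipe below draws two-third subsamples without replacement, which is strictly a subsampling variant of the same idea.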

## Getting ready

To bootstrap with any function, we need the function (for example, the type of classifier we want to use) and the dataset that we want to perform the bootstrapping on. In this recipe, we will use the iris dataset for this purpose.

## How to do it…

Perform the following steps for bootstrapping using the iris data:

- First, we need our dataset, and as mentioned earlier, we use the iris data for our purpose, as follows:

    **> data(iris)**
    **> myData = iris[c(1:150),]**

- For bootstrapping, create a function that will be bootstrapped across the dataset. The function is an SVM classifier, and it returns the number of **true positives (TP)**, which serves as your performance value, as follows (this measure has been selected for simplicity; the more complete measurements of performance discussed in the upcoming recipes can be used as well):

    **> corPred <- function(data, label, indices){**
    **    train = myData[indices,] # indexes for training data**
    **    test = myData[-indices,] # indexes for test data**
    **    testClass = test[,label] # assigns class labels (species)**
    **    colnames(train)[ncol(train)] = "Class"**
    **    mySVM = svm(Class ~ ., data = train, cost = 100, gamma = 1) # learn model using SVM**
    **    myPred = predict(mySVM, test) # prediction on test set**
    **    TP = sum(myPred == testClass) # calculate true positives**
    **    return(TP)**
    **}**

- Now, write another function to bootstrap with the data, taking the number of bootstraps as input, as follows:

    **> myboot <- function(d, label, iter){**
    **    bootres = c()**
    **    for(i in 1:iter){**
    **        indices = sample(1:nrow(d), floor((2*nrow(d))/3)) # sample indexes**
    **        res = corPred(d, label, indices) # run the corPred function**
    **        bootres = c(bootres, res) # append results**
    **    }**
    **    return(list(BOOT.RES = bootres, BOOT.MEAN = mean(bootres), BOOT.SD = sd(bootres)))**
    **}**

- Then, run your bootstrap function as follows:

    **> res.bs <- myboot(d = myData, label = "Species", iter = 10000)**

    The results consist of the value for each iteration, the mean of all of the bootstraps, and the corresponding standard deviation.

## How it works…

The major part of the recipe lies in the two functions defined in steps 2 and 3. The function in step 2 extracts the training and test sets from the original data and uses them for learning and prediction, respectively. A command has been added to compute the total number of true positives as a performance measure; it is very simple to modify it to return other measurements, such as sensitivity and specificity (this will be discussed in the forthcoming recipes). The second function, in step 3, actually does the bootstraps by sampling the data point indices for the training and test sets used in step 2. It also creates a vector of the performance measure by appending the results from the iterations (the *BOOT.RES* element in the returned list). Thus, the overall value returned is a list that consists of the actual results (the vector of measurements) together with their mean and standard deviation. A higher mean and a lower standard deviation indicate a consistent classifier with more accurate predictions (measured here as TP). The fourth step simply calls the function in step 3 with the appropriate arguments.

# Measuring the performance of classifiers

Measuring the accuracy of the learning algorithms is as important as the algorithms themselves. These measurements of performance define how accurately the data points have been classified or clustered. In this recipe, we will describe the computation of some of these measurements in R.

## Getting ready

To start with, we need some of the results from the previous recipes. We learned about a performance measure for clustering in the Visualizing clusters recipe when we talked about silhouette plots. In this recipe, we compute some of the measurements of performance to assess how good our classifier is. We use binary classification here (with two class labels) for simplicity. The required R packages will be introduced in the next section.

## How to do it…

To measure the performance of classifiers, perform the following steps:

- First, install and load the *caret* library as follows:

    **> install.packages("caret")**
    **> library(caret)**

- Prepare your data and classifier using the following commands (the svm function comes from the e1071 package used in the earlier recipe):

    **> data(iris)**
    **> myData <- iris**
    **> indices <- sample(1:nrow(myData), floor((2*nrow(myData))/3))**
    **> train <- myData[indices,]**
    **> test <- myData[-indices,]**
    **> testClass <- test[,"Species"]**
    **> mySVM <- svm(Species ~ ., data = train, cost = 100, gamma = 1)**

- To compute the performance of a classifier, first get the results from the predictions of the SVM classifier and the actual class labels for the data points, as follows:

    **> myPred <- predict(mySVM, test)**
    **> myLabels <- testClass**

- Then, compute the sensitivity and specificity of your learning method using the following function from the caret package:

    **> myCM <- confusionMatrix(myPred, myLabels)**

- Extract the confusion matrix table (predicted versus actual class labels) from the object created earlier as follows:

    **> CMtable <- myCM$table**

- Compute the sensitivity, specificity, and so on, of the classifier for every class as follows:

    **> myPerf2 <- myCM$byClass**

## How it works...

The working of this recipe is very simple. The confusionMatrix function from the caret package compares the predicted and original class labels and returns an object that contains the cross-tabulation of the two (the table element) together with per-class performance measures such as sensitivity and specificity (the byClass element), as depicted in steps 4, 5, and 6. The *prediction* function from the ROCR package, mentioned in the next section, instead accepts the predicted and original classes as numeric vectors and creates a prediction object from which individual measurements can be extracted with the *performance* function, given the prediction object and the name of the measurement as input arguments.
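For a binary case, the underlying computation is easy to reproduce in base R (the label vectors below are invented for illustration):

```r
# invented true and predicted labels for a binary problem
truth <- factor(c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"))
pred  <- factor(c("pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos"))

cm <- table(Predicted = pred, Actual = truth)  # 2 x 2 confusion matrix

TP <- cm["pos", "pos"]; FN <- cm["neg", "pos"]
TN <- cm["neg", "neg"]; FP <- cm["pos", "neg"]

sensitivity <- TP / (TP + FN)  # true positive rate: 3/4 here
specificity <- TN / (TN + FP)  # true negative rate: 3/4 here
```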

We can also use the *ROCR* package if we have only binary classes. For this, we convert the factor class labels into a numeric type and then use the prediction function from the ROCR package as follows:

**> preds <- prediction(as.numeric(myPred), as.numeric(myLabels))**

# Visualizing an ROC curve in R

ROC curves provide a complete measure of classification performance. They plot the true positive rate (sensitivity) on the y axis against the false positive rate (1-specificity) on the x axis. The visualization of these curves conveys the accuracy (sensitivity and specificity) of the classifier in a single plot. By looking at these plots, you get a clear picture of the area under the curve; the larger the area, the better the performance, and the higher the sensitivity and specificity. In fact, the top-left region of the ROC plot is also termed classifier heaven. Here, we intend to create such visualizations.
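The construction behind such a curve can be sketched in base R: sort the classifier's scores, sweep a threshold across them, and record the true and false positive rates at each step (the scores and labels below are invented for illustration):

```r
# invented scores (higher = more likely positive) and true binary labels
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

# TPR and FPR at every threshold, from strictest to loosest
ord <- order(scores, decreasing = TRUE)
tpr <- cumsum(labels[ord]) / sum(labels)
fpr <- cumsum(1 - labels[ord]) / sum(1 - labels)

# area under the curve by the trapezoidal rule
auc <- sum(diff(c(0, fpr)) * (tpr + c(0, head(tpr, -1)))) / 2
auc  # 0.8125 for this toy data

plot(c(0, fpr), c(0, tpr), type = "l",
     xlab = "FPR (1 - specificity)", ylab = "TPR (sensitivity)")
abline(0, 1, lty = 2)  # the 50 percent AUC diagonal
```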

## Getting ready

We need our classifier results as input for the computation. We simply take the results of the SVM classifier to avoid repetition of such results.

## How to do it…

To plot visualizations of the ROC curve for the methods explained earlier, perform the following steps:

- First, load the *pROC* library as follows:

    **> library(pROC)**

- Then, compute the roc object as follows (here, we do it for the LDA classifier of the tumor data):

    **> roc_lda <- plot.roc(as.numeric(testRes_lda$class), as.numeric(testClass))**

- Plot these values to get the ROC curve as follows:

    **> plot(roc_lda, col = "grey")**

- For comparison purposes, compute the roc object for another classifier (here, we use svm) as follows:

    **> roc_svm <- plot.roc(as.numeric(testRes_svm), as.numeric(testClass))**

- Plot the roc object for svm onto the existing plot as follows:

    **> plot(roc_lda, col = "grey")**
    **> lines(roc_svm, col = "black")**
    **> legend("bottomright", c("svm", "lda"), fill = c("black", "grey"))**

- The pROC package also has a function that can specifically deal with multiple classes; use it, in this case, as follows:

    **> multiclass.roc(as.numeric(testRes_svm), as.numeric(testClass), percent = TRUE)**

## How it works…

The following diagram shows a perfect classification by the SVM classifier. This is an ideal case, with the **area under curve (AUC)** equal to 1; it indicates an accurate prediction for each data point, which rarely happens in real life. The ROC curve was produced with the pROC package, and the learning approach used is svm. The diagonal line is just a representation of 50 percent AUC and does not belong to any classifier.

As stated in the previous recipe, we extract two performance measures, namely the **true positive rate (tpr)** and the **false positive rate (fpr)**, from the prediction object. These measures are used to plot the ROC curve; the plot.roc function of the pROC package does this in a single step.

## There's more...

The approach in this recipe works well for binary classification. However, it can get complicated if we have multiple classes. For this purpose, the *pROC* package provides the option of a multiclass ROC curve, which presents the mean AUC value across all of the classes. The function to be used is *multiclass.roc*. For further details, type the following commands:

**> library(pROC)**
**> ?multiclass.roc**

# Biomarker identification using array data

Until now, we have seen some machine learning methods in R. It will be interesting to look at their application in bioinformatics. In the life sciences, comparing groups of individuals to find significant differences is becoming more and more important. A measurement or substance that indicates such a biological condition can serve as a biomarker.

Let's imagine that we have gene expression data from patients and healthy controls. Genes identified from this expression data can differentiate between diseased and healthy samples. Such genes can serve as indicators, or biomarkers, for a biological state such as a disease. This recipe talks about such an application, looking for biomarkers using some learning techniques and statistical methods.

## Getting ready

We need a few preparations before we go ahead. The first is the dataset where we will look for the biomarker. We also need an R package called BioMark. In this recipe, we will generate the dataset artificially.

## How to do it…

The approach to find biomarkers from a dataset is illustrated as follows:

- Start by installing and loading the *BioMark* library as follows:

    **> install.packages("BioMark")**
    **> library(BioMark)**

- Once you have loaded the library, use the *gen.data* function to simulate a dataset for use. By default, it simulates the data for five biomarkers, but this can be altered with the nbiom argument, as follows:

    **> simdata <- gen.data(ncontrol = 10, nvar = 500, nsimul = 1, group.diff = 1.5)**

- Check the contents of the dataset as follows:

    **> class(simdata)**
    **> str(simdata$X)**

- The real biomarkers in the generated dataset are the first five variables; therefore, you can assign them to an R object for evaluation purposes, as follows:

    **> myrealMarkers <- c(1, 2, 3, 4, 5)**

- Now, use the get.biom function to look for the biomarkers in the dataset. Use the *pls* method for this purpose as follows:

    **> myBiom <- get.biom(X = simdata$X[1:20,,1], Y = simdata$Y, fmethod = "pls", type = "HC")**

- Once you have selected the biomarkers, take a look at the ranked variable indices with the following command:

    **> selection(myBiom)**

- Once you have the results, observe the performance in terms of the ROC curve, as follows, for the top five biomarkers selected from the results:

    **> myROC <- ROC(1/coef(myBiom)$pls[[1]][1:5], myrealMarkers)**
    **> plot(myROC)**

## How it works…

The BioMark library selects biomarkers from a dataset based on statistical tests or on regression-based methods and their corresponding coefficients. The *gen.data* function generates a dataset with a defined number of controls (ncontrol) and, by default, the same number of treated instances (to change this, use the *ntreated* argument). It can generate several datasets at once; we set this number to 1 via the nsimul argument. The result is a list holding the data, the class labels, and a number of true biomarkers (the default is 5). In order to use the dataset, we extract a single dataset by specifying the proper array indices (see the X argument in step 5). The *get.biom* function does the required test or learning to filter out the set of biomarkers in the dataset. It is possible to use different methods for inferring the biomarkers: besides the **partial least squares (PLS)** regression that we use here, the other possible methods include PCR and t-tests. The value returned is an ordered list of indices of the variables (in our case, genes or proteins) that are found to act as biomarkers in the dataset.
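The t-test option mentioned above is easy to emulate in base R: score every variable by how strongly it separates the two groups and rank by p-value (the simulated matrix below is an illustrative assumption, not BioMark's implementation):

```r
set.seed(3)
# simulated expression matrix: 20 samples x 50 genes;
# genes 1 to 3 are shifted upwards in the "treated" group
X <- matrix(rnorm(20 * 50), nrow = 20)
group <- rep(c("control", "treated"), each = 10)
X[group == "treated", 1:3] <- X[group == "treated", 1:3] + 2

# per-gene t-test p-values comparing the two groups
pvals <- apply(X, 2, function(g) t.test(g ~ group)$p.value)

head(order(pvals), 5)  # indices of the strongest candidate biomarkers
```

Under this simulation, the truly shifted genes (1 to 3) should dominate the top of the ranking.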

## There's more...

This recipe uses an existing package (*BioMark*). However, the process can be customized for a more tailored method, such as SVM or other techniques. This will involve a new implementation, which we will not present as a recipe here. Nevertheless, it is worthwhile to look at some of the approaches to do this in the literature.

# Summary

This article discusses recipes related to machine learning in bioinformatics.
