In this article by **Viswa Viswanathan** and **Shanthi Viswanathan**, the authors of the book *R Data Analysis Cookbook*, we explore several ways of using R: creating comparison charts, building classification trees, working with time series objects, and applying functions to subsets of a vector.

We will cover the following recipes:

- Creating charts that facilitate comparisons
- Building, plotting, and evaluating classification trees
- Using time series objects
- Applying functions to subsets of a vector


# Creating charts that facilitate comparisons

In large datasets, we often gain good insights by examining how different segments behave. The similarities and differences can reveal interesting patterns. This recipe shows how to create graphs that enable such comparisons.

## Getting ready

If you have not already done so, download the code files and save the *daily-bike-rentals.csv* file in your R working directory. Read the data into R using the following command:

```
> bike <- read.csv("daily-bike-rentals.csv")
> bike$season <- factor(bike$season, levels = c(1,2,3,4), labels = c("Spring", "Summer", "Fall", "Winter"))
> attach(bike)
```

## How to do it...

We base this recipe on the task of generating histograms to facilitate the comparison of bike rentals by season.

### Using the base plotting system

We first look at how to generate histograms of the count of daily bike rentals by season using R's base plotting system:

- Set up a 2 X 2 grid for plotting histograms for the four seasons:

```
> par(mfrow = c(2,2))
```

- Extract data for the seasons:

```
> spring <- subset(bike, season == "Spring")$cnt
> summer <- subset(bike, season == "Summer")$cnt
> fall <- subset(bike, season == "Fall")$cnt
> winter <- subset(bike, season == "Winter")$cnt
```

- Plot the histogram and density for each season:

```
> hist(spring, prob=TRUE, xlab = "Spring daily rentals", main = "")
> lines(density(spring))

> hist(summer, prob=TRUE, xlab = "Summer daily rentals", main = "")
> lines(density(summer))

> hist(fall, prob=TRUE, xlab = "Fall daily rentals", main = "")
> lines(density(fall))

> hist(winter, prob=TRUE, xlab = "Winter daily rentals", main = "")
> lines(density(winter))
```

You get the following output that facilitates comparisons across the seasons:
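The same pattern works for any numeric variable split by a factor. As a self-contained sketch using R's built-in mtcars data (a stand-in for the bike rentals file, which you may not have at hand), the loop below draws one histogram-plus-density panel per cylinder count:

```r
# Self-contained analogue of the recipe, using the built-in mtcars data:
# one histogram-plus-density panel per cylinder count
pdf(NULL)                     # draw to a null device so this runs anywhere
par(mfrow = c(1, 3))
for (k in c(4, 6, 8)) {
  mpg.k <- mtcars$mpg[mtcars$cyl == k]
  hist(mpg.k, prob = TRUE, xlab = paste(k, "cyl mpg"), main = "")
  lines(density(mpg.k))
}
dev.off()
```

Replacing mtcars$mpg and mtcars$cyl with your own measurement and grouping columns reproduces the seasonal comparison above.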

### Using ggplot2

We can achieve much the same result with *ggplot2* in a single command (load the package first):

```
> library(ggplot2)
> qplot(cnt, data = bike) + facet_wrap(~ season, nrow=2) + geom_histogram(fill = "blue")
```

You can also combine all four into a single histogram and show the seasonal differences through coloring:

```
> qplot(cnt, data = bike, fill = season)
```

## How it works...

When you plot a single variable with *qplot*, you get a histogram by default. Adding *facet_wrap* generates one histogram per level of the chosen faceting variable, and its *nrow* argument controls how the panels are laid out (here, two rows of two).

## There's more...

You can use *ggplot2* to generate comparative boxplots as well.

### Creating boxplots with ggplot2

Instead of the default histogram, you can get a boxplot with either of the following two approaches:

```
> qplot(season, cnt, data = bike, geom = c("boxplot"), fill = season)

> ggplot(bike, aes(x = season, y = cnt)) + geom_boxplot()
```

The first line of the preceding code produces the following output:

The second line produces the following plot:

# Building, plotting, and evaluating classification trees

Several R packages can build classification trees, and under the hood they all do essentially the same thing. This recipe uses *rpart* to build the tree, *rpart.plot* to draw it, and *caret* to partition the data.

## Getting ready

If you do not already have the *rpart*, *rpart.plot*, and *caret* packages, install them now. Download the data files and place the *banknote-authentication.csv* file in your R working directory.

## How to do it...

This recipe shows you how you can use the *rpart* package to build classification trees and the *rpart.plot* package to generate nice-looking tree diagrams:

- Load the *rpart*, *rpart.plot*, and *caret* packages:

```
> library(rpart)
> library(rpart.plot)
> library(caret)
```

- Read the data:

```
> bn <- read.csv("banknote-authentication.csv")
```

- Create data partitions. We need two partitions—training and validation. Rather than copying the data into the partitions, we will just keep the indices of the cases that represent the training cases and subset as and when needed:

```
> set.seed(1000)
> train.idx <- createDataPartition(bn$class, p = 0.7, list = FALSE)
```

- Build the tree:

```
> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class",
    control = rpart.control(minsplit = 20, cp = 0.01))
```

- View the text output (your result could differ if you did not set the random seed as in step 3):

```
> mod
n= 961

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 961 423 0 (0.55983351 0.44016649)
   2) variance>=0.321235 511 52 0 (0.89823875 0.10176125)
     4) curtosis>=-4.3856 482 29 0 (0.93983402 0.06016598)
       8) variance>=0.92009 413 10 0 (0.97578692 0.02421308) *
       9) variance< 0.92009 69 19 0 (0.72463768 0.27536232)
        18) entropy< -0.167685 52 6 0 (0.88461538 0.11538462) *
        19) entropy>=-0.167685 17 4 1 (0.23529412 0.76470588) *
     5) curtosis< -4.3856 29 6 1 (0.20689655 0.79310345)
      10) variance>=2.3098 7 1 0 (0.85714286 0.14285714) *
      11) variance< 2.3098 22 0 1 (0.00000000 1.00000000) *
   3) variance< 0.321235 450 79 1 (0.17555556 0.82444444)
     6) skew>=6.83375 76 18 0 (0.76315789 0.23684211)
      12) variance>=-3.4449 57 0 0 (1.00000000 0.00000000) *
      13) variance< -3.4449 19 1 1 (0.05263158 0.94736842) *
     7) skew< 6.83375 374 21 1 (0.05614973 0.94385027)
      14) curtosis>=6.21865 106 16 1 (0.15094340 0.84905660)
        28) skew>=-3.16705 16 0 0 (1.00000000 0.00000000) *
        29) skew< -3.16705 90 0 1 (0.00000000 1.00000000) *
      15) curtosis< 6.21865 268 5 1 (0.01865672 0.98134328) *
```

- Generate a diagram of the tree (your tree might differ if you did not set the random seed as in step 3):

```
> prp(mod, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE,
    faclen = 4, varlen = 8, shadow.col = "gray")
```

The following output is obtained as a result of the preceding command:

- Prune the tree:

```
> # First see the cptable
> # !!Note!!: Your table can be different because of the
> # random aspect in cross-validation
> mod$cptable
          CP nsplit  rel error    xerror       xstd
1 0.69030733      0 1.00000000 1.0000000 0.03637971
2 0.09456265      1 0.30969267 0.3262411 0.02570025
3 0.04018913      2 0.21513002 0.2387707 0.02247542
4 0.01891253      4 0.13475177 0.1607565 0.01879222
5 0.01182033      6 0.09692671 0.1347518 0.01731090
6 0.01063830      7 0.08510638 0.1323877 0.01716786
7 0.01000000      9 0.06382979 0.1276596 0.01687712
> # Choose CP value as the highest value whose
> # xerror is not greater than minimum xerror + xstd
> # With the above data that happens to be
> # the fifth one, 0.01182033
> # Your values could be different because of random
> # sampling
> mod.pruned = prune(mod, mod$cptable[5, "CP"])
```

- View the pruned tree (your tree will look different):

```
> prp(mod.pruned, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE,
    faclen = 4, varlen = 8, shadow.col = "gray")
```

- Use the pruned model to predict for the validation partition (note the minus sign before *train.idx*, which selects the cases in the validation partition):

```
> pred.pruned <- predict(mod.pruned, bn[-train.idx,], type = "class")
```

- Generate the error/classification-confusion matrix:

```
> table(bn[-train.idx,]$class, pred.pruned, dnn = c("Actual", "Predicted"))
      Predicted
Actual   0   1
     0 213  11
     1  11 176
```
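From an error matrix like this, overall accuracy follows directly as the diagonal (correct) counts divided by the total. A minimal sketch using the counts shown above (your counts may differ if you did not set the seed):

```r
# Confusion matrix counts from the run shown above (may differ on your run)
conf <- matrix(c(213, 11,
                 11, 176),
               nrow = 2, byrow = TRUE,
               dimnames = list(Actual = c("0", "1"),
                               Predicted = c("0", "1")))

# Accuracy = correct predictions (diagonal) / all predictions
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 4)
```

For these counts the accuracy works out to 389/411, roughly 0.95.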

## How it works...

Steps 1 to 3 load the packages, read the data, and identify the cases in the training partition, respectively. In step 3, we set the random seed so that your results should match those that we display.

Step 4 builds the classification tree model:

```
> mod <- rpart(class ~ ., data = bn[train.idx, ], method = "class",
    control = rpart.control(minsplit = 20, cp = 0.01))
```

The *rpart()* function builds the tree model based on the following:

- Formula specifying the dependent and independent variables
- Dataset to use
- A specification through *method = "class"* that we want to build a classification tree (as opposed to a regression tree)
- Control parameters specified through the *control = rpart.control()* setting; here we have indicated that the tree should only consider nodes with at least 20 cases for splitting and should use a complexity parameter value of 0.01 (these are the default values, included here just for illustration)
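To see this call pattern run end to end without the banknote file, here is a sketch on R's built-in iris data (the formula and dataset are stand-ins for illustration; the recipe itself uses the banknote data):

```r
library(rpart)  # rpart ships with standard R installations

# Same call pattern as the recipe, applied to the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20, cp = 0.01))

# Predict on the training data itself, just to exercise the interface
preds <- predict(fit, iris, type = "class")
mean(preds == iris$Species)   # in-sample accuracy, typically above 0.9
```

In real use you would, as in the recipe, predict on a held-out validation partition rather than the training data.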

Step 5 produces a textual display of the results. Step 6 uses the *prp()* function of the *rpart.plot* package to produce a nice-looking plot of the tree:

```
> prp(mod, type = 2, extra = 104, nn = TRUE, fallen.leaves = TRUE,
    faclen = 4, varlen = 8, shadow.col = "gray")
```

- Use *type = 2* to get a plot with every node labeled and with the split label below the node
- Use *extra = 4* to display the probability of each class in the node (conditioned on the node and hence summing to 1); add 100 (hence *extra = 104*) to also display the number of cases in the node as a percentage of the total number of cases
- Use *nn = TRUE* to display the node numbers; the root node is node number 1, and node *n* has child nodes numbered *2n* and *2n+1*
- Use *fallen.leaves = TRUE* to display all leaf nodes at the bottom of the graph
- Use *faclen* to abbreviate class names in the nodes to a specific maximum length
- Use *varlen* to abbreviate variable names
- Use *shadow.col* to specify the color of the shadow that each node casts

Step 7 prunes the tree to reduce overfitting, that is, the chance that the model too closely fits the training data. Within this step, we first look at the complexity table generated through cross-validation. We then use the table to determine the cutoff: the largest complexity parameter (CP) value whose cross-validation error (*xerror*) does not exceed the minimum *xerror* plus one standard deviation (*xstd*).
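This selection rule can be written down as code. The sketch below uses a made-up cptable matrix (its column layout matches mod$cptable, with values rounded from the output shown earlier) and picks the largest CP whose xerror stays within one standard deviation of the minimum:

```r
# Hypothetical complexity table with rpart's column layout;
# with a real model these values come from mod$cptable
cptab <- matrix(c(
  0.690, 0, 1.000, 1.000, 0.036,
  0.095, 1, 0.310, 0.326, 0.026,
  0.040, 2, 0.215, 0.239, 0.022,
  0.019, 4, 0.135, 0.161, 0.019,
  0.012, 6, 0.097, 0.135, 0.017,
  0.011, 7, 0.085, 0.132, 0.017,
  0.010, 9, 0.064, 0.128, 0.017),
  ncol = 5, byrow = TRUE,
  dimnames = list(NULL, c("CP", "nsplit", "rel error", "xerror", "xstd")))

# Threshold: minimum cross-validated error plus one standard deviation
i.min  <- which.min(cptab[, "xerror"])
thresh <- cptab[i.min, "xerror"] + cptab[i.min, "xstd"]

# Rows are sorted by decreasing CP, so the first row meeting the
# threshold carries the largest qualifying CP
best    <- min(which(cptab[, "xerror"] <= thresh))
cp.best <- cptab[best, "CP"]
```

With these numbers the rule selects the fifth row, matching the choice made by hand in step 7; with a real model you would then call prune(mod, cp.best).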

Steps 8 through 10 display the pruned tree, use it to predict the class for the validation partition, and then generate the error matrix for the validation partition.

## There's more...

The following is an important variation on predictions using classification trees.

### Computing raw probabilities

We can generate probabilities in place of classifications by specifying *type="prob"*:

```
> pred.pruned <- predict(mod.pruned, bn[-train.idx,], type = "prob")
```

### Creating an ROC chart

Using the preceding raw probabilities and the class labels, we can generate an ROC chart with the *ROCR* package:

```
> library(ROCR)
> pred <- prediction(pred.pruned[,2], bn[-train.idx,"class"])
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)
```

# Using time series objects

In this recipe, we look at various features for creating and plotting time series objects. We will consider data with both a single time series and multiple time series.

## Getting ready

If you have not already downloaded the data files, do it now and ensure that the files are in your R working directory.

## How to do it...

- Read the data. The file has 100 rows and a single column named *sales*:

```
> s <- read.csv("ts-example.csv")
```

- Convert the data to a simplistic time series object without any explicit notion of time:

```
> s.ts <- ts(s)
> class(s.ts)
[1] "ts"
```

- Plot the time series:

```
> plot(s.ts)
```

- Create a proper time series object with proper time points:

```
> s.ts.a <- ts(s, start = 2002)
> s.ts.a
Time Series:
Start = 2002
End = 2101
Frequency = 1
     sales
[1,]    51
[2,]    56
[3,]    37
[4,]   101
[5,]    66
(output truncated)
> plot(s.ts.a)
> # results show that R treated this as an annual
> # time series with 2002 as the starting year
```

The result of the preceding commands is seen in the following graph:

To create a monthly time series, run the following command:

```
> # Create a monthly time series
> s.ts.m <- ts(s, start = c(2002,1), frequency = 12)
> s.ts.m
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2002  51  56  37 101  66  63  45  68  70 107  86 102
2003  90 102  79  95  95 101 128 109 139 119 124 116
2004 106 100 114 133 119 114 125 167 149 165 135 152
2005 155 167 169 192 170 180 175 207 164 204 180 203
2006 215 222 205 202 203 209 200 199 218 221 225 212
2007 250 219 242 241 267 249 253 242 251 279 298 260
2008 269 257 279 273 275 314 288 286 290 288 304 291
2009 314 290 312 319 334 307 315 321 339 348 323 342
2010 340 348 354 291
> plot(s.ts.m) # note x axis on plot
```

The following plot can be seen as a result of the preceding commands:

```
> # Specify frequency = 4 for quarterly data
> s.ts.q <- ts(s, start = 2002, frequency = 4)
> s.ts.q
     Qtr1 Qtr2 Qtr3 Qtr4
2002   51   56   37  101
2003   66   63   45   68
2004   70  107   86  102
2005   90  102   79   95
2006   95  101  128  109
(output truncated)
> plot(s.ts.q)
```

- Query time series objects (we use the *s.ts.m* object we created in the previous step):

```
> # When does the series start?
> start(s.ts.m)
[1] 2002    1
> # When does it end?
> end(s.ts.m)
[1] 2010    4
> # What is the frequency?
> frequency(s.ts.m)
[1] 12
```

- Create a time series object with multiple time series. This data file contains US monthly consumer prices for white flour and unleaded gas for the years 1980 through 2014 (downloaded from the website of the US Bureau of Labor Statistics):

```
> prices <- read.csv("prices.csv")
> prices.ts <- ts(prices, start=c(1980,1), frequency = 12)
```

- Plot a time series object with multiple time series:

```
> plot(prices.ts)
```

The plot in two separate panels appears as follows:

```
> # Plot both series in one panel with suitable legend
> plot(prices.ts, plot.type = "single", col = 1:2)
> legend("topleft", colnames(prices.ts), col = 1:2, lty = 1)
```

The two series plotted in one panel appear as follows:

## How it works...

Step 1 reads the data.

Step 2 uses the *ts* function to generate a time series object based on the raw data.

Step 3 uses the *plot* function to generate a line plot of the time series. We see that the time axis does not provide much information. Time series objects can represent time in more friendly terms.

Step 4 shows how to create time series objects with a better notion of time. It shows how we can treat a data series as an annual, monthly, or quarterly time series. The *start* and *frequency* parameters help us to control these data series.

Although the time series we provide is just a list of sequential values, in reality our data can have an implicit notion of time attached to it. For example, the data can be annual numbers, monthly numbers, or quarterly ones (or something else, such as 10-second observations of something). Given just the raw numbers (as in our data file, *ts-example.csv*), the *ts* function cannot figure out the time aspect and by default assumes no secondary time interval at all.

We can use the *frequency* parameter to tell *ts* how to interpret the time aspect of the data. The *frequency* parameter controls how many secondary time intervals there are in one major time interval. If we do not explicitly specify it, by default *frequency* takes on a value of 1. Thus, the following code treats the data as an annual sequence, starting in 2002:

```
> s.ts.a <- ts(s, start = 2002)
```

The following code, on the other hand, treats the data as a monthly time series starting in January 2002. If we specify the *start* parameter as a single number, R treats it as the first subperiod (if any) of the specified period. When *frequency* differs from 1, the *start* parameter can instead be a vector giving the major period and the subperiod where the series starts; *c(2002,1)* represents January 2002:

```
> s.ts.m <- ts(s, start = c(2002,1), frequency = 12)
```

Similarly, the following code treats the data as a quarterly sequence, starting in the first quarter of 2002:

```
> s.ts.q <- ts(s, start = 2002, frequency = 4)
```

The *frequency* values of 12 and 4 have a special meaning—they represent monthly and quarterly time sequences.

We can supply *start* and *end*, just one of them, or none. If we do not specify either, then R treats the *start* as 1 and figures out *end* based on the number of data points. If we supply one, then R figures out the other based on the number of data points.

While *start* and *end* do not play a role in computations, *frequency* plays a big role in determining seasonality, which captures periodic fluctuations.

If we have some other specialized time series, we can specify the *frequency* parameter appropriately. Here are two examples:

- With measurements taken every 10 minutes and seasonality pegged to the hour, we should specify *frequency* as 6
- With measurements taken every 10 minutes and seasonality pegged to the day, we should specify *frequency* as 144 (6 measurements per hour times 24 hours per day)
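To make the interaction between *start* and *frequency* concrete, here is a small runnable sketch with arbitrary numbers:

```r
# 24 arbitrary observations treated as monthly data from January 2002
x <- ts(1:24, start = c(2002, 1), frequency = 12)
start(x)       # 2002 1
end(x)         # 2003 12
frequency(x)   # 12

# The same 24 values treated as quarterly data span six years instead of two
q <- ts(1:24, start = c(2002, 1), frequency = 4)
end(q)         # 2007 4
```

The raw numbers never change; only the time interpretation, and hence the axis labels and any seasonality-aware computations, differ.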

Step 5 shows the use of the functions *start*, *end*, and *frequency* to query time series objects.

Steps 6 and 7 show that R can handle data files that contain multiple time series.

# Applying functions to subsets of a vector

The *tapply* function applies a function to each partition of the dataset. Hence, when we need to evaluate a function over subsets of a vector defined by a factor, *tapply* comes in handy.

## Getting ready

Download the files and store the *auto-mpg.csv* file in your R working directory. Read the data and create factors for the *cylinders* variable:

```
> auto <- read.csv("auto-mpg.csv", stringsAsFactors=FALSE)
> auto$cylinders <- factor(auto$cylinders, levels = c(3,4,5,6,8),
    labels = c("3cyl", "4cyl", "5cyl", "6cyl", "8cyl"))
```

## How to do it...

To apply functions to subsets of a vector, follow these steps:

- Calculate mean mpg for each cylinder type:

```
> tapply(auto$mpg, auto$cylinders, mean)
    3cyl     4cyl     5cyl     6cyl     8cyl
20.55000 29.28676 27.36667 19.98571 14.96311
```

- We can even specify multiple factors as a list. The following example shows only one factor, since our data file has only one, but it serves as a template you can adapt:

```
> tapply(auto$mpg, list(cyl=auto$cylinders), mean)
cyl
    3cyl     4cyl     5cyl     6cyl     8cyl
20.55000 29.28676 27.36667 19.98571 14.96311
```

## How it works...

In step 1 the *mean* function is applied to the *auto$mpg* vector grouped according to the *auto$cylinders* vector. The grouping factor should be of the same length as the input vector so that each element of the first vector can be associated with a group.

The *tapply* function creates groups of the first argument based on each element's group affiliation as defined by the second argument and passes each group to the user-specified function.

Step 2 shows that we can actually group by several factors specified as a list. In this case, *tapply* applies the function to each unique combination of the specified factors.
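To see multiple grouping factors in action on data that ships with R, we can group mileage in the built-in mtcars dataset by both cylinder count and transmission type (mtcars stands in here for the recipe's auto data, which has only one factor):

```r
# Mean mpg per combination of cylinder count (4/6/8) and
# transmission (0 = automatic, 1 = manual), using built-in mtcars
m <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)
dim(m)   # 3 x 2: one cell per cyl/am combination
```

The result is a matrix with one dimension per grouping factor; with three factors it would be a three-dimensional array.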

## There's more...

The *by* function is similar to *tapply*: it applies a function to groups of rows in a dataset, but it passes the entire sub-data frame for each group to the function. The following example clarifies this.

### Applying a function on groups from a data frame

In the following example, we find the correlation between *mpg* and *weight* for each cylinder type:

```
> by(auto, auto$cylinders, function(x) cor(x$mpg, x$weight))
auto$cylinders: 3cyl
[1] 0.6191685
---------------------------------------------------
auto$cylinders: 4cyl
[1] -0.5430774
---------------------------------------------------
auto$cylinders: 5cyl
[1] -0.04750808
---------------------------------------------------
auto$cylinders: 6cyl
[1] -0.4634435
---------------------------------------------------
auto$cylinders: 8cyl
[1] -0.5569099
```
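A runnable analogue on built-in data, computing the mileage-weight correlation within each cylinder group of mtcars (column names differ from the auto data, but the pattern is identical):

```r
# by() hands each cylinder group's sub-data frame to the function,
# so the function can refer to several columns at once
res <- by(mtcars, mtcars$cyl, function(x) cor(x$mpg, x$wt))
length(res)   # 3: one correlation per cylinder group (4, 6, 8)
```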

# Summary

Being an extensible system, R divides its functionality across numerous packages, each exposing a large number of functions. Even experienced users cannot remember all the details off the top of their heads. In this article, we went through a few techniques with which R helps us analyze data and visualize the results.
