The R Project for Statistical Computing (or just R for short) is a powerful data analysis tool. It is both a programming language and a computational and graphical environment.

R is free, open source software made available under the GNU General Public License. It runs on Mac, Windows, and Unix operating systems.

The official R website is available at the following site:

In this article by **John M. Quick**, author of the book Statistical Analysis with R, you will learn how to:

- Create different charts, graphs, and plots in R
- Customize your R visuals using text, colors, axes, and legends

# Charts, graphs, and plots in R

R features several options for creating charts, graphs, and plots. In this article, we will explore the generation and customization of these visuals, as well as methods for saving and exporting them for use outside of R. The following visuals will be covered in this article:

- Bar graphs
- Scatterplots
- Line charts
- Box plots
- Histograms
- Pie charts

## Time for action — creating a bar chart

A **bar chart** or **bar graph** is a common visual that uses rectangles to depict the values of different items. Bar graphs are especially useful when comparing data over time or between diverse groups. Let us create a bar chart in R:

- Open R and set your working directory:

> #set the R working directory

> #replace the sample location with one that is relevant to you

> setwd("/Users/johnmquick/rBeginnersGuide/")

- Use the *barplot(...)* function to create a bar chart:

> #create a bar chart that compares the mean durations of the battle methods

> #calculate the mean duration of each battle method

> meanDurationFire <- mean(subsetFire$DurationInDays)

> meanDurationAmbush <- mean(subsetAmbush$DurationInDays)

> meanDurationHeadToHead <- mean(subsetHeadToHead$DurationInDays)

> meanDurationSurround <- mean(subsetSurround$DurationInDays)

> #use a vector to define the chart's bar values

> barAllMethodsDurationBars <- c(meanDurationFire, meanDurationAmbush, meanDurationHeadToHead, meanDurationSurround)

> #use barplot(...) to create and display the bar chart

> barplot(height = barAllMethodsDurationBars)

- Your chart will be displayed in the graphic window, similar to the following:

*What just happened?*

You created your first graphic in R. Let us examine the *barplot(...)* function that we used to generate our bar chart, along with the new R components that we encountered.

### barplot(...)

We created a bar chart that compared the mean durations of battles between the different combat methods. As it turns out, there is only one required argument in the *barplot(...)* function. This *height* argument receives a series of values that specify the length of each bar. Therefore, the *barplot(...)* function, at its simplest, takes on the following form:

barplot(height = heightValues)

Accordingly, our bar chart function reflected this same format:

> barplot(height = barAllMethodsDurationBars)
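The data objects used above (*subsetFire* and its siblings) come from earlier chapters of the book and are not defined in this article. As a minimal sketch, the following assumes a small made-up data frame named *battleData*, with hypothetical *Method* and *DurationInDays* columns, to walk through the same steps from start to finish:

```r
# hypothetical stand-in for the book's battle data (all values are invented)
battleData <- data.frame(
  Method = c("Fire", "Fire", "Ambush", "Ambush",
             "Head to Head", "Head to Head", "Surround", "Surround"),
  DurationInDays = c(4, 6, 10, 14, 60, 100, 90, 110)
)

# pick out the durations for each method and calculate their means
meanFire <- mean(battleData$DurationInDays[battleData$Method == "Fire"])
meanAmbush <- mean(battleData$DurationInDays[battleData$Method == "Ambush"])
meanHeadToHead <- mean(battleData$DurationInDays[battleData$Method == "Head to Head"])
meanSurround <- mean(battleData$DurationInDays[battleData$Method == "Surround"])

# combine the means into a vector and draw one bar per value
exampleBars <- c(meanFire, meanAmbush, meanHeadToHead, meanSurround)
barplot(height = exampleBars)
```

Swapping in your own data frame and column names should be all that is needed to adapt the sketch.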

### Vectors

We stored the heights of our chart's bars in a **vector** variable. In R, a vector is an ordered series of values that share the same data type. R's *c(...)* function can be used to create a vector from one or more data points. For example, the numbers 1, 2, 3, 4, and 5 can be arranged into a vector like so:

> #arrange the numbers 1, 2, 3, 4, and 5 into a vector

> numberVector <- c(1, 2, 3, 4, 5)

Similarly, text data can also be placed into vector form, so long as the values are contained within quotation marks:

> #arrange the letters a, b, c, d, and e into a vector

> textVector <- c("a", "b", "c", "d", "e")

Our vector defined the values for our bars:

> #use a vector to define the chart's bar values

> barAllMethodsDurationBars <- c(meanDurationFire, meanDurationAmbush, meanDurationHeadToHead, meanDurationSurround)

Many function arguments in R require vector input. Hence, it is very common to use and encounter the *c(...)* function when working in R.
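To get a feel for how vectors behave, here is a short sketch using base R only. The variable names are ours, chosen for illustration:

```r
# a numeric vector and a text vector
numberVector <- c(1, 2, 3, 4, 5)
textVector <- c("a", "b", "c", "d", "e")

# length(...) counts the items; many functions operate on the whole vector
numberCount <- length(numberVector)
numberMean <- mean(numberVector)

# square brackets retrieve an item by position (R counts from 1)
thirdLetter <- textVector[3]

# a vector holds a single type of data, so mixing numbers and text
# converts every value to text
mixedVector <- c(1, "a")
mixedClass <- class(mixedVector)
```

The last two lines show why text values need quotation marks: R treats quoted values as text and stores the whole vector accordingly.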

### Graphic window

When you executed your *barplot(...)* function in the R console, the **graphic window** opened to display it. The graphic window will have different names across different operating systems, but its purpose and function remain the same. For example, in Mac OS X, the graphic window is named *Quartz*.
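If you are curious what the graphic window is called on your own system, R can report it. The following is a small sketch using the base *dev.cur()* and *dev.off()* functions; the name that is printed depends on your platform and on how R is being run:

```r
# draw something so that a graphics device is opened
barplot(height = c(1, 2, 3))

# dev.cur() reports the active device; its name varies by platform
# (for example "quartz" on a Mac, or "pdf" when R runs as a script)
activeDeviceName <- names(dev.cur())
print(activeDeviceName)

# close the device when finished
dev.off()
```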

For the remainder of this article, all R graphics will be displayed without the graphics window frame, which will allow us to focus on the visuals themselves.

## Pop quiz

- When entering text into a vector using the *c(...)* function, what characters must surround each text value?
    - Quotation marks
    - Parentheses
    - Asterisks
    - Percent signs

- What is the purpose of the R graphic window?
    - To debug graphics functions
    - To execute graphics functions
    - To edit graphics
    - To display graphics

# Time for action — customizing graphics

Although the *barplot(...)* function only requires the height of each bar to be specified, creating a chart in this manner leaves us with a bland, difficult-to-decipher visual. In most cases, you will want to customize your R graphics by incorporating additional arguments into your functions. Let us explore how to use graphic customization arguments by expanding our bar chart:

- Expand your bar chart using graphic customization arguments:

> #use additional arguments to customize a graphic

> #define a title for the bar chart

> barAllMethodsDurationLabelMain <- "Average Duration by Battle Method"

> #define x and y axis labels for the bar chart

> barAllMethodsDurationLabelX <- "Battle Method"

> barAllMethodsDurationLabelY <- "Duration in Days"

> #set the x and y axis scales

> barAllMethodsDurationLimX <- c(0, 5)

> barAllMethodsDurationLimY <- c(0, 120)

> #define rainbow colors for the bars

> barAllMethodsDurationRainbowColors <- rainbow(length(barAllMethodsDurationBars))

> #incorporate customizations into the graphic function using the main, xlab, ylab, xlim, ylim, and col arguments

> #use barplot(...) to create and display the bar chart

> barplot(height = barAllMethodsDurationBars,
main = barAllMethodsDurationLabelMain,
xlab = barAllMethodsDurationLabelX,
ylab = barAllMethodsDurationLabelY,
xlim = barAllMethodsDurationLimX,
ylim = barAllMethodsDurationLimY,
col = barAllMethodsDurationRainbowColors)

- Your chart will be displayed in the graphic window, as shown in the following screenshot:

- Add a legend to the chart, using the following snippet:

> #add a legend to the bar chart

> #the x and y arguments position the legend

> #x and y can be defined using words or numerical coordinates

> #the legend argument receives a vector containing the labels for the legend

> barAllMethodsDurationLegendLabels <- c("Fire", "Ambush", "Head to Head", "Surround")

> #the fill argument contains the colors for the legend

> legend(x = 0, y = 120, legend = barAllMethodsDurationLegendLabels,
fill = barAllMethodsDurationRainbowColors)

- Your legend will be added to the existing chart.

*What just happened?*

The *barplot(...)* function, like the other graphic functions that we will use in this article, accepts a variable number of arguments. In fact, R graphics functions have many customizable options and therefore tend to accept several arguments. We expanded our bar chart using a collection of the most common customization arguments, which apply to nearly all R graphics functions.

### Graphic customization arguments

We used six arguments to customize our bar chart:

- *main*: a text title for the graphic
- *xlab*: a text label for the x axis
- *ylab*: a text label for the y axis
- *xlim*: a vector containing the lower and upper limits for the x axis
- *ylim*: a vector containing the lower and upper limits for the y axis
- *col*: a vector containing the colors to be used in the graphic

The general format for these arguments is as follows:

argument = value

When incorporated into a graphics function, these arguments take on the following form:

graphicsFunction(..., argument = value)

Recognize that these six arguments can be applied to nearly every R graphics function. Each one can be used alone or they can be used in tandem. We will use these arguments throughout the article to refine and improve our visuals.
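As a brief illustration that these arguments carry over to other graphics functions, the following sketch applies all six to *hist(...)*, which draws a histogram. The values are made up for demonstration and are not part of our battle data:

```r
# made-up values, used only to demonstrate the arguments
exampleValues <- c(3, 7, 7, 8, 12, 13, 13, 14, 18, 21)

# the same six customization arguments, applied to a histogram
histogramResult <- hist(x = exampleValues,
  main = "Example Histogram",
  xlab = "Value",
  ylab = "Frequency",
  xlim = c(0, 25),
  ylim = c(0, 5),
  col = "lightblue")
```

Only the first argument is specific to *hist(...)*; the rest are written exactly as they were for our bar chart.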

### main, xlab, and ylab

The *main*, *xlab*, and *ylab* arguments are all used to add clarifying text to graphics. A primary title for a graphic is defined by *main*, while labels for the x and y axes are specified using *xlab* and *ylab*, respectively.

Our *barplot(...)* function made use of the *main*, *xlab*, and *ylab* arguments. We saved our argument values into variables prior to incorporating them into the *barplot(...)* function. First, we defined our text values as variables.

> #define a title for the bar chart

> barAllMethodsDurationLabelMain <- "Average Duration by Battle Method"

> #define x and y axis labels for the bar chart

> barAllMethodsDurationLabelX <- "Battle Method"

> barAllMethodsDurationLabelY <- "Duration in Days"

Then, we used our variables in the final *barplot(...)* function:

> barplot(height = barAllMethodsDurationBars,
main = barAllMethodsDurationLabelMain,
xlab = barAllMethodsDurationLabelX,
ylab = barAllMethodsDurationLabelY,
xlim = barAllMethodsDurationLimX,
ylim = barAllMethodsDurationLimY,
col = barAllMethodsDurationRainbowColors)

This variable technique has the advantages of rendering our code more decipherable and making it easier for us to return to and reuse our data in future graphics.

### xlim and ylim

The *xlim* and *ylim* arguments receive a vector containing the minimum and maximum values for the x and y axes respectively. Thus, in:

xlim = c(50, 250)

A graphic's x axis is told to present the data that fall between 50 and 250. The *ylim* argument operates in identical fashion to *xlim*, with the exception that it acts upon the y axis. These arguments are useful for rescaling a graphic's axes to improve its visual presentation. They can also have the effect of emphasizing or deemphasizing certain data ranges.

In our chart, we used *xlim* to set a minimum of 0 and a maximum of 5 for the x axis. This evenly and comfortably spaced our bars within the graphic window. We used *ylim* to set a minimum of 0 and maximum of 120 for the y axis. This ensured that all of our data were represented and that our bars were displayed at a reasonable height.

> barplot(height = barAllMethodsDurationBars,
main = barAllMethodsDurationLabelMain,
xlab = barAllMethodsDurationLabelX,
ylab = barAllMethodsDurationLabelY,
xlim = barAllMethodsDurationLimX,
ylim = barAllMethodsDurationLimY,
col = barAllMethodsDurationRainbowColors)
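To see why an x axis range of 0 to 5 suits four bars, it helps to know that *barplot(...)* returns the x positions of the bar centers. The sketch below uses made-up heights in place of our mean durations and relies on the default bar width and spacing:

```r
# made-up bar heights standing in for the mean durations
exampleBars <- c(5, 12, 80, 100)

# barplot(...) invisibly returns the x position of each bar's center
barCenters <- barplot(height = exampleBars,
  xlim = c(0, 5),
  ylim = c(0, 120))
print(barCenters)

# confirm that every bar fits within the chosen axis limits
barsFitX <- all(barCenters > 0 & barCenters < 5)
barsFitY <- max(exampleBars) <= 120
```

If you change the number of bars, checking the returned centers in this way is a quick method for choosing a sensible *xlim*.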

### col

Colors are supplied to a graphic through the *col* argument in one of two ways: as rainbow colors, which R generates automatically, or as specific colors of your choice.

#### Rainbow colors

R can generate an automatic sequence of colors for a chart with the *rainbow(...)* function. For our purposes, we simply identified the number of colors that we wished to generate for our chart. To obtain the appropriate number of colors, we used the *length(object)* function, which tells us the number of items contained in a given object. In our case, using *length(object)* on the *barAllMethodsDurationBars* vector yielded a result of 4, one for each of our chart's bars:

> barAllMethodsDurationRainbowColors <- rainbow(length(barAllMethodsDurationBars))

Consequently, the *rainbow(...)* function generated four colors. These colors were applied to the chart's bars when we included the *barAllMethodsDurationRainbowColors* variable in the *col* argument of our *barplot(...)* function.

> barplot(height = barAllMethodsDurationBars,
main = barAllMethodsDurationLabelMain,
xlab = barAllMethodsDurationLabelX,
ylab = barAllMethodsDurationLabelY,
xlim = barAllMethodsDurationLimX,
ylim = barAllMethodsDurationLimY,
col = barAllMethodsDurationRainbowColors)
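It can be instructive to look at what *rainbow(...)* actually produces. In this sketch, made-up bar heights stand in for our mean durations:

```r
# made-up bar heights standing in for the mean durations
exampleBars <- c(5, 12, 80, 100)

# generate one rainbow color per bar
exampleRainbowColors <- rainbow(length(exampleBars))

# the result is a text vector of hexadecimal color codes
print(exampleRainbowColors)

# apply the colors to the bars
barplot(height = exampleBars, col = exampleRainbowColors)
```

Because the number of colors is derived from the data with *length(...)*, the code keeps working if bars are added or removed later.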

#### Specific colors

Alternatively, specific colors can be defined using the *col* argument in tandem with a vector of color names. Common color names such as red, green, blue, and yellow are valid inputs. In this situation, the *col* argument takes on the following form:

col = colorVector

Where *colorVector* is a variable storing a vector of color values like the following:

c("red", "green", "blue", "yellow")

*You can see a complete list of the colors available in R by executing the colors() function.*

Had we wanted to use specific colors in our bar chart, we could have employed the following code:

> #define specific colors for the bars

> barAllMethodsDurationSpecificColors <- c("red", "green", "blue", "yellow")

> #use barplot(...) to create and display the bar chart

> barplot(height = barAllMethodsDurationBars,
main = barAllMethodsDurationLabelMain,
xlab = barAllMethodsDurationLabelX,
ylab = barAllMethodsDurationLabelY,
xlim = barAllMethodsDurationLimX,
ylim = barAllMethodsDurationLimY,
col = barAllMethodsDurationSpecificColors)
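Before relying on a color name, you may wish to confirm that R recognizes it. The following sketch checks our four names against the *colors()* list and uses the base *col2rgb(...)* function to reveal the red, green, and blue components behind each one:

```r
# check that each color name is one that R recognizes
exampleColorNames <- c("red", "green", "blue", "yellow")
namesAreValid <- exampleColorNames %in% colors()
print(namesAreValid)

# col2rgb(...) shows the red, green, and blue components of each name
exampleColorComponents <- col2rgb(exampleColorNames)
print(exampleColorComponents)
```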

### legend(...)

The finishing touch to our bar chart was a legend, or key, that indicated what our bars represented. In R, the *legend(...)* function employs the following arguments:

- *x*: the x position of the legend in numeric terms; alternatively, you can set the overall position of the legend using one of the text values *topleft*, *top*, *topright*, *left*, *center*, *right*, *bottomleft*, *bottom*, or *bottomright*
- *y*: the y position of the legend in numeric terms; if text is used for *x*, omit this argument
- *legend*: a vector containing the labels to be used in the legend
- *fill*: a vector containing the colors to be used in the legend

The basic format for the legend function is as follows:

legend(x = xPosition, y = yPosition, legend = labelVector, fill = colorVector)

For instance, consider the following code:

> legend(x = "topleft", legend = c("a", "b"), fill = rainbow(2))

This would yield a legend placed at the top-left position with labels for a and b whose colors were generated by the *rainbow(...)* function. Note that the x argument used a text value and y was omitted as an alternative to defining the exact numerical position of the legend.

Our function used the x and y coordinates from our chart to position the legend in the upper left-hand corner. When using numbers to define the x and y arguments, the values will always depend on the limits of the x and y axes. For instance, a position of (0, 120) specified the upper left-hand corner in our chart, but a graphic with a maximum y value of 50 would have an upper left-hand corner position of (0, 50). Our *legend* and *fill* arguments incorporated the same labels and colors that were used to generate our bar chart. Thus, our legend was matched to the information depicted in our chart:

> legend(x = 0, y = 120,
legend = barAllMethodsDurationLegendLabels,
fill = barAllMethodsDurationRainbowColors)
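The link between numeric legend positions and axis limits can be checked directly, because *legend(...)* returns the position of the box that it draws. This sketch uses made-up bar heights in place of our mean durations:

```r
# made-up bar heights standing in for the mean durations
exampleBars <- c(5, 12, 80, 100)
exampleColors <- rainbow(length(exampleBars))
exampleLabels <- c("Fire", "Ambush", "Head to Head", "Surround")

# a chart whose x axis starts at 0 and whose y axis tops out at 120
barplot(height = exampleBars, xlim = c(0, 5), ylim = c(0, 120),
  col = exampleColors)

# x = 0 and y = 120 place the top-left corner of the legend at the
# top-left of the axis ranges chosen above
legendInfo <- legend(x = 0, y = 120, legend = exampleLabels,
  fill = exampleColors)

# the returned list records where the legend box was drawn
print(legendInfo$rect$left)
print(legendInfo$rect$top)
```

If *ylim* were changed to *c(0, 50)*, the *y* value passed to *legend(...)* would need to change with it.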

Notice the peculiar implementation of the legend(...) function, which we have not previously encountered. As we will see with other graphics functions, legend(...) does not stand alone. To be properly employed, a compatible graphic must already exist for *legend(...)* to act upon. In this situation, *legend(...)* adds a new legend on top of the visual that is displayed in the graphic window. However, if no graphic is currently displayed when the *legend(...)* function is executed, an error message is returned. This is demonstrated in the following code:

> #using the legend(...) function when no graphic already exists results in the following error

> legend(x = "topleft", legend = c("a", "b"), fill = rainbow(2))

Error in strwidth(legend, units = "user", cex = cex) :
plot.new has not been called yet

Therefore, to add a legend to your graphics in R, be sure to always create the graphic first, then apply the *legend(...)* function.
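The ordering requirement can be demonstrated safely in a script. The sketch below uses the base *tryCatch(...)* function, which is not covered in this article, to capture the error message instead of halting, and then repeats the call in the correct order:

```r
# start with no graphic open
graphics.off()

# calling legend(...) first fails; tryCatch(...) captures the message
legendMessage <- tryCatch(
  {
    legend(x = "topleft", legend = c("a", "b"), fill = rainbow(2))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(legendMessage)

# creating the graphic first allows the legend to be added
barplot(height = c(3, 5), col = rainbow(2))
legend(x = "topleft", legend = c("a", "b"), fill = rainbow(2))
```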

## Pop quiz

- An *xlim* value of *c(100, 300)* means which of the following?
    - Present the data that are not equal to 100 or 300 on the x axis.
    - Present the data that are equal to 100 or 300 on the x axis.
    - Present the data that are less than 100 or greater than 300 on the x axis.
    - Present the data that are between 100 and 300 on the x axis.

- When should the *legend(...)* function be called?
    - Before a graphic function is called.
    - During a graphic function, included as an argument.
    - After a graphic function.
    - When a compatible graphic is displayed in the graphic window.

# Time for action — creating a scatterplot

A **scatterplot** is a fundamental statistics graphic that can be used to better understand the relationships underlying a dataset. Like descriptive statistics and correlations, scatterplots are especially useful as a precursor to more extensive data analyses, such as linear regression modeling. We can use R to generate scatterplots that depict a single relationship between two variables or the relationships between all of the variables in a dataset. We will practice both of these methods:

- Use the *plot(...)* function to create a scatterplot depicting a single relationship between two variables:

> #create a scatterplot that depicts the relationship between the number of Shu and Wei soldiers engaged in past fire attacks

> #get the data to be used in the plot

> scatterplotFireWeiSoldiersData <- subsetFire$WeiSoldiers

> scatterplotFireShuSoldiersData <- subsetFire$ShuSoldiers

> #customize the plot

> scatterplotFireSoldiersLabelMain <- "Soldiers Engaged in Past Fire Attacks"

> scatterplotFireSoldiersLabelX <- "Wei"

> scatterplotFireSoldiersLabelY <- "Shu"

> #use plot(...) to create and display the scatterplot

> plot(x = scatterplotFireWeiSoldiersData,
y = scatterplotFireShuSoldiersData,
main = scatterplotFireSoldiersLabelMain,
xlab = scatterplotFireSoldiersLabelX,
ylab = scatterplotFireSoldiersLabelY)

- Your plot will be displayed in the graphic window, as shown in the following:

- Use the *plot(...)* function to simultaneously depict the relationships between all of the variables in the dataset:

> #create a scatterplot that depicts the relationships between all of the variables in our fire attack dataset

> plot(x = subsetFire)

- A grouping of several plots will be displayed in the graphic window:

*What just happened?*

We created two scatterplots using R's *plot(...)* function, one portraying a single relationship and one displaying all of the relationships in our dataset.

### Single scatterplot

To plot a single relationship between two variables, use R's *plot(...)* function. The primary arguments for *plot(...)* are:

- *x*: the variable to be plotted on the x axis
- *y*: the variable to be plotted on the y axis

Thus, the simplest form of *plot(...)* contains arguments only for the x and y variables, and is as shown:

plot(x = xVariable, y = yVariable)

We used the *plot(...)* function to visualize the relationship between the number of Shu and Wei soldiers involved in past fire attacks. To add relevant text to our graphic, we included the *main*, *xlab*, and *ylab* arguments:

> plot(scatterplotFireWeiSoldiersData,
scatterplotFireShuSoldiersData,
main = scatterplotFireSoldiersLabelMain,
xlab = scatterplotFireSoldiersLabelX,
ylab = scatterplotFireSoldiersLabelY)
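Since *subsetFire* is defined earlier in the book rather than in this article, here is a self-contained sketch of the same plot. It assumes a hypothetical data frame named *exampleFire* whose soldier counts are invented for illustration:

```r
# hypothetical soldier counts for five fire attacks (values are invented)
exampleFire <- data.frame(
  WeiSoldiers = c(10000, 25000, 40000, 60000, 80000),
  ShuSoldiers = c(5000, 12000, 22000, 28000, 45000)
)

# plot Wei on the x axis and Shu on the y axis
plot(x = exampleFire$WeiSoldiers,
  y = exampleFire$ShuSoldiers,
  main = "Soldiers Engaged in Past Fire Attacks",
  xlab = "Wei",
  ylab = "Shu")

# a correlation coefficient summarizes the same relationship numerically
exampleCorrelation <- cor(exampleFire$WeiSoldiers, exampleFire$ShuSoldiers)
print(exampleCorrelation)
```

Pairing the plot with *cor(...)* is optional, but it shows how a scatterplot and a correlation describe the same relationship in two different ways.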

### Multiple scatterplots

We also used the *plot(...)* function to simultaneously explore all of the relationships within our dataset. This yielded a graphic that contained a scatterplot for every variable pair. The format for creating this type of scatterplot is:

plot(x = dataset)

Where *dataset* is a set of data containing multiple variables. For us, the *dataset* argument contained our fire attack data.

> plot(x = subsetFire)

The resulting plot allowed us to visualize all of the relationships between our variables in a single graphic.
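The same idea can be tried without the book's data. The sketch below assumes a hypothetical data frame with four numeric variables, all of whose values are invented. When given a data frame with more than two columns, *plot(...)* draws the same kind of scatterplot matrix that the base *pairs(...)* function produces:

```r
# hypothetical fire attack data with four numeric variables (values are invented)
exampleFire <- data.frame(
  WeiSoldiers = c(10000, 25000, 40000, 60000, 80000),
  ShuSoldiers = c(5000, 12000, 22000, 28000, 45000),
  DurationInDays = c(3, 5, 4, 8, 7),
  Rating = c(2, 3, 3, 4, 5)
)

# plotting the whole data frame draws one panel per ordered pair of variables
plot(x = exampleFire)

# each variable is paired with every other one, once on each axis
variableCount <- ncol(exampleFire)
panelCount <- variableCount * (variableCount - 1)
print(panelCount)
```

The panel count grows quickly with the number of variables, so for wide datasets you may prefer to plot a subset of columns.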

## Pop quiz

- Assume that a and b are data variables. Which of the following best describes the graphic that would result from the following line of code?

> plot(x = a, y = b)

- A scatterplot with a on the x axis and b on the y axis.
- A scatterplot with b on the x axis and a on the y axis.
- A scatterplot containing all of the relationships in the dataset.
- A scatterplot containing none of the relationships in the dataset.

- Assume that a is a dataset. Which of the following best describes the graphic that would result from the following line of code?

> plot(x = a)

- A scatterplot with a on the x axis.
- A scatterplot with a on the y axis.
- A scatterplot containing all of the relationships in the dataset.
- A scatterplot containing none of the relationships in the dataset.

# Summary

In this article, you created several charts, graphs, and plots. This process entailed using R's graphical prowess to generate and customize visual representations of your data. At this point, you should be able to:

- Use R to create various charts, graphs, and plots
- Customize your R visuals using text, colors, axes, and legends

In the next article we will take a look at some more charts, graphs, and plots in R. We will also take a look at exporting graphics for use outside of R.

## About the Author

## John M. Quick

He is an Educational Technology doctoral student at Arizona State University who is interested in the design, research, and use of educational innovations. Currently, his work focuses on mobile, game-based, and global learning, interactive mixed-reality systems, and innovation adoption. John's blog, which provides articles, tutorials, reviews, perspectives, and news relevant to technology and education, is available from http://www.johnmquick.com. In his spare time, John enjoys photography, nature, and travel.