R Data Visualization Recipes

Installation and Introduction

The following recipes are covered in this chapter:

Installing and loading graphics packages
Using ggplot2, plotly, and ggvis
Making plots using primitives

Installing and loading graphics packages

Before starting, there are some habits worth cultivating to keep improving your R skills. Whenever you program, you will face challenges, and they are usually tackled either by thinking the problem through or by doing some research. Keep a record of each problem and its solution, both for the next time you face it and for later study.

Speaking for myself, keeping a library-like folder of commented examples of problems and their solutions was, and still is, of great help. Naming files properly and making good use of comments (in R, everything after # on a line is a comment) makes revision much easier.

R Markdown documents are useful if you want to keep track of your own progress and, optionally, publish it for others to see. Publishing your learning process is a good way to promote yourself. Also keep in mind that R is a programming language, and a problem can often be solved correctly in more than one way, so stay open to different solutions.

First things first, in order to make good use of a package, you need to install the package and know how to call a package's function.

If your R session has been running for a long time, there is a good chance that a bunch of packages are already loaded. Before installing or updating a package, it's good practice to restart R so that the installation won't interfere with related loaded packages.

How to do it...

Run the following code to install the graphics packages properly:

> install.packages(c('devtools','plotly','ggvis'))
> devtools::install_github('hadley/ggplot2')

How it works...

Most of the book covers three graphics packages: ggplot2, plotly, and ggvis. To install a new package, call install.packages() from the console. That function works for packages available from CRAN-like repositories and for local files. To install from a local file, you need to set more than just the first argument (see the repos and type arguments in the documentation). Entering ?install.packages in the RStudio console leads you to the function documentation in the Help tab.

Moments after running the recipe, all the packages covered in this chapter (devtools included) should be installed. Check the Packages tab in RStudio (speed up the search by typing into the search box); if everything went fine, these four packages are listed under User Library. The following image shows how it might look:

Figure 1.1 - RStudio package window (bottom right corner).

If it fails, check the spelling and your internet connection. The function also prints output such as warnings, progress reports, and results. Look for a message similar to package '<Package Name>' successfully unpacked and MD5 sums checked to make sure that all went fine. Checking the output is good practice to confirm that the plan worked. It also gives good clues for troubleshooting.

Try installing a non-existent package (be creative here) and a package that is already installed, and see what happens. Sometimes incompatibilities prevent proper download and installation. For example, a missing Java installation, or one with the wrong architecture, may prevent you from installing the rJava package.

Note that a package's name must be given as a string in order to work (remember the quotes, ' '). It's also important to check the spelling. Function names, arguments, and package names are case sensitive; if you get even one letter or its case wrong, the desired package will not be found. Also note that the package names were wrapped in a c() call, which creates a vector (try ?c in the console).

The ? sign is actually a function that comes from a package called utils, which ships with base R. Typing ?<function name> leads you to the documentation whenever there is some to display. All functions from CRAN packages and base R, and probably the majority of those from GitHub packages, have documentation files; if the function is not from base R, remember to load the respective package first. Alternatively, you can make calls like this: ?<package name>::<function name>.
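As a quick check of the claim that ? is an ordinary function, the following sketch inspects it from the console. It relies only on what ships with R; the help() call at the end is the long form of ?c and is guarded so it only runs in an interactive session.

```r
# `?` is a regular function object exported by the utils package
is.function(utils::`?`)

# The environment it was defined in is the utils namespace
environmentName(environment(utils::`?`))

# ?c is shorthand for help("c"); both open the same help page
if (interactive()) help("c")
```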

A vector of strings was given as the first argument of install.packages(), so multiple packages can be downloaded and installed with a single call. The function does not install only the packages you ask for, but also the packages each of them depends on.
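Since install.packages() accepts a vector, one convenient pattern is to install only what is missing. The following is a minimal sketch, not part of the recipe itself: missing_packages() is a hypothetical helper written for illustration, and the install call is guarded so that nothing is downloaded in a non-interactive run.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
# system.file() returns "" when a package cannot be found in the library paths.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  pkgs[!found]
}

to_install <- missing_packages(c("devtools", "plotly", "ggvis"))
to_install

# Install only what is missing; dependencies are pulled in by install.packages()
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```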

Once the packages are installed, you have a bunch of new functions at your disposal. To get to know these functions, you can look up the packages' documentation online. Usually, the documentation can be found at repositories (CRAN, GitHub, and so on).

Now with a bunch of new functions at hand, the next step is to call a function from a specific package. There are several ways of doing that. One way is to type <package name>::<function name>. The previous code block did this when it called install_github(), a function from the devtools package: devtools::install_github().
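To see the :: syntax at work without installing anything, here is a small sketch using tools, a package that ships with R but is usually not attached at startup. The file name is made up for the example.

```r
# Call a single function from the tools package without attaching the package
tools::file_ext("figure-1-2.png")

# The namespace is loaded so that the function can run...
isNamespaceLoaded("tools")

# ...while search() lists what is attached; check whether "package:tools"
# shows up there in your session
search()
```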

There are pros and cons to calling a function this way. As for the pros, you avoid most name conflicts that could happen between packages. You also avoid attaching the whole package to the search path when you only need a single function (the package's namespace is still loaded behind the scenes). Thus, calling a function this way may be useful on two occasions:

A name conflict is expected
Only a few functions from the package are needed, and only a few times

Otherwise, if a package is required many times, typing <package name>:: before every function is counterproductive. It's possible to load and attach the whole package at once. In the RStudio interface, right below the pane that shows the environment objects, there is a pane with a Packages tab. There you can check a package's box to load it and uncheck the box to detach it.

Try detaching ggplot2 by unchecking its box; keep an eye on that box. You can also load packages using functions: require() and library() both do this job. Unlike install.packages(), neither needs the package name in quotes, although passing it as a string still works. Note that both functions load only one package at a time.

Although require() and library() work in a very similar way, they are not identical. If require() fails, it throws a warning; library(), on the other hand, throws an error. There is more: require() returns (invisibly) a logical value, TRUE when the load succeeds and FALSE when it fails, whereas library() does not report success this way (it invisibly returns the names of the attached packages).

For everyday loading, this difference hardly matters, but if you want to write a function or loop that depends on loading a package and checking whether it succeeded, you may find it easier to use require(). Using the logical operator & (and), it's possible to load all three packages at once and store the result in a single variable. Printing this variable gives TRUE if all of them loaded and FALSE if any one failed. This is done as follows:
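The different failure behavior can be observed safely with a package name that does not exist. In this sketch the name notARealPackage123 is an assumption (pick any name you are sure is not installed), and character.only = TRUE tells both functions that the name is stored in a variable.

```r
# A package name that is assumed not to exist in your library
pkg <- "notARealPackage123"

# require() warns and returns FALSE
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok

# library() stops with an error, caught here so the script keeps running
outcome <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
outcome
```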

> lcheck <- require(ggplot2) & require(plotly) & require(ggvis)
> lcheck

lcheck won't tell you which packages failed, or how many. Try assigning c(require(ggplot2), require(plotly), require(ggvis)) instead. Each FALSE element marks a package that is giving you trouble, which makes troubleshooting easier.

By now you should be able to install R packages (from CRAN, Git repositories, or local files), load them, and call a function from a specific package. Now that you are familiar with installing and loading R packages, the next section gives an introduction to the ggplot2 package framework.
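The same idea can be wrapped in a small function that returns a named logical vector, so the failing packages identify themselves. This is only a sketch: load_status() is a hypothetical helper, not something provided by R or by the packages discussed here.

```r
# Hypothetical helper: try to load each package and report success by name
load_status <- function(pkgs) {
  suppressWarnings(
    vapply(pkgs, require, logical(1), character.only = TRUE, quietly = TRUE)
  )
}

status <- load_status(c("ggplot2", "plotly", "ggvis"))
status

# Names of the packages that failed to load (empty if all went fine)
names(status)[!status]
```

Because require() loads one package per call, vapply() is used to call it once for each name.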

There's more...

Installation is also possible via RStudio features, which may seem more user friendly for newcomers. Open RStudio, go to Tools > Install Packages..., type the package names (separated by spaces or commas), and hit Install. RStudio fills in the install.packages() call and shows it in your console.

This is the best option when you are not absolutely sure about the package name but have a good clue. An autocomplete feature helps you figure out exactly what the package is called. You can also install packages from local files with this feature: look for the option called Install from and switch it from Repository to Package Archive File.

RStudio also gives you a Check for Package Updates... option right below Install Packages... Use it once in a while to keep your packages up to date. Along with the packages to be updated, it also shows what is new in them.

Using ggplot2, plotly, and ggvis

ggplot2, ggvis, and plotly have proven to be very useful graphics packages in the R universe. Each has gained a respectable following among R users and is known for the many graphical tasks it handles elegantly.

The purpose of this section is to give a brief introduction to the general framework of ggplot2 via some basic examples, and to show how to tackle similar tasks using ggvis and plotly. Along the way, some pros and cons of each package will be highlighted.

Whenever you need to choose between packages (and base R), it's important to weigh the tasks each one was designed to handle, the amount of work required to achieve your goal (learning time included), and the time you actually have. It's also good to consider the payoff in future uses. For example, mastering ggplot2 may not seem a smart choice for a one-off task, but it might pay off if you're expecting lots of graphical challenges in the future.

Keep in mind that all three packages can handle a wide range of tasks. Some jobs suit one specific package better, and some tasks are almost impracticable with the others. This point will become clearer as the book goes on.

Getting ready

The only requirement for this section is to have the ggplot2, ggvis, and plotly packages properly installed. Go back to the Installing and loading graphics packages recipe if that is not the case. Once the installation is checked, it's time to get to know the ggplot2 framework.

How to do it...

First things first: in order to plot using ggplot2, data must come from a data frame object. Data can come from more than one data frame, but it must be arranged into objects of the data frame class.

We took the cars data set to fit this first graphic. It's good to actually get to know the data before plotting, so let's do it using the ?, class(), and head() functions:

> ?cars
> class(cars)
> head(cars)

Plots coming from ggplot2 can be stored in objects. Such objects belong to two classes at the same time, gg and ggplot:

> library(ggplot2)
> plot1 <- ggplot(cars, aes(x = speed, y = dist))

Objects created by the ggplot() function belong to the classes gg and ggplot at the same time. That said, you can refer to a plot crafted by ggplot2 as a ggplot.
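A short way to verify the class membership yourself is sketched below. It assumes ggplot2 is installed; the exact class vector printed by class() may contain additional entries depending on the ggplot2 version, which is why inherits() is used for the actual check.

```r
library(ggplot2)

plot1 <- ggplot(cars, aes(x = speed, y = dist))

# Print the full class vector (may vary across ggplot2 versions)
class(plot1)

# Check membership in each class
inherits(plot1, "gg")
inherits(plot1, "ggplot")
```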

The three packages work more or less in a layered way. To add what we call layers to a ggplot, we can use the + operator:

> plot1 + geom_point()

The + operator is in reality a function.

The result is shown in the following figure:

Figure 1.2 - Simple ggplot2 scatterplot.

Once you learn this framework, getting to know how ggvis works becomes much easier, and vice-versa. A similar graphic can be crafted with the following code:

> library(ggvis)
> ggvis(data = cars, x = ~speed, y = ~dist) %>% layer_points()

plotly would feel a little bit different, but it's not difficult at all to grasp how it works:

> library(plotly)
> plot_ly(data = cars, x = ~speed, y = ~dist, type = 'scatter', mode = 'markers')

Let's give these nuts and bolts some explanations.

How it works...

To give a brief introduction to the data, step 1 starts by calling ?cars. This is a very useful way to learn about the variables and background of almost any data set that comes from a package. Since ggplot2 requires data to come from data frames, class() checks whether that is the case, and the answer is affirmative. At the end of this step, head() shows the first six observations.

Moving on to step 2: after loading ggplot2, it demonstrates how to store the basic coordinate mapping and aesthetics in an object called plot1 (try passing it to class()). To set the basics, it uses ggplot(), the function that initializes every single ggplot.

Storing a plot from ggplot2, ggvis, or plotly in an object is optional, though it is a very useful way to proceed.

To properly set up ggplot(), start by declaring the data set using the data argument. After that, some basic aesthetics and coordinates are assigned. Different figures may require and work with different aesthetics; in the majority of cases those are named inside the aes() function.

As the book goes on, you're going to get used to the ways aesthetics can be declared, inside or outside the aes() function. For now, let's acknowledge that inside aes() it's possible to refer to data frame variables by name, and that they may be displayed in legends.

Checking ?aes shows "..." as an argument, popularly known as three dots but technically named ellipsis. It allows the user to pass an arbitrary number and variety of arguments. Since ggplot2 evaluates these arguments lazily (only as they are requested), you could make up arguments and pass them into the aes() function with little or no trouble. Consider the following:

> plot1 <- ggplot(cars, aes(x = speed, y = dist, gorillaTroubleShooter = TRUE, sight = 'Legolas'))

It would work as well as the earlier version. Just don't forget to name the arguments, and you've got yourself a good way to plant some Easter eggs in your code (also a good way to confuse unaware developers). Both aes() and ggplot() play core roles in building graphics with this package.
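The two mechanisms mentioned here, the ellipsis and lazy evaluation, are general R features and can be tried without ggplot2. The sketch below uses two hypothetical toy functions (arg_names() and first_only()) defined only for illustration; it does not reproduce how aes() is implemented.

```r
# Hypothetical function: collect whatever named arguments arrive through ...
arg_names <- function(...) names(list(...))
arg_names(x = 1, gorillaTroubleShooter = TRUE, sight = "Legolas")

# Lazy evaluation: `y` is never used, so the stop() call is never run
first_only <- function(x, y) x
first_only(1, stop("this is never evaluated"))
```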

Up to step 2, only the coordinate mapping was set in the object named plot1; calling it alone displays an empty graphic. Step 3 uses + to add a layer, and the layer called (geom_point()) took care of adding a geometry to the graphic. Besides the plus sign, ggplots are usually built with two families of layer functions: geom_* and stat_*. While the first family comes with a fixed geometry and a default statistical transformation, the second comes with a fixed statistical transformation and a default geometry (this is the grammar of graphics for real); the defaults can be tweaked.

plot1 + stat_identity(geom = 'point') works just the same as step 3. The geom argument of stat_identity() defaults to 'point', so it's fine to skip it. I declared it to reinforce that if you call a statistical transformation you can pick the geometry, and the other way round (if you call a geometry you can change the statistical transformation).

Behind the scenes, geom_point() called the layer() function, which set a couple of arguments that culminated in the creation of a scatterplot. One may want to modify the axis labels and add a regression line. This can be done by simply adding more layers to the plot using the plus sign. One can stack as many layers as desired, as shown next:

> plot1 + geom_point() +
    labs(x = "Speed (mph)", y = "Distance (ft)") +
    geom_smooth(method = "lm", se = FALSE) +
    scale_y_continuous(breaks = seq(0, 125, 25))

The result is shown in figure 1.3:

Figure 1.3 - Adding up several layers to a ggplot.

Combining ggplot2's plus operator (which is actually a function) with layer functions lets the user build plots in a layered, iterative way. It splits the construction of complex graphics into several simple steps. It's also very intuitive and only gets easier with practice.

Yet there are limitations. The difficulty of making interactive graphics with ggplot2 alone may be one. In the majority of cases, such tasks are handled very well by both ggvis and plotly as standalone packages. This leads us to steps 4 and 5.
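That + is a function can be checked in base R, with no plotting involved. The following sketch calls it by name using backticks; ggplot2 builds on the same mechanism to give + its layer-adding behavior when the left-hand side is a plot.

```r
# The usual infix form
1 + 2

# The same call written as an ordinary function call
`+`(1, 2)

# Being a function, it can be passed to other functions
sapply(1:3, `+`, 10)
```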

Calling plotly::ggplotly() after bringing a ggplot up will coerce it into an interactive plot. It may fail sometimes. Do not forget to have plotly installed.
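As a minimal sketch of this conversion, assuming both ggplot2 and plotly are installed, the scatterplot from step 3 can be stored and handed to ggplotly() explicitly. The object names are chosen here for illustration only.

```r
library(ggplot2)
library(plotly)

# Build the static scatterplot first
static_plot <- ggplot(cars, aes(x = speed, y = dist)) + geom_point()

# Convert it into an interactive plotly object
interactive_plot <- ggplotly(static_plot)
interactive_plot
```

Passing the plot explicitly avoids depending on whichever ggplot happened to be displayed last.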

Step 4 loads the ggvis package using library() and then gives birth to an interactive plot. It has many similarities with ggplot2. The ggvis() function handles the basic coordinate mapping, while the pipe operator (%>%) adds a layer created by the layer_points() function. Remember: the pipe operator, not the plus sign.

ggvis treats arguments differently depending on whether they are declared using = (always scaled) or := (never scaled). Also, ~ must come before variable names.

Function names change from ggplot2 to ggvis, and so does the operator used to add layers, but the underlying logic stays essentially the same. Layers from ggvis have several counterparts in ggplot2; refer to the See also section to track some. Compared with ggplot2, ggvis is much younger and some utilities may be yet to come; also, data does not need to come from a data frame object.

Step 5 draws an interactive plotly graph. A single function (plot_ly()) takes care of both coordinate mapping and geometry. It can be designed in a more layered way using the add_trace() function, but there is no real need for that when the plot is this simple. Instead of having many functions for statistical transformations and geometries, those are declared through arguments of the main function.
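The difference between = and := is easier to see in a call. In this sketch (assuming ggvis is installed), speed and dist are mapped through scales with =, while the fill color and point size are fixed, unscaled values set with :=. The particular color and size are arbitrary choices for the example.

```r
library(ggvis)

# `=` maps data variables through a scale; `:=` sets raw, unscaled values
scaled_and_set <- ggvis(cars, x = ~speed, y = ~dist,
                        fill := "steelblue", size := 40) %>%
  layer_points()
scaled_and_set
```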
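For comparison, here is a sketch of the more layered plotly style, assuming plotly is installed: plot_ly() only declares the data and the coordinate mapping, and the geometry is added afterwards with add_trace().

```r
library(plotly)

# Declare data and mapping first, then add the geometry as a separate trace
layered_plot <- plot_ly(cars, x = ~speed, y = ~dist) %>%
  add_trace(type = "scatter", mode = "markers")
layered_plot
```

For a plot this simple, the result should match the single-call version from step 5.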

These three packages, ggplot2, ggvis, and plotly, are well-coded and powerful graphics packages. Before picking one of them to handle a task, always consider points like:

What the package is able to do
Time needed to master the skill set required
Time required to handle the task
Amount of time available
Time to be saved later thanks to what you learned

Base R is also a feasible possibility. Whenever you face new challenges, it is a good thing to think through these points.

There's more...

Requiring data to come solely from data frames is a strong restriction, but it obliges the user to be explicit about the data and also draws a very clear line between what is ggplot2's concern (data visualization) and what is not (model visualization). To avoid the headaches that come from downloading spreadsheets, setting up working directories, and loading data from files, we're taking an alternative route: getting data from packages instead.

data.frame() may be the most convenient function to coerce vectors into data frames in R.
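A minimal sketch of that coercion follows. The vectors and their values are made up for the example; each vector becomes one column of the resulting data frame.

```r
# Two plain numeric vectors of the same length (illustrative values)
hours  <- c(4, 16, 8)
grades <- c(4, 9.5, 6)

# Combine them into a data frame, one column per vector
study <- data.frame(hours = hours, grades = grades)

class(study)
str(study)
```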

By doing this, we ensure that readers only need the R console to reproduce the recipes; we want nothing to do with web browsers (we're too cool for school, school meaning web browsers). We shall follow this approach to the end of the book. This recipe relies on the base datasets package to do so. ggplot2 has some data frames of its own.

Enter library(help = 'datasets') to get general information on the other data sets.

It's also important to point out that the gg in ggplot2 and ggvis refers to the Grammar of Graphics. That's a very important and inspiring theory that has influenced ggplot2, ggvis, and plotly. The layered, iterative way these packages handle plots might come from the Grammar of Graphics, and it makes building graphics much easier and more reasonable. Learning this theory may give you a head start in learning these packages, while learning these packages may give you a head start when it comes to learning the Grammar of Graphics.
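If you prefer to work with the list of data sets programmatically rather than read a help page, data() can return it as a matrix. This is a small sketch using only base R; the variable name available is arbitrary.

```r
# data() with a package argument lists that package's data sets;
# the results element is a character matrix with Item and Title columns
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Check that the data set used in this recipe is listed
"cars" %in% available[, "Item"]
```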

Making plots using primitives

Previously, a brief introduction to the frameworks of the ggplot2, ggvis, and plotly packages was given. Next we get started with ggplot2's graphical primitives, using them in a series of recipes with related examples made with ggvis and plotly.

There are a total of eight graphical primitives in ggplot2, one of them already covered in this chapter (geom_point()). It's important to know the primitives well: what they do and when to use them. As fundamental building blocks, they play an essential role in the drawing process. Many tasks can be handled with primitives when there is no dedicated function for them; sometimes, even if there is one, primitives handle the task much better.

Dot plots are a good example. They have a dedicated geom_dotplot() function, but sometimes it is much easier to draw dot plots using geom_point(). Now, let's see how ggplot2 can brew figures using primitives and create related ones using ggvis and plotly.

How to do it...

After loading the package, primitives geom_point() and geom_path() can be stacked in order to plot lines with markers:

> library(ggplot2)
> plot1 <- ggplot(cars, aes(x = speed, y = dist))
> plot1 + geom_point() + geom_path()

The resulting output is shown in the following figure:

Figure 1.4 - Lines with markers plot made by ggplot2's primitives.

The same can be accomplished with the ggvis package, relying on the following code:

> library(ggvis)
> ggvis(cars, x = ~speed, y = ~dist) %>% layer_points() %>% layer_paths()

Figure 1.5 displays a representation of the resulting graphic (only the default theme looks different):

Figure 1.5 - Similar lines and markers plot done by ggvis.

Without using the translation function (ggplotly()) from plotly package, it's also possible to code a similar graphic from scratch relying only on plotly:

> library(plotly)
> plot_ly(cars, x = ~speed, y = ~dist, type = 'scatter', mode = 'lines+markers')

Figure 1.6 shows a snapshot of the graphic produced by the latest code:

Figure 1.6 - Similar lines and markers plot done by plotly.

Let's understand how these are unfolding.

How it works...

The complete list of ggplot2's primitives is: geom_blank(), geom_path(), geom_ribbon(), geom_polygon(), geom_segment(), geom_rect(), geom_text(), and geom_point(). Every primitive starts with geom_, but not every geom_* function is a primitive. In fact, most of them are not.

geom_blank() seems to be the simplest of the primitives. Calling it right after setting up ggplot() displays a blank plot with the axes already adjusted. It's mostly used to check the axis limits implied by the data itself. Maybe you can find it useful for another task; suit yourself.

Other primitives work in a similar way. That is the case for the geom_path(), geom_ribbon(), and geom_polygon() functions. The first draws lines between coordinates; the second draws a band between a lower and an upper bound, which requires additional aes() arguments (ymin and ymax); the last draws filled polygons.

By setting only the starting and ending points, geom_segment() adds a line segment. geom_rect() adds a rectangle to the plot, which requires its four sides (xmin, xmax, ymin, and ymax). geom_text() adds text at the given coordinates. Some graphics display text for each observation instead of points, which is also a good way to show additional information.

The remaining primitive is geom_point(). It's the only primitive called directly so far; it plots points at the given coordinates. Two important points must be highlighted here. First, getting to know the primitives might give you an idea of which functions you will need the most and which the least, but they are not all that ggplot2 is capable of. Primitives are nothing but the building blocks used by other functions.

Second, as the previous recipe stated, you can stack as many layers as you feel like. That is no less true for primitive functions, but it's good to know how they interact with one another. For example, calling geom_blank() after geom_point() does not override the points with a blank space.
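The sketch below stacks these three primitives on top of the cars scatterplot, assuming ggplot2 is installed. The coordinates and the label are hypothetical values, chosen only so that the shapes fall inside the range of the data; each primitive receives its own small data frame.

```r
library(ggplot2)

# Hypothetical coordinates for the annotations
seg  <- data.frame(x = 5, y = 10, xend = 25, yend = 100)
box  <- data.frame(xmin = 10, xmax = 15, ymin = 20, ymax = 60)
note <- data.frame(x = 8, y = 110, label = "made with primitives")

primitives_plot <- ggplot() +
  geom_point(data = cars, aes(x = speed, y = dist)) +
  geom_segment(data = seg, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_rect(data = box, alpha = 0.2,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax)) +
  geom_text(data = note, aes(x = x, y = y, label = label))
primitives_plot
```

Starting from an empty ggplot() keeps each layer's mapping independent, so no layer tries to look up speed or dist in a data frame that lacks them.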

After loading ggplot2 and setting the base aes(), step 1 creates a simple plot with lines and markers. While geom_point() displays the markers, geom_path() draws the lines between them. Note that the latter draws lines following the row order of the data set, so we can call this function order-sensitive.

In many situations, reordering the data improves the visualization. This may be the case for dot, box, violin, and bar plots, among others. If you want paths to be ordered by the x variable, geom_line() does that by itself, though it is not a primitive.

In this particular plot, the lines carry no meaning; they actually mislead. Lines are better used to indicate some sort of order within the data, like chronological order. The only reason they were used was to demonstrate how primitives can be stacked to produce a different visualization from the one done before.

Step 2 draws a plot similar to the one crafted in step 1, but using ggvis instead. library() loads the package, while the ggvis() function maps the basic aesthetics. The following function (layer_points()) sets up the points that work as our markers, and layer_paths() draws the lines between them.

The earlier section argued that ggvis is very similar to ggplot2 in the way graphics are coded. This section actually demonstrated that. First, the function gets the data set, and the variables are passed as arguments. Pipe operators (%>%) are used instead of the plus sign to stack the layers, and layer_* works in a very similar way to geom_*.

In step 3, a similar plotly graphic is crafted. The same function that sets the basic aesthetic mapping (plot_ly()) also deals with geometries. The type and mode arguments set the geometries; both take strings. These two arguments are meant to work together.
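The order sensitivity of geom_path() can be checked by shuffling the rows and inspecting the data each layer actually draws. This sketch assumes ggplot2 is installed and uses its layer_data() helper; the seed and object names are arbitrary.

```r
library(ggplot2)

# Shuffle the rows so that the data is no longer sorted by speed
set.seed(1)
shuffled <- cars[sample(nrow(cars)), ]
base_plot <- ggplot(shuffled, aes(x = speed, y = dist))

# x values in the order each layer will connect them
path_x <- layer_data(base_plot + geom_path())$x
line_x <- layer_data(base_plot + geom_line())$x

is.unsorted(path_x)  # geom_path() keeps the (shuffled) row order
is.unsorted(line_x)  # geom_line() sorts by x before drawing
```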

Setting type = 'scatter' enables the lines and markers modes. Each type has its own set of modes; consult the reference manual to catch them all. We wanted markers and lines at the same time, so we built a string containing those two elements separated by a plus sign ('lines+markers') and assigned it to the mode argument.

mode = 'lines+markers' works as well as mode = 'markers+lines'. Modes can be combined, and order does not matter.

Figures 1.4 to 1.6 look much like time series, but they aren't, and that may give the wrong intuition. There are observations for two variables, and neither one is time. Notice how, for some speed values, there are several different stopping distances. Note that the cars data frame is ordered first by speed and then by distance; paths follow the row order of the data, while for the point geometry order doesn't really matter.

Adding the path geometry was misleading; geom_point() would have been enough. The goal here was to demonstrate how primitives interact, not to produce a meaningful figure. Next, let's build fictional data and draw a graphic that tells the story the right way. Picture a small classroom with only 7 students. The teacher builds a data frame with studying hours and grades for each student.
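Both observations about the cars data (repeated speed values and the row ordering) can be inspected directly in base R. The following is a short sketch; run it to see the counts and the result of the ordering check for yourself.

```r
# How many rows share each speed value?
table(cars$speed)

# Are the rows already sorted by speed and, within speed, by dist?
# order() returns the row permutation that would sort the data this way.
identical(order(cars$speed, cars$dist), seq_len(nrow(cars)))
```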

Data can be created like this:

> allnames <- c('Phill','Ross','Kate','Patrice','Peter','James','Monica')
> classr <- data.frame(names = allnames)
> classr$hours <- c(4, 16, 8, 11, 6, 14, 8)
> classr$grades <- c(4, 9.5, 6, 4, 6, 9, 7.5)

The geom_text() primitive can be used to produce a meaningful graphic:

> library(ggplot2)
> plot2 <- ggplot(classr, aes(x = hours, y = grades))
> plot2 + geom_text(aes(label = names))

The result looks like the following figure 1.7:

Figure 1.7 - Plotting grades and hours as texts using ggplot2's primitive.

Related ggvis and plotly code is shown next:

> library(ggvis)
> ggvis(classr, x = ~hours, y = ~grades, text := ~names) %>% layer_text()
> library(plotly)
> plot_ly(classr, x = ~hours, y = ~grades, type = 'scatter', mode = 'text', text = ~names)

This last brief example illustrates how to brew graphics using only primitives in a more meaningful way. It's very important to think about it. The better graphic is the one that tells the right story objectively, not the one with the most layers.

There's more...

Did you know that both ggvis and plotly can guess which geometry you are looking for? Based on the basic aesthetics defined, they make a guess and adopt a certain geometry. They look at how many variables of what kind (discrete or continuous) were passed, and for some combinations they are able to make a guess. For the most recent example, they would have guessed the points geometry.

Figures produced by both packages are displayed in the Viewer tab if you're using RStudio (they are interactive! Try hovering the mouse over a plotly figure). Figures can be exported as web pages. They can also be exported as PNG, JPEG, and BMP, which loses the interactivity.

This recipe aimed to demonstrate how to construct plots using ggplot2 primitives, and how to build similar graphs using other packages. A question you should always ask yourself is whether the geometry adopted suits the data used. In other words, whether the graphic tells the story that you want it to.

The recipe's goal was to introduce you to the graphical primitives of ggplot2 and draw simple graphics using only primitives. An additional goal was to draw related graphics using the ggvis and plotly packages.

The next chapters dive deeper; each one tackles some families of graphics, highlighting the nuts and bolts of building high-quality plots. As the book advances, so does the complexity involved. At some point, we are going to be plotting interactive globes and 3D surfaces and developing web applications. I find it pretty cool, and I hope you enjoy it.

Chapter 2, Plotting Two Continuous Variables, takes care of scatterplots. It's a very popular kind of plot, and very useful too, but there is a big problem: over-plotting. The chapter will teach not only how to craft scatterplots, but also how to deal with this problem and how to improve scatterplots by adding marginal plots. Let it rip!