R is a free, open-source language and environment for statistical computing and graphics. It has gained wide popularity among scientists from different fields, journalists, and private companies. There are various reasons for that; being open and free of charge may be a couple of them. Also, R requires minimal programming background and has a vibrant online community.
The community has produced a bunch of useful graphics packages. This chapter covers basic aspects of three of them: ggplot2, plotly, and ggvis. The first one (ggplot2) has been around for a long time, is very mature, and is very useful for building non-interactive graphics. plotly and ggvis are much younger packages that can build interactive plots. Both are shiny compatible and can address the needs of web applications. Beginning with installation and loading, this chapter walks through the basic framework of all three packages while demonstrating how to use them.
Before starting, there are some habits you may want to cultivate in order to keep improving your R skills. First of all, whenever you program there may be some challenges to face. Usually those are tackled either by out-thinking the problem or by doing some research. You might want to remember both the problem and its solution, whether for the next time you face it or for study sessions, so keep a record of problems and solutions.
Speaking for myself, making a library-like folder and gathering commented examples of problems and their solutions was, and still is, of great help. Naming files properly and making good use of comments (# starts a comment in R) makes reviewing much easier.
R Markdown documents are pretty useful if you want to keep track of your own development and optionally publish it for others to see. Publishing the learning process is a good way to self-promote. Also, keep in mind that R is a programming language, and a problem can often be solved correctly in more than one way; be open-minded and seek different solutions.
First things first, in order to make good use of a package, you need to install the package and know how to call a package's function.
If your R session has been running for a long time, there is a good chance that a bunch of packages are already loaded. Before installing or updating a package, it's good practice to restart R so that the installation won't interfere with related loaded packages.
Run the following code to install the graphics packages properly:
> install.packages(c('devtools','plotly','ggvis'))
> devtools::install_github('hadley/ggplot2')
Most of the book covers three graphics packages: ggplot2, plotly, and ggvis. In order to install a new package, you can call the install.packages() function in the console. That function works for packages available at CRAN-like repositories and for local files. In order to install packages from local files, you need to name more than just the first argument. Entering ?install.packages into the RStudio console opens the function's documentation.
Moments after running the recipe, all the packages covered in this chapter (devtools included) should be properly installed. Check the Packages tab in your RStudio application (speed up the search by typing into the search box); if everything went fine, these four should be shown under Library. The following image shows how it might look:
Figure 1.1 - RStudio package window (bottom right corner).
If it fails, you may want to check the spelling and the internet connection. The function also prints warnings, progress reports, and results. Look for a message similar to package '<Package Name>' successfully unpacked and MD5 sums checked to make sure that all went fine. Checking the output is good practice in order to know whether the plan worked. It also gives good clues for troubleshooting.
You may want to try installing a non-existing package (be creative here) and a package that is already installed, and see what happens. Sometimes incompatibilities prevent proper download and installation. For example, a missing Java installation, or one with the wrong architecture, may prevent you from installing packages that depend on Java.
Note that a package's name must be given as a string in order to work (remember to use ' '). It's also important to check the spelling. The function call and its arguments are case sensitive; if you get even one letter or its case wrong, the desired package will not be found. Also note that the package names were wrapped in a c() call, which builds a vector (try ?c in the console).
The ? sign is actually a function that comes with a base package called utils. Typing ?<function name> will lead you to the documentation whenever there is one to display. All functions coming from CRAN packages and base R, and probably the majority of those from GitHub, have related documentation files; yet, if the function is not from base R, do not forget to have the respective package loaded. Alternatively, you can make calls like this: ?<package name>::<function name>.
As the first argument of the install.packages() function, a vector of strings was given. That way, multiple packages can be downloaded and installed with a single call. The same function installs not only the packages asked for, but also the packages each of them relies on.
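As a minimal sketch of the idea above, the following base R snippet checks which names in a character vector are not installed yet, so that only those need to be handed to install.packages(). The helper name missing_pkgs and the made-up package name are assumptions introduced for illustration; they are not part of the recipe.

```r
# Hypothetical helper (not from the recipe): report which packages in a
# character vector are not installed in any library on this machine.
missing_pkgs <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!pkgs %in% installed]
}

# stats and utils ship with R; the third name is deliberately made up.
wanted <- c("stats", "utils", "aPackageThatDoesNotExist123")
to_install <- missing_pkgs(wanted)
to_install

# With real package names, the missing ones could then be installed in
# a single call (their dependencies come along):
# install.packages(to_install)
```

Passing the whole vector at once mirrors what the recipe did with c('devtools','plotly','ggvis').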
Once the packages are installed, you have a bunch of new functions at your disposal. In order to get to know these functions, you can look up the packages' documentation online. Usually, the documentation can be found at the repositories (CRAN, GitHub, and so on).
Now, with a bunch of new functions at hand, the next step is to call a function from a specific package. There are several ways of doing that. One possible way is typing <package name>::<package function>. The latest code block did that when it called install_github(), a function coming from the devtools package, so it was called this way:
> devtools::install_github('hadley/ggplot2')
There are pros and cons to calling a function this way. As for the pros, you mostly avoid any name conflict that could possibly happen between packages. Other than that, you also avoid attaching the whole package when you only need to call a single function. Thus, calling a function this way may be useful on two occasions:
- A name conflict is expected
- Only a few functions from that package are needed, and only a few times
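The two occasions listed above can be sketched with packages that ship with every R installation (utils, stats, and datasets), so nothing needs to be installed. The conflicting head() definition is a made-up example, created only to show how the namespaced call sidesteps a name clash.

```r
# Call single functions without attaching their packages.
first_rows   <- utils::head(datasets::cars, 3)
middle_speed <- stats::median(datasets::cars$speed)
first_rows
middle_speed

# A deliberately conflicting name (illustration only): this definition
# masks utils::head() for plain calls made from the workspace.
head <- function(x, ...) "masked!"
masked_call     <- head(datasets::cars)            # our function wins
namespaced_call <- utils::head(datasets::cars, 2)  # still the original
masked_call
namespaced_call

rm(head)  # clean up so head() behaves normally again
```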
Otherwise, if a package is required many times, typing <package name>:: before every function is counterproductive. It's possible to load and attach the whole package at once. In the RStudio interface, right below the window that shows environment objects, there is a window with a Packages tab. Under the Packages tab, it's possible to check a box in order to load a package and uncheck it to detach the package.
Try to detach ggplot2 by unchecking the box; keep an eye on that box. You can also load packages using functions. The require() and library() functions can be used for this task. Unlike install.packages(), neither needs ' ' around the package name in order to work, but if you give the package name as a string it still works. Note that both functions can only load one package at a time.
Although require() and library() work in a very similar way, they do not work exactly the same. If require() fails it throws a warning; library(), on the other hand, will throw an error. There is more: require() returns a logical value that is TRUE when the load succeeds and FALSE when it fails, while library() does not return such a success flag (it invisibly returns the names of the attached packages).
For common loading procedures, that is not a difference that should be taken into account, but if you want to create a function or loop that depends on loading a package and checking whether it succeeded, you may find it easier to do so using require(). Using the logical operator & (and), it's possible to load all three packages at once and store the result in a single variable. Calling this variable will give TRUE if all loads succeed and FALSE if a single one fails. This is done as follows:
> lcheck <- require(ggplot2) & require(plotly) & require(ggvis)
> lcheck
lcheck won't tell you which or how many packages failed. Try assigning c(require(ggplot2), require(plotly), require(ggvis)) instead. Each element returning FALSE points to a package that is giving you trouble; this means better chances at troubleshooting.
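The vector idea can be taken one step further by naming each result after its package. The sketch below uses two packages that ship with R plus one made-up name, so that one load is guaranteed to fail; with the recipe's packages you would put 'ggplot2', 'plotly', and 'ggvis' in pkgs instead.

```r
# Packages to check; the last name is made up so that one load fails.
pkgs <- c("stats", "utils", "aPackageThatDoesNotExist123")

# character.only = TRUE tells require() to treat its argument as a
# string holding the package name rather than as the name itself.
loaded <- sapply(pkgs, function(p) {
  suppressWarnings(require(p, character.only = TRUE, quietly = TRUE))
})

loaded                            # named logical vector, one per package
failed <- names(loaded)[!loaded]  # packages that could not be loaded
failed
```

Because the vector is named, the troublesome packages can be read straight from the output.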
By now you should be able to install R packages (from CRAN, Git repositories, or local files), load them, and call a function from a specific package. Now that you are familiar with R packages' installation and loading procedures, the next section gives an introduction to the ggplot2 package framework.
Installation is also possible via RStudio features, which may seem more user friendly for newcomers. Open RStudio, go to Tools > Install Packages..., type the packages' names (separate them with a space or comma), and hit Install. RStudio fills in the install.packages() function and shows it in your console.
This is most useful when you are not absolutely sure about the package name but have a good clue. An automatic suggestion feature helps you figure out exactly what the package name is. You can also install packages from local files by using this feature. Look for an option called Install from and switch it to Package Archive File.
RStudio also gives you a Check for Package Updates... option right below Install Packages... Hit it once in a while to make sure your packages are up to date. Along with the packages to be updated, it also shows what is new about them.
ggplot2 tidyverse reference manual: http://ggplot2.tidyverse.org/reference/
ggvis CRAN documentation: https://cran.r-project.org/web/packages/ggvis/ggvis.pdf
plotly figure reference: https://plot.ly/r/reference/
ggplot2, ggvis, and plotly have proven to be very useful graphics packages in the R universe. Each of them has gained a respectable amount of popularity among R users, being known for the many graphical tasks each can handle in very elegant ways.
The purpose of this section is to give a brief introduction to the general framework of ggplot2 via some basic examples, and to show how to tackle similar tasks using ggvis and plotly. Along the way, some pros and cons of each package will be highlighted.
Whenever you need to choose between some packages (and base R), it's important to weigh the tasks each one was designed to handle, the amount of work it will require for you to achieve your goal (learning time included), and the time you actually have. It's also good to consider gains of scale in future uses. For example, mastering ggplot2 may not seem a smart choice for a one-time task but might pay off if you're expecting lots of graphical challenges in the future.
Keep in mind that all three packages are suited to a wide range of tasks. There are some jobs that a specific package is more suitable for, and even some tasks that can be considered almost impracticable for the others. This point will become clearer as the book goes on.
The only requirement for this section is to have the ggplot2, ggvis, and plotly packages properly installed. Go back to the Installing and loading graphics packages recipe if that is not the case. Once the installation is checked, it's time to get started.
First things first: in order to plot using ggplot2, data must come from a data frame object. Data can come from more than one data frame, but it's mandatory to have it arranged into objects of the data frame class.
- We took the cars data set to fit this first graphic. It's good to actually get to know the data before plotting, so let's do it using the ?, class(), and head() functions:
> ?cars
> class(cars)
> head(cars)
- Plots coming from ggplot2 can be stored in objects. They belong to two classes at the same time, gg and ggplot:
> library(ggplot2)
> plot1 <- ggplot(cars, aes(x = speed, y = dist))
Objects created by the ggplot() function belong to the classes gg and ggplot at the same time. That said, you can refer to a plot crafted by ggplot2 as a ggplot.
- The three packages work more or less in a layered way. To add what we call layers to a ggplot, we can use the + operator:
> plot1 + geom_point()
The result is shown in the following figure:
Figure 1.2 - Simple ggplot2 scatterplot.
- Once you learn this framework, getting to know how ggvis works becomes much easier, and vice versa. A similar graphic can be crafted with the following code:
> library(ggvis)
> ggvis(data = cars, x = ~speed, y = ~dist) %>% layer_points()
- plotly would feel a little bit different, but it's not difficult at all to grasp how it works:
> library(plotly)
> plot_ly(data = cars, x = ~speed, y = ~dist, type = 'scatter', mode = 'markers')
Let's give these nuts and bolts some explanations.
In order to give a brief introduction to the data, step 1 starts by calling ?cars. This is a very useful way to get to know the variables and background of almost every data set coming from a package. Since ggplot2 requires data coming from data frames, the class() function checks whether that is the case; the answer is affirmative. At the end of this step, the head() function shows the first six observations.
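Beyond class() and head(), a few other base R functions give a quick feel for a data set before plotting. The sketch below is only a suggestion and uses nothing outside base R; the object name cars_shape is made up for illustration.

```r
# cars ships with the datasets package, so no download is needed.
is.data.frame(cars)    # is it really a data frame?
str(cars)              # variable names and types at a glance
summary(cars)          # quick numeric summary of each variable

cars_shape <- dim(cars)  # number of rows, then number of columns
cars_shape
```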
Moving on to step 2, after loading ggplot2, it demonstrates how to store the basic coordinate mapping and aesthetics in an object called plot1 (try it with the class() function). In order to set the basics, it uses a function (ggplot()) that initializes every single ggplot.
Storing a plot coming from any of the three packages in an object is optional, though it is a very useful way to proceed.
To properly set up ggplot(), start by declaring the data set using the data argument. After that, some basic aesthetics and coordinates are assigned. Different figures can ask for and work with different aesthetics; in the majority of cases those are named inside the aes() function.
As the book goes on, you're going to get used to the ways aesthetics can be declared, inside or outside the aes() function. For now, let's acknowledge that inside aes() it's possible to call data frame variables by name, and that they may be displayed in legends.
The aes() function takes "..." as an argument, popularly known as three dots but technically named ellipsis. It allows the user to pass an arbitrary number and variety of arguments. Since ggplot2 evaluates arguments lazily (only as they are requested), you could make up arguments and pass them into the aes() function with little or no trouble. Consider the following:
> plot1 <- ggplot(cars, aes(x = speed,y = dist, gorillaTroubleShooter = T, sight = 'Legolas'))
It would work as well as the earlier version. Just don't forget to name the arguments, and you've got yourself a good way to hide some Easter eggs in your code (also a good way to confuse unaware developers). Both aes() and ggplot() play core roles in building graphics within this package.
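The ellipsis and lazy evaluation are general R features, so they can be tried out without ggplot2. The sketch below is not how aes() is implemented; dots_names() and first_only() are hypothetical functions written only to illustrate the two ideas.

```r
# Hypothetical function (illustration only): list the names of the
# extra arguments captured by ... without ever evaluating them.
dots_names <- function(x, ...) {
  names(substitute(list(...)))[-1]
}

extra <- dots_names(1, gorillaTroubleShooter = TRUE, sight = "Legolas")
extra

# Lazy evaluation: y is never used inside the function, so the call
# to stop() that was passed as y is never triggered.
first_only <- function(x, y) x
lazy_result <- first_only(42, stop("this is never evaluated"))
lazy_result
```

An argument that is never requested costs nothing, which is why made-up arguments can travel through ... unnoticed.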
Up to step 2, only the coordinate mapping was set in the object named plot1; calling it alone displays an empty graphic. Step 3 uses + to add a layer; the layer called (geom_point()) took care of adding a geometry to the graphic. Besides the plus sign, ggplots are usually constructed with two families of functions (layers): geom_* and stat_*. While the first family comes with a fixed geometry and a default statistical transformation, the second one comes with a fixed statistical transformation and a default geometry (this is the grammar of graphics for real); the defaults can be tweaked.
plot1 + stat_identity(geom = 'point') works just the same as step 3. Since the geom argument of stat_identity() is set to 'point' by default, it's fine to skip it. The reason I declared it was to reinforce that if you call a statistical transformation you can pick the geometry, and it goes the other way round too (if you call a geometry you can change the statistical transformation).
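One way to check that both routes lead to the same picture is to compare the data each layer ends up drawing. This is a minimal sketch, assuming ggplot2 is installed; it relies on ggplot2's layer_data() helper, and the object names by_geom, by_stat, and same_xy are made up for illustration.

```r
library(ggplot2)

plot1 <- ggplot(cars, aes(x = speed, y = dist))

by_geom <- plot1 + geom_point()                   # geometry first
by_stat <- plot1 + stat_identity(geom = "point")  # statistic first

# layer_data() returns the data a layer draws after the plot is built;
# only the x and y positions are compared here.
same_xy <- isTRUE(all.equal(
  layer_data(by_geom)[, c("x", "y")],
  layer_data(by_stat)[, c("x", "y")],
  check.attributes = FALSE
))
same_xy
```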
Behind the scenes, geom_point() called the layer() function, which set a couple of arguments that culminated in the creation of a scatterplot. One may want to modify the axis labels and add a regression line. This can be done by simply adding more layers to the plot using the plus sign. One can stack as many layers as desired, as shown next:
> plot1 + geom_point() +
    labs(x = "Speed (mph)", y = "Distance (ft)") +
    geom_smooth(method = "lm", se = F) +
    scale_y_continuous(breaks = seq(0, 125, 25))
Result is exhibited by figure 1.3:
Figure 1.3 - Adding up several layers to a ggplot.
ggplot2's sum operator (which is actually a function) and its layer functions allow the user to make plots in a layered, iterative way. This splits the construction of complex graphics into several simple steps. It's also very intuitive and does not get any harder as you practice.
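The remark that the sum operator is actually a function holds for every operator in R, and it can be tried out in base R with plain numbers. The snippet below shows only the general mechanism, not ggplot2's own method for +; the object names are made up for illustration.

```r
# In R, an operator is an ordinary function that can be called by name
# when wrapped in backticks.
five <- `+`(2, 3)
five

# Because of that, the operator can be handed to other functions.
running_total <- Reduce(`+`, 1:4, accumulate = TRUE)
running_total
```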
Yet, there are limitations. The difficulty of making interactive graphics by itself may be one. These tasks, in the majority of cases, are very well handled by both ggvis and plotly as standalone packages. This leads us to steps 4 and 5.
Calling plotly::ggplotly() after bringing a ggplot up will coerce it into an interactive plot. It may fail sometimes. Do not forget to have plotly installed.
Step 4 loads the ggvis package using library() and then gives birth to an interactive plot. It holds many similarities with ggplot2. ggvis() handles the basic coordinate mapping, while the pipe operator (%>%) is used to add a layer created by the layer_points() function. Remember: pipe operator, not plus sign.
ggvis understands arguments differently depending on whether they are declared using = (always scaled) or := (never scaled). Also, ~ must come before variable names.
Function names change from ggplot2 to ggvis, and so does the operator used to add layers, but essentially the underlying logic stays the same. Layers coming from ggvis have several counterparts among ggplot2's; refer to the See also section to track some. In comparison with ggplot2, ggvis is much younger and some utilities may be yet to come; also, data don't need to come from a data frame object.
Step 5 draws an interactive plotly graph. A single function (plot_ly()) takes care of coordinate mapping and geometry. It can be designed in a more layered way using the add_trace() function, but there is no real need for that when the plot is this simple. Instead of having many functions dedicated to statistical transformations and geometries, those are declared by arguments inside the main function.
These three packages, ggplot2, ggvis, and plotly, are well coded and powerful graphics packages. Right before picking one of them to handle a task, always consider points like:
- What the package is able to do
- Time needed to master the skill set required
- Time required to handle the task
- Amount of time available
- Time to be saved later by the thing that you learned
Base R is also a feasible possibility. Whenever you face new challenges, it is a good thing to think through these points.
Having data come solely from data frames is a strong restriction, but it does oblige the user to be explicit about the data and also draws a very clear line between what is ggplot2's concern (data visualization) and what is not (model visualization). In order to avoid the headaches that come from downloading spreadsheets, setting up working directories, and loading data from files, we're taking an alternative way: getting data from packages instead.
By doing this, we ensure that readers only need to reach R's console to reproduce the recipes; we want nothing to do with web browsers (we're too cool for school, school meaning web browsers). We shall follow this approach to the end of the book. This recipe looks to the datasets base package to do so. ggplot2 has some data frames of its own.
It's also important to point out that the gg in ggplot2 and ggvis refers to the Grammar of Graphics. That's a very important and inspiring theory that has influenced plotly as well. The layered, iterative way that these packages handle plots might come from the Grammar of Graphics and makes building graphics much easier and more reasonable. Learning this theory may give you a head start in learning these packages, while learning these packages may give you a head start when it comes to learning the Grammar of Graphics.
ggplot2 cheatsheet made by RStudio can be found at https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Learn layers from ggplot2's author and a real R-universe star at http://rpubs.com/hadley/ggplot2-layers
Did you know that gg's + is actually a shortcut for a function? A clue on that and some exercises are hidden at http://rpubs.com/hadley/97970
Learn more about ggvis layers and how they can be translated into ggplot2 ones at http://ggvis.rstudio.com/layers.html
Learn more about ggvis scaled and unscaled arguments at http://ggvis.rstudio.com/properties-scales.html
Previously, a brief introduction to the frameworks of the ggplot2, ggvis, and plotly packages was given. Next, we get started with ggplot2's graphical primitives, using them in a series of recipes with related examples made with ggvis and plotly.
There are a total of eight graphical primitives in ggplot2, one of them already covered in this chapter (geom_point()). It's important to know the primitives well: what they do and when to use them. As fundamental building blocks, they play an essential role in the drawing process. A series of tasks can be handled relying on primitives when there is no dedicated function for them; sometimes, even if there is one, primitives can handle the task much better.
A good example is the dot plot. There is a dedicated geom_dotplot() function, but sometimes it is much easier to draw dot plots using geom_point(). Now, let's see how ggplot2 can brew figures using primitives and create related ones using ggvis and plotly.
- After loading the package, the primitives geom_point() and geom_path() can be stacked in order to plot lines with markers:
> library(ggplot2)
> plot1 <- ggplot(cars, aes(x = speed, y = dist))
> plot1 + geom_point() + geom_path()
The resulting output is shown in the following figure:
Figure 1.4 - Lines with markers plot made by ggplot2's primitives.
- The same mission can be accomplished by the ggvis package, relying on the following code:
> library(ggvis)
> ggvis(cars, x = ~speed, y = ~dist) %>% layer_points() %>% layer_paths()
Figure 1.5 displays a representation of the resulting graphic (only the default theme looks different):
Figure 1.5 - Similar lines and markers plot done by ggvis.
- Without using the translation function (ggplotly()) from the plotly package, it's also possible to code a similar graphic from scratch relying only on plot_ly():
> library(plotly)
> plot_ly(cars, x = ~speed, y = ~dist, type = 'scatter', mode = 'lines+markers')
Figure 1.6 exhibits a snapshot of the graphic brewed by the latest code:
Figure 1.6 - Similar lines and markers plot done by plotly.
Let's understand how these are unfolding.
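The complete list of ggplot2's primitives is given by geom_blank(), geom_path(), geom_ribbon(), geom_polygon(), geom_segment(), geom_rect(), geom_text(), and geom_point(). Every primitive starts with geom_*, but not every geom_* is a primitive. In fact, most geom_* functions are not primitives.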
geom_blank() seems to be the simplest of the primitives. Calling it right after setting ggplot() will display a blank plot with the axes already adjusted. It's mostly used to check the axis limits given by the data itself. Maybe you can find it useful for another task; suit yourself.
Other primitives work in a similar way to one another. That is the case for the geom_path(), geom_ribbon(), and geom_polygon() functions. The first one draws lines between coordinates; the second one looks like the first but thicker, requiring additional aes() arguments (ymin and ymax). The last function draws filled polygons.
By setting only the starting and ending points, geom_segment() adds a line segment. geom_rect() adds a rectangle to the plot, requiring four corners to do so (xmin, xmax, ymin, and ymax). geom_text() adds text at the given coordinates. Some graphics display only text for each observation instead of points, which is also a good way to display additional information.
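To see which aesthetics some of these primitives ask for, here is a minimal sketch that stacks geom_ribbon(), geom_segment(), and geom_rect() on a single plot. It assumes ggplot2 is installed; the three small data frames (band, seg, and box) are made-up toy data, not taken from the recipe.

```r
library(ggplot2)

# Made-up toy data, for illustration only
band <- data.frame(x = 1:5,
                   lower = c(1, 2, 2, 3, 3),
                   upper = c(3, 4, 5, 5, 6))
seg  <- data.frame(x = 1, y = 1, xend = 5, yend = 6)
box  <- data.frame(xmin = 2, xmax = 3, ymin = 2, ymax = 4)

# Each layer gets its own data and its own aesthetics.
p_prims <- ggplot() +
  geom_ribbon(data = band,
              aes(x = x, ymin = lower, ymax = upper), alpha = 0.3) +
  geom_segment(data = seg,
               aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_rect(data = box,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = NA, colour = "black")
p_prims
```

Giving each layer its own data frame keeps the required aesthetics of each primitive easy to read.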
The remaining primitive is geom_point(). It's the only primitive called directly so far; it plots points at the given coordinates. Two important points must be highlighted here. One, getting to know the primitives might give you an idea of which functions you will need the most and which the least, but that is not all that ggplot2 is capable of doing. Primitives are nothing but the building blocks used by other functions.
For the second point, as the previous recipe stated, you can stack as many layers as you feel like. That is no less true for primitive functions, but it's good to know how they interact with one another. For example, calling geom_blank() after geom_point() will not override the points with a blank space.
After loading ggplot2 and setting the base aes(), step 1 creates a simple plot with lines and markers. While geom_point() displays the markers, geom_path() draws the lines between them. Note that the latter function draws lines following the order given by the data set rows, so we can call this function order-sensitive.
In many situations, reordering data will improve a viz. This may be the case for dot, box, violin, and bar plots, among others. If you want paths to be ordered along the x axis, geom_line() does that by itself, though it is not a primitive.
For this particular plot, the lines carry no meaning; they actually mislead. Lines are better suited to indicating some sort of order within the data, like chronological order. The only reason they were used was to demonstrate how primitives can be stacked to produce a viz different from the one done before.
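The order sensitivity of geom_path() is easier to see once the rows are shuffled. The sketch below assumes ggplot2 is installed; the shuffled copy of cars and the object names are made up for illustration.

```r
library(ggplot2)

set.seed(1)                             # reproducible shuffle
shuffled <- cars[sample(nrow(cars)), ]  # same data, rows in random order

# geom_path() connects points in row order, so this one zigzags.
p_path <- ggplot(shuffled, aes(x = speed, y = dist)) + geom_path()

# geom_line() connects points in order of the x variable.
p_line <- ggplot(shuffled, aes(x = speed, y = dist)) + geom_line()

# Sorting the rows first makes geom_path() follow speed, then distance.
sorted   <- shuffled[order(shuffled$speed, shuffled$dist), ]
p_sorted <- ggplot(sorted, aes(x = speed, y = dist)) + geom_path()

p_path
p_line
p_sorted
```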
Step 2 draws a plot similar to the one crafted in step 1, but using ggvis. library() loads the package, while the ggvis() function is used to map the basic aesthetics. The following function (layer_points()) sets up the points to work as our markers, and layer_paths() draws the lines between them.
The earlier section argued that ggvis is very similar to ggplot2 in the way graphics are coded. This section actually demonstrated that. First, the function gets the data set and the variables as arguments. Pipe operators (%>%) are used instead of the plus sign to stack up the layers, and layer_* works in a very similar way to geom_*.
In step 3, a similar plotly graphic is crafted. The same function responsible for setting the basic aesthetic mapping (plot_ly()) also deals with geometries. The arguments type and mode set the geometries, both given as strings. These two arguments are meant to work together. type = 'scatter' enables the lines and markers modes. Each type has its own set of modes attached to it; consult the reference manual to catch them all. Since we wanted to use markers and lines at the same time, we built a string containing those two elements separated by the plus sign ('lines+markers') and assigned it to mode. mode = 'lines+markers' works as well as mode = 'markers+lines'. Modes can be stacked, and order does not matter.
Figures 1.4 to 1.6 much resemble a time series, but they are not one, and that may give the wrong intuition. There are observations for two variables and neither one is time. Notice how, for some speed values, there are several different stopping distances. Note that the cars data frame is ordered first by speed and then by distance; paths obey the row order of the data, while for the point geometry order doesn't really matter.
Adding the path geometry was misleading; geom_point() would be enough. The goal here was to demonstrate how primitives interact and not to give a meaningful figure. Next, let's build fictional data and draw a graphic that tells the story the right way. Picture a small classroom with only 7 students. The teacher builds a data frame with studying hours and grades for each student.
Data can be created like this:
> allnames <- c('Phill','Ross','Kate','Patrice','Peter','James','Monica')
> classr <- data.frame(names = allnames)
> classr$hours <- c(4, 16, 8, 11, 6, 14, 8)
> classr$grades <- c(4, 9.5, 6, 4, 6, 9, 7.5)
The geom_text() primitive could be used to summon a meaningful graphic:
> library(ggplot2)
> plot2 <- ggplot(classr, aes(x = hours, y = grades))
> plot2 + geom_text(aes(label = names))
The result is shown in figure 1.7:
Figure 1.7 - Plotting grades and hours as texts using ggplot2's primitive.
ggvis and plotly codes are shown next:
> library(ggvis)
> ggvis(classr, x = ~hours, y = ~grades, text := ~names) %>% layer_text()
> library(plotly)
> plot_ly(classr, x = ~hours, y = ~grades, type = 'scatter', mode = 'text', text = ~names)
This last brief example illustrates how to brew graphics using only primitives in a more meaningful way. It's very important to think about it. The best graphic is the one that tells the right story objectively, not the one with the most layers.
Did you know that both ggvis and plotly can guess which geometry you are looking for? Based on the basic aesthetics defined, they make a guess and adopt a certain geometry. They look at how many variables of each kind (discrete or continuous) were given, and for some combinations they are able to make a guess. For the latest example, they would have guessed the points geometry.
Figures produced by both packages will be displayed in the Viewer tab if you're using RStudio (they are interactive; try hovering the mouse over a plotly figure). Figures can be exported as web pages. Other than that, they can be exported as PNG, JPEG, and BMP, losing the interactive property.
This recipe aimed to demonstrate how to construct plots using ggplot2 primitives, and how to build similar graphs using other packages. A question you should always ask yourself is whether the geometry adopted goes along with the data used; in other words, whether the graphic tells the story that you want to tell.
The recipe's goal was to introduce you to the graphical primitives of ggplot2 and to draw simple graphics using only primitives. An additional goal was to draw related graphics using the ggvis and plotly packages.
The next chapters dive deeper; each one shall tackle some families of graphics, highlighting nuts and bolts on the way to building high-quality plots. As the book advances, so does the complexity involved. At some point, we are going to be plotting interactive globes and 3D surfaces, and developing web applications. I find it pretty cool and hope you enjoy it.
Chapter 2, Plotting Two Continuous Variables, takes care of scatterplots. It's a very popular kind of plot, and very useful too, but there is a big problem: over-plotting. The following chapter will not only teach how to craft scatterplots, but also how to deal with that problem and how to improve scatterplots by adding marginal plots. Let it rip!