Data Visualization

In this article by Atmajitsinh Gohil, author of the book R Data Visualization Cookbook, we will cover the following recipes:

  • Introducing a scatter plot
  • Scatter plot with texts, labels, and lines
  • Connecting points in a scatter plot
  • Generating an interactive scatter plot
  • A simple bar plot
  • An interactive bar plot
  • A simple line plot
  • Line plot to tell an effective story
  • Generating an interactive Gantt/timeline chart in R
  • Merging histograms
  • Making an interactive bubble plot
  • Constructing a waterfall plot

Introduction

The main motivation behind this article is to introduce the basics of plotting in R and an element of interactivity via the googleVis package. The basic plots are important because many R packages reuse the base plot arguments, so understanding them creates a good foundation for new R users. We will start by exploring scatter plots, which are among the most basic plots for exploratory data analysis, and then move on to interactive plots. Every section starts with an introduction to a basic R plot and builds interactive plots thereafter, combining R's analytics with the googleVis package to introduce the element of interactivity.

The googleVis package is an R interface to the Google Charts API for creating interactive plots (it is a community-developed R package, not a Google product). A range of chart types is available with the googleVis package, which lets us plot the same data in several ways and select the chart that conveys the message best. The package undergoes regular updates, and new chart types have been added across releases.

Readers should note that there are other ways to create interactive plots in R, but it is not possible to explore all of them, and hence I have selected googleVis to display interactive elements in a chart. I have selected it purely based on my experience with interactivity in plots. Another good option for interactive graphics is GGobi.

The first part introduces the basics of plotting in R using a scatter plot as an example and also introduces interactivity using the googleVis package. The second part introduces bar plot functionality in R and uses the googleVis package to create an interactive bar plot. The third part delves into line plots and how we can make them more meaningful by simply making use of the options available for line charts in the googleVis package. The fourth part discusses interactive histograms. We conclude the article by introducing interactive bubble plots and waterfall plots in parts five and six, respectively.

Introducing a scatter plot

Scatter plots are used primarily to conduct a quick analysis of the relationships among different variables in our data. It is simply plotting points on the x-axis and y-axis. Scatter plots help us detect if two variables have a positive, negative, or no relationship. In this recipe, we will study the basics of plotting in R using scatter plots. The following screenshot is an example of a scatter plot:

Getting ready

To implement a basic scatter plot in R, we will use the Carseats data available with the ISLR package.

How to do it…

We start this recipe by installing the necessary packages using the install.packages() function and loading them in R using the library() function:

install.packages("ISLR")
library(ISLR)

Next, we need to load the data in R. Many R packages ship with datasets, and these become available once the package is loaded with library(). We can attach the data using the attach() function, which lets us refer to its columns by name. We can view the list of datasets available in the loaded packages by typing data() in the R console window:

attach(Carseats)

Once we attach the data, it's a good practice to view the data using head(Carseats). The head() function will display the first six entries of the dataset and will allow us to know the exact column headings of the data:

head(Carseats)

The data can be plotted in R by calling the plot() function. The plot() function comes with a variety of options, and the best way to learn about them is to type ?plot in the R console window:

plot(Income, Sales, col = c(Urban), pch = 20,
  main = "Sales of Child Car Seats",
  xlab = "Income (000's of Dollars)",
  ylab = "Unit Sales (in 000's)")

This particular plot requires us to plot the legends as the points have two different color schemes. In R, we can add a legend using the legend() function:

legend("topright",cex = 0.6, fill = c("red","black"), 
  legend = c("Yes","No"))

How it works…

The install.packages() and library() functions are used in most of the recipes in this book.

The attach() function is a convenient way to reference the data as it lets us avoid typing the $ notation. The $ notation is another way to reference columns in a dataset and is discussed in the next recipe. The head() function takes the data as its first argument. To view fewer lines in the R console window, we can type head(Carseats, 3). The tail(Carseats) function displays entries from the bottom of the dataset.
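As a small illustration of the alternatives to attach(), the sketch below uses the built-in mtcars dataset as a stand-in for Carseats (an assumption made so that the example runs without installing ISLR). It compares the $ notation with with(), and calls head() and tail() with an explicit row count.

```r
# mtcars ships with base R; it stands in for Carseats here
first_three <- head(mtcars, 3)   # first 3 rows instead of the default 6
last_three  <- tail(mtcars, 3)   # last 3 rows

# Two ways to reach a column without attach()
mean_dollar <- mean(mtcars$mpg)          # $ notation
mean_with   <- with(mtcars, mean(mpg))   # with() evaluates inside the data frame

print(first_three)
print(c(dollar = mean_dollar, with = mean_with))
```

Unlike attach(), with() keeps the column lookup local to a single expression, which can help avoid name clashes in longer scripts.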

The data can be plotted in R by calling the plot() function. The first two arguments in the plot() function refer to the data to be plotted on the x-axis (Income) and y-axis (Sales). The col argument allows us to assign color to our data points. In this case, we would like to use a qualitative data column (Urban) to color our points. The default color in R is black but we can change this using the col = "blue" argument. Please refer to the code file to learn about various other options. The pch argument selects the plotting symbol; the value 20 plots small filled circles. To view all the available pch values, please type ?par or ?points in the R console window. We can also label the heading of the plot using the main = "Sales" argument. The xlab and ylab arguments are used to label the x and y axes in R.

A legend is necessary for this scatter plot as we would like to differentiate between sales in urban and rural areas. The first argument in the legend() function corresponds to the position of the legend. The cex argument sets the text size; its default value is 1. The fill argument fills the boxes with the specified colors and the legend argument applies the labels to each box.
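The coloring and legend steps can be sketched with a built-in dataset. The example below assumes mtcars as a stand-in for Carseats and the transmission type (am) as the qualitative column; the color vector and the legend are both derived from the same factor so that they stay in sync.

```r
# Stand-in data: mtcars ships with base R (the recipe itself uses ISLR's Carseats)
cars_df <- mtcars
cars_df$am <- factor(cars_df$am, levels = c(0, 1),
                     labels = c("automatic", "manual"))

# The integer codes of the factor index into the colour vector
palette_used <- c("black", "red")
point_cols <- palette_used[as.integer(cars_df$am)]

plot(cars_df$wt, cars_df$mpg, col = point_cols, pch = 20,
     main = "Fuel economy by weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Labels come from the factor levels, colours from the same vector
legend("topright", cex = 0.6, fill = palette_used,
       legend = levels(cars_df$am))
```

Building the legend from levels() rather than typing the labels by hand reduces the risk of pairing a label with the wrong color.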

Scatter plots with texts, labels, and lines

In the previous recipe, we studied how to construct a very basic scatter plot. In order for the plot to deliver a strong message, we need to add elements such as text, labels, and lines. The main objective of a visualization is to grab the attention of its audience and make the optimal use of the data available. The audience should be able to get most of its information from the visualization itself.

The following screenshot plots the child mortality rate in selected countries. The story we would like to share with the readers is the relationship between child mortality rate and Gross Domestic Product (GDP) of a country. We can improve on our understanding of these relationships if the readers can compare extreme scenarios or compare a specific country with a benchmark (average child mortality rate).

How to do it…

In the previous recipe, we used a dataset from the ISLR package. But what if we would like to import our own data in R? We can set a working directory in R using the setwd() function. This is a necessary step as R will always search for the datafile in the active/current directory. The setwd() function allows us to set our working directory:

setwd("D:/book/scatter_Area/data")

The read.csv() function is used to import the data in R:

child = read.csv("chlmort.csv", header = TRUE, sep =",")

The summary() function is used to get a general idea about the distribution of variables in our data. The head() function allows us to view the actual data:

summary(child)
head(child)

The following code is used to plot the skeleton of our scatter plot. A few of the arguments may look familiar from the previous recipe. We have used child$gdp_bil and child$child instead of gdp_bil and child. This change was necessary as we did not use the attach() command:

plot(child$gdp_bil, child$child, pch = 20, col = "#756bb1",
  xlim = c(0, max(child$gdp_bil)), ylim = c(0, 190),
  xlab = "GDP in Billions in current US$",
  ylab = "Child Mortality rate",
  main = "Child Mortality Rate in selected countries for 2012")

In order to plot a horizontal or a vertical line in R, we can use the abline() function. The h argument draws a horizontal line at the given y value. The value 36.18 is the world average child mortality rate, and adding it makes it easier to compare the data across countries. The lwd = 1 argument sets the width of the line and col = "red" sets its color:

abline(h = (36.18), lwd = 1, col = "red")

To generate an effective presentation, we add labels to extreme points on our plot. We can immediately observe that GDP and child mortality rate share a negative relationship. We can go a step further and make the plot easy to interpret if we add text, using the text() function, to extreme observations in our data:

text(8000,25,labels = c("Luxembourg"), cex = 0.75)
text(600,182,labels= c("Sierra Leone"), col = "red", cex = 0.75)
text(4000, 50,labels = c("Average Child Mortality"), 
  col = "red", cex = 0.75)

 

How it works…

To import data in R, we need to direct R to the folder where the data is stored. We can either type the command setwd() to let R know where to find the file, or, in RStudio, navigate to the folder via Session | Set Working Directory | Choose Directory.
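An alternative to changing the working directory is to pass a full path to read.csv(). The sketch below writes a small, entirely hypothetical CSV file to a temporary folder and reads it back; the file name and values are made up for illustration and are not the chlmort.csv data.

```r
# Hypothetical data written to a temporary folder so the example is self-contained
data_dir <- tempdir()
csv_path <- file.path(data_dir, "example_mortality.csv")
toy <- data.frame(country = c("A", "B", "C"),
                  gdp_bil = c(50, 400, 1200),
                  child   = c(120, 45, 6))
write.csv(toy, csv_path, row.names = FALSE)

# Read with a full path instead of calling setwd() first
child_toy <- read.csv(csv_path, header = TRUE)

summary(child_toy$child)   # distribution of one column
head(child_toy)            # first rows of the imported data
getwd()                    # the working directory is left unchanged
```

file.path() builds the path with the correct separator for the operating system, so the same script can run on Windows and on Unix-like systems.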

Under the plot() function, we have introduced the $ notation. The name before the $ sign corresponds to the data and the name after the $ sign refers to the column (child$gdp_bil). We have used the ylim argument to specify the y limits for the plot. We can also assign a hex value to the color, such as col = "#756bb1", instead of typing the name of the color. All the remaining arguments are discussed in detail in the previous recipe.

The text() function uses three arguments: the x coordinate, the y coordinate, and the labels. The x and y coordinates inform R as to where exactly to place the labels. The labels argument is the actual label to be placed. The cex = 0.75 argument sets the size of the font. We can learn about various other text arguments available in R using the command ?text.
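Instead of typing label coordinates by hand, the extreme observations can be located programmatically. The following sketch assumes the built-in swiss dataset as a stand-in for the child mortality data; it draws the average as a reference line and labels the highest and lowest points.

```r
# swiss ships with base R; it stands in for the child mortality data
d <- swiss
avg <- mean(d$Infant.Mortality)

plot(d$Education, d$Infant.Mortality, pch = 20, col = "#756bb1",
     ylim = c(0, max(d$Infant.Mortality) + 2),
     xlab = "Education", ylab = "Infant mortality",
     main = "Infant mortality in Swiss provinces")

# Horizontal reference line at the average
abline(h = avg, lwd = 1, col = "red")

# Row positions of the highest and lowest observations
hi <- which.max(d$Infant.Mortality)
lo <- which.min(d$Infant.Mortality)

# pos = 4 places each label to the right of its point
text(d$Education[c(hi, lo)], d$Infant.Mortality[c(hi, lo)],
     labels = rownames(d)[c(hi, lo)], cex = 0.75, pos = 4)
```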

There's more…

At times, we would like to add a trend line to our plot. We could achieve this in a variety of ways, for example by fitting a regression line or by using the scatter.smooth() function, which plots a smooth line using Locally Weighted Scatterplot Smoothing (LOESS). We will use LOESS to study the trend in our data. The idea behind LOESS is to fit a weighted polynomial locally, giving more weight to points near the point whose value is being estimated and less weight to points further away. The method was initially proposed by Cleveland. We will avoid going into the mathematical details of how LOESS is calculated, and instead we will assume that R will correctly apply this to our data.

The dashed line in the following screenshot is not the usual regression line but a trend line fitted using LOESS methodology:

Plotting a trend line is not very difficult. With the data already imported, the following code plots the points and adds the LOESS trend line using the scatter.smooth() function:

scatter.smooth(child$gdp_bil, child$child, pch = 20, lwd = 0.75,
  col = "blue", lpars = list(lty = 3, col = "black", lwd = 2),
  xlab = "GDP in Billions in current US$",
  ylab = "Child Mortality rate",
  main = "Child Mortality Rate in selected countries for 2012")

We use the lpars argument to style the trend line. The attributes passed in the lpars list are the same line attributes used in the previous two plots. Readers can learn more about the scatter.smooth() function by typing ?scatter.smooth in the R console window.
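To see the difference between a LOESS trend and an ordinary regression line, the two can be drawn on the same plot. The sketch below assumes the built-in cars dataset as a stand-in and fits the smoother with loess() directly; the span and degree values are illustrative choices, not settings taken from the recipe.

```r
# cars ships with base R: speed (mph) and stopping distance (ft)
fit <- loess(dist ~ speed, data = cars, span = 2/3, degree = 1)

# Evaluate the smoother on an evenly spaced grid inside the observed range
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                               length.out = 50))
trend <- predict(fit, newdata = grid)

plot(cars$speed, cars$dist, pch = 20, col = "blue",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
lines(grid$speed, trend, lty = 3, lwd = 2)            # dashed LOESS trend
abline(lm(dist ~ speed, data = cars), col = "red")    # straight regression line
```

Where the dashed curve departs from the red line, the relationship is not well described by a single straight line.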

Connecting points in a scatter plot

The primary objective of this recipe is to understand how we can connect points in a scatter plot. The plot is inspired by an infographic by Alberto Cairo that connects points based on three variables: the Gini coefficient, GDP, and the president in office in Brazil. In this recipe, we will apply the same concept to the USA economy.

How to do it…

We start by importing the data in R. The dataset comprises the Gini coefficient, GDP data for the USA, and the USA presidents. The Gini coefficient is used as a measure of inequality in a country:

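income = read.csv("ginivsgdp1.csv", header = TRUE, stringsAsFactors = TRUE)

The plot can be generated using the plot() function. The Presidents column must be a factor for the coloring to work; since R 4.0.0, read.csv() no longer converts strings to factors by default, hence stringsAsFactors = TRUE in the preceding line:

plot(income$gdp_ann, income$Gini, pch = 20, col = c(income$Presidents),
  type = "o", xlab = "GDP of USA", ylab = "Gini coefficient",
  main = "Inequality in USA", xaxp = c(0, 18000, 8))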

Since we use col to distinguish between the terms of various presidents in the USA, we need a legend in the plot. The legend is added to our plot using the legend() function:

legend("bottomright",fill = c(6,7,4,2,9,5,3,1,8), legend = 
  c("Johnson","Nixon","Ford","Carter","Reagan","G.Bush","Clinton",
    "Bush","Obama"), bty = "n", cex=0.7)

How it works…

Most of the arguments used in the plot() function have been discussed in prior recipes of this article. The type = "o" argument draws both the points and the lines connecting them. Readers curious to know more about the various type options should type ?plot in the R console window.

The important point to note is that R uses a qualitative variable, Presidents, to plot points of different colors. To build the legend, we need to know which color was assigned to each president so that we can pass the matching values to the fill argument. R stores a factor as integer codes (one per level, in alphabetical order by default) and uses these codes to color each point. We can view the codes by typing the following lines:

cols = as.numeric(as.factor(income$Presidents))
cols

We can simply pass these numeric values as a vector to the fill argument of the legend() function. The first argument in the legend() function is the position of the legend, and the legend argument supplies the labels. We suppress the box around the legend using the bty = "n" argument. The cex argument allows us to size the labels in R.
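The mapping from factor levels to color codes can be inspected on a small example. The labels below are hypothetical and only illustrate the mechanism; the legend is generated from the levels instead of a hand-typed vector of codes.

```r
# Hypothetical labels; factor levels are sorted alphabetically by default
pres <- factor(c("Johnson", "Nixon", "Ford", "Carter", "Nixon"))
lev <- levels(pres)
codes <- as.integer(pres)   # integer code of each observation

x <- seq_along(pres)
plot(x, x, pch = 20, cex = 2, col = codes, xlab = "Index", ylab = "Value")

# Level i is drawn in colour i, so the legend follows from the levels
legend("bottomright", fill = seq_along(lev), legend = lev,
       bty = "n", cex = 0.7)

# Number of colours in the current palette (8 by default)
length(palette())
```

Because the default palette holds eight colors, a ninth level would reuse the first color; with nine presidents, it may be worth defining a longer palette.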

There's more…

We could also add texts and lines to our plot as shown in the following screenshot:

We have implemented the same code as mentioned in the How to do it… section of this recipe. But instead of applying different colors, we can also add a line and text to our chart:

abline(v = 14958, lwd =1.5)
text(16200, 0.46,"Obama")

The text() functionality in R will allow us to add a text. The first and second arguments under text() represent the x and y coordinates of the plot. The third argument is the actual label to be applied. The abline() function is used to apply a vertical line to our plot. To plot labels on all the points, we can use the following code:

plot(income$gdp_ann,income$Gini,pch = 20, col = "Black", 
  xlab =" GDP of USA", ylab = "Gini coefficient", 
    xaxp = c(0,18000,10),bty = "n")
text(income$gdp_ann, income$Gini,income$years, cex = 0.7, 
  pos = 2, offset = 0)

Since we want to label every point with its year, we pass income$years as the labels argument. The pos argument is used to adjust the position of the label around the point. Readers may observe overlapping labels and can try to fix this by adjusting pos and offset. Readers can type ?text in the R console window to learn more about the text() function.

Generating an interactive scatter plot

In the previous recipe, we studied how multivariate data can be displayed on a scatter plot. We used color as a visual cue to display information related to different presidents in the USA and how the economy performed. In this recipe, we will build on the same idea and introduce interactive scatter plots. The limitation of a static plot is that it is hard to interpret if the points overlap, and without gridlines the data may be hard to decipher as well. Interactive plots help us to overcome this limitation. In this recipe, we will plot a simple interactive scatter plot with a trend line as shown in the following screenshot:

Getting ready

We would plot an interactive scatter plot using the googleVis package in R.

How to do it…

To generate an interactive scatter plot, we will install and load the googleVis package in R. We can import the data in R using the read.csv() function:

install.packages("googleVis")
library(googleVis)
income = read.csv("ginivsgdp1.csv", header = TRUE)

By default, the googleVis package will use the first column as the x variable. Since our first column is not GDP, we will use the GDP data and gini data from the imported CSV file to construct a new data frame in R:

scater = data.frame(gdp = c(income$gdp_ann),gini= c(income$Gini))

Now, we can generate an interactive scatter plot in R. Note that the googleVis package will display the scatter plot in a new browser window only when the plot() function is executed:

scaterp4 = gvisScatterChart(scater, options = list(width = 650,
  height = 600, legend = "none",
  title = "Relationship between Inequality and GDP growth in USA",
  hAxis = "{title :'GDP'}",
  vAxis = "{title :'Gini'}",
  dataOpacity = 0.8,
  trendlines = "{0:{type : 'linear', visibleInLegend: true, showR2: true}}"))
plot(scaterp4)

How it works…

Readers new to R can learn about the data.frame() function by typing ?data.frame in the R console window.

The first argument in the gvisScatterChart() function is our data frame. We can have more than one column in our data frame but the x-axis will be assigned to the first column. The googleVis package comes with some very useful options that allow us to add labels to the x and y axes, and add a title and opacity to our scatter plot. We will discuss these in detail at a later point in this article. The options are passed to a chart as a list through the options argument.

The trendlines argument adds a linear trend line to our scatter plot. Note the use of 0 in the trendlines argument, which refers to the first series. The visibleInLegend and showR2 arguments add the estimates and the coefficient of determination to the plot's legend section.
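The slope and coefficient of determination reported by a linear trend line can be cross-checked in base R with lm(). The numbers below are hypothetical stand-ins for the GDP and Gini columns, used only to keep the sketch self-contained.

```r
# Hypothetical data standing in for the gdp and gini columns
toy <- data.frame(gdp  = c(1, 2, 3, 4, 5, 6),
                  gini = c(0.39, 0.40, 0.41, 0.43, 0.44, 0.46))

# Ordinary least squares fit of gini on gdp
fit <- lm(gini ~ gdp, data = toy)

slope <- unname(coef(fit)["gdp"])   # change in gini per unit of gdp
r2 <- summary(fit)$r.squared        # coefficient of determination

print(c(slope = slope, r_squared = r2))
```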

There's more…

In this section, we will learn to plot multiple y-axis values on the same scatter plot. This is shown in the following screenshot. Note that we have processed the data and stored it as a new CSV file.

The code used to generate this plot is exactly the same as the one discussed under the How to do it… section of this recipe. But we have altered the data file and stored it as a new CSV file. To learn more about various options available, please refer to the googleVis developer website.
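The data layout for several y series can be sketched as follows: the first column holds the x values and every further column becomes one series. The column names and values are hypothetical, and the chart is only built when googleVis is installed.

```r
# Hypothetical data: x values first, then one column per y series
multi <- data.frame(gdp    = c(1000, 5000, 10000, 15000),
                    gini   = c(0.39, 0.40, 0.43, 0.46),
                    gini_b = c(0.35, 0.36, 0.38, 0.41))

# Build the chart only when googleVis is available
if (requireNamespace("googleVis", quietly = TRUE)) {
  chart <- googleVis::gvisScatterChart(multi,
    options = list(width = 650, height = 600,
                   hAxis = "{title: 'GDP'}",
                   vAxis = "{title: 'Gini'}"))
  if (interactive()) plot(chart)   # opens the chart in a browser
}
```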

A simple bar plot

A bar plot can often be confused with a histogram (studied later in this article). Histograms are used to study the distribution of data whereas bar plots are used to study categorical data. Both plots may look similar to the naked eye, but the main difference is that the width of the bars in a bar plot is not of significance, whereas in a histogram the width of each bar represents the interval (bin) of values it covers and its area represents the frequency of data in that interval.
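The distinction can be seen by plotting a categorical and a continuous column of the same dataset. The sketch assumes the built-in mtcars data; the number of breaks is an arbitrary choice for illustration.

```r
# Categorical variable: count the cars in each cylinder class
counts <- table(mtcars$cyl)
barplot(counts, xlab = "Cylinders", ylab = "Number of cars",
        main = "Bar plot: categorical variable")

# Continuous variable: bin the values; each bar spans an interval
h <- hist(mtcars$mpg, breaks = 5, xlab = "Miles per gallon",
          main = "Histogram: continuous variable")

h$breaks   # bin boundaries that define the bar widths
h$counts   # number of observations falling in each bin
```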

In this recipe, I have made use of Infant Mortality Rate in India. The data is made available by the Government of India. The main objective is to study the basics of a bar plot in R as shown in the following screenshot:

How to do it…

We start the recipe by importing our data in R using the read.csv() function. R will search for the data under the current directory, and hence we use the setwd() function to set our working directory:

setwd("D:/book/scatter_Area/chapter1")
data = read.csv("infant.csv", header = TRUE)

Once we import the data, we would like to process it by ordering the rows. We will order the data using the order() function in R. We would like R to order the column Total2011 in decreasing order:

data = data[order(data$Total2011, decreasing = TRUE),]

We use the ifelse() functionality to create a new column. We would utilize this new column to add different colors to bars in our plot. We could also write a loop in R to do this task but we will keep this for later. The ifelse() function is quick and easy. We instruct R to assign yes if values in the column Total2011 are more than 12.2 and no otherwise. The 12.2 value is not randomly chosen but is the average infant mortality rate of India:

new = ifelse(data$Total2011>12.2,"yes","no")

We would now like to join the vector of yes and no to our original dataset. In R, we can join columns using the cbind() function. Rows can be combined using rbind():

data = cbind(data,new)

When we initially plot the bar plot, we observe that we need more space at the bottom of the plot. We can achieve this in R by passing the mar argument to the par() function. The mar argument takes a vector of four values that set the bottom, left, top, and right margins, in that order:

par(mar = c(10,5,5,5))

We can now generate a bar plot in R using the barplot() function. The abline() function is used to add a horizontal line on the bar plot:

barplot(data$Total2011, las = 2, names.arg = data$India, width = 0.80,
  border = NA, ylim = c(0, 20), col = "#e34a33",
  main = "Infant Mortality Rate of India in 2011")
abline(h = 12.2, lwd =2, col = "white", lty =2)

How it works…

The order() function returns the permutation of row indices that sorts the data (in increasing or decreasing order) by the given variable. We would like to plot the bars from highest to lowest, and hence we need to arrange the data. The ifelse() function is used to generate a new column. We use this column under the There's more… section of this recipe. The first argument of the ifelse() function is the logical test to be performed. The second argument is the value assigned if the test is true, and the third argument is the value assigned if the test is false.
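The ordering and flagging steps can be traced on a tiny data frame. The state names and rates below are hypothetical; only the 12.2 threshold is taken from the recipe.

```r
# Hypothetical rates for four states
rates <- data.frame(state = c("A", "B", "C", "D"),
                    rate  = c(9.5, 14.1, 12.2, 16.0),
                    stringsAsFactors = FALSE)

# order() returns row positions; indexing with them sorts the rows
rates <- rates[order(rates$rate, decreasing = TRUE), ]

# Flag rows strictly above the threshold
rates$above <- ifelse(rates$rate > 12.2, "yes", "no")

print(rates)
```

Note that a value equal to the threshold is flagged as no, because the test uses a strict greater-than comparison.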

The first argument in the barplot() function defines the height of the bars and horiz = TRUE (not used in our code) instructs R to plot bars horizontally. The default setting in R will plot bars vertically. The names.arg argument is used to label the bars. We also specify border = NA to remove the borders, and las = 2 sets the orientation of the axis labels. Try the las values 0, 1, 2, and 3 and observe how the plot changes.

The h argument in the abline() function gives the y value at which a horizontal line is drawn; the v argument would draw a vertical line instead. The lwd, lty, and col arguments are used to define the width, line type, and color of the line.

There's more…

While plotting a bar plot, it's a good practice to order the data in ascending or descending order. An unordered bar plot does not convey the right message and the plot is hard to read when there are more bars involved. When we observe a plot, we are interested to get the most information out, and ordering the data is the first step toward achieving this objective.

We have not yet shown how the column created with ifelse() and cbind() is used in the plot. If we would like to color the bars differently to show which states have an infant mortality rate above the country level, we can do this by passing col = as.integer(factor(data$new)) in place of col = "#e34a33"; this maps no and yes to the color codes 1 and 2.
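A variant of this idea maps the yes/no test directly to color names. The sketch below assumes the built-in USArrests data (first ten states) as a stand-in for the infant mortality file, and uses the mean of those ten values as the reference level.

```r
# Stand-in data: first ten states of the built-in USArrests dataset
d <- USArrests[1:10, ]
d <- d[order(d$Murder, decreasing = TRUE), ]   # highest bar first
avg <- mean(d$Murder)

# One colour per bar: highlight the states above the average
bar_cols <- ifelse(d$Murder > avg, "#e34a33", "grey70")

old_par <- par(mar = c(10, 5, 5, 5))   # extra bottom margin for long labels
barplot(d$Murder, names.arg = rownames(d), las = 2, border = NA,
        col = bar_cols, main = "Murder arrests per 100,000")
abline(h = avg, lwd = 2, lty = 2)
par(old_par)                           # restore the previous margins
```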

An interactive bar plot

We would like to make the bar plot interactive. The advantage of using the Google Chart API in R is the flexibility this provides in making interactive plots. The googleVis package allows us to skip the step to export a plot from R to an illustrator and we can make presentable plots right out of R.

The bar plot functionality in R comes with various options and it is not possible to demonstrate all of the options in this recipe. We will try to explore plot options that are specific to bar plot.

In this recipe, we will learn to plot returns data of Microsoft over a 2-year period. The data for the exercise was downloaded using Google Finance. We have calculated one-day returns for Microsoft in MS Excel and imported the resulting CSV file into R:

Return_t = ((Price_t - Price_{t-1}) / Price_{t-1}) * 100
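The same calculation can be done in R instead of a spreadsheet. The prices below are hypothetical and serve only to show the arithmetic.

```r
# Hypothetical closing prices on three consecutive days
price <- c(100, 110, 99)

# diff() gives Price_t - Price_{t-1}; head(price, -1) gives Price_{t-1}
returns <- diff(price) / head(price, -1) * 100

print(returns)
```

The result has one element fewer than the price vector, because the first day has no previous price to compare with.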

Readers should note that the following plot is only a part of the actual chart:

Getting ready

In order to plot a bar plot, we would install the googleVis package.

How to do it…

We would start the recipe by installing the googleVis package and loading the same in R:

install.packages("googleVis")
library(googleVis)

When R loads a library, it may print several startup messages; we can suppress these using suppressPackageStartupMessages(library(googleVis)). We can now import our data using the read.csv() function. The data file comprises three variables: date, daily Microsoft prices, and daily returns:

stock = read.csv("spq.csv", header = TRUE)

We generate an interactive bar plot using the gvisBarChart() function:

barpt = gvisBarChart(stock, xvar = "Date", yvar = c("Returns"),
  options = list(orientation = "horizontal", width = 1400,
    height = 500, title = "Microsoft returns over 2 year period",
    legend = "none",
    hAxis = "{title :'Time Period', titleTextStyle :{color:'red'}}",
    vAxis = "{title : 'Returns(%)', ticks : [-12,-6,0,6,12], titleTextStyle :{color: 'red'}}",
    bar = "{groupWidth: '100%'}"))

The plot is generated in a new browser window when we execute the plot() function:

plot(barpt)

How it works…

The googleVis package generates the plot in a new browser window and requires an Internet connection to render it. The package is a simple communication medium between the Google Charts API and R.

We have defined our gvisBarChart() function as barpt; we need this in R as all the googleVis functions will generate a list that contains the HTML code and reference to a JavaScript function. If you omit barpt =, R will display the code generated in your command window.

The first argument under the gvisBarChart() function is the data argument; we have stored the data in R as a data frame called stock. The second and third arguments are the column names of the data to be displayed on the x-axis and y-axis. The options argument is used to define a list of options. All the available options related to the bar plot and their descriptions are available on the googleVis developer website. Since we are plotting stock returns of Microsoft (only one series), we can avoid legends. The default plot will include legends—this can be overwritten using legend = "none".

To add a title to our plot, we use the title attribute. We label our axes using the vAxis and hAxis attributes. Note the use of { } and [ ] within vAxis and hAxis. We use { } to group all the elements related to hAxis and vAxis instead of typing vAxis.title or vAxis.titleTextStyle, and [ ] to pass a list of values such as the tick positions. Readers familiar with CSS, HTML, or JavaScript will find this code easy to interpret. We have used the groupWidth attribute of the bar option and set it to 100 percent in order to eliminate the spacing between bars. Finally, we call the plot() function to display our visualization.
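Since these options are plain character strings, they can also be assembled from R variables. The sketch below is one possible way to do this with sprintf() and paste(); the variable names are illustrative and the resulting list would be passed as the options argument.

```r
# Values that might change from chart to chart
axis_title <- "Returns(%)"
ticks <- c(-12, -6, 0, 6, 12)

# { } groups the settings of one axis, [ ] holds the list of tick values
v_axis <- sprintf("{title: '%s', ticks: [%s], titleTextStyle: {color: '%s'}}",
                  axis_title, paste(ticks, collapse = ","), "red")

# Options list in the shape expected by the googleVis chart functions
opts <- list(legend = "none", vAxis = v_axis,
             bar = "{groupWidth: '100%'}")

cat(v_axis, "\n")
```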

There's more…

In the previous recipe, we constructed a bar plot. The googleVis package also allows us to create a table and merge the same with a bar chart. The following screenshot is only part of the plot:

The combined table and bar chart are generated in three different steps. We will first generate a bar chart. The code is exactly the same as the one discussed under the How to do it… section of this recipe. We will then generate a table using the following lines of code:

table <- gvisTable(stock, options=list(page='enable',
                                      height='automatic',
                                      width='automatic'),
                               formats = list(Returns =' #.##'))

In the final step, we will merge the two chart objects (barpt and table) using the gvisMerge() function:

comb = gvisMerge(table,barpt, horizontal = TRUE)
plot(comb)

We can display the merged visualization comb using the plot() function.

The readers can learn more about exporting the visualization in the googleVis package manual. It is also possible to integrate the googleVis package plots with shiny.

A simple line plot

Line plots are simply lines connecting all the x and y points. They are very easy to interpret and are widely used to display an upward or downward trend in data. In this recipe, we will use the googleVis package to create an interactive line plot in R. We will learn how we can emphasize certain variables in our data. The following line plot shows fertility rate:

Getting ready

We will use the googleVis package to plot the line plot.

How to do it…

In order to construct a line chart, we will install and load the googleVis package in R. We would also import the fertility data using the read.csv() function:

install.packages("googleVis")
library(googleVis)
frt = read.csv("fertility.csv", header = TRUE, sep =",")

The fertility data is downloaded from the OECD website. We can construct our line object using the gvisLineChart() function:

# Series are indexed from 0 in the order listed in yvar,
# so OECD34 (the sixth column plotted) is series 5 here
line = gvisLineChart(frt, xvar = "Year",
  yvar = c("Australia","Austria","Belgium","Canada","Chile","OECD34"),
  options = list(width = 1100, height = 500,
    backgroundColor = "#FFFF99", title = "Fertility Rate in OECD countries",
    vAxis = "{title : 'Total Fertility Rate', gridlines:{color:'#DEDECE',count : 4}, ticks : [0,1,2,3,4]}",
    series = "{0:{color:'black', visibleInLegend :false},
       1:{color:'#BDBD9D', visibleInLegend :false},
       2:{color:'#BDBD9D', visibleInLegend :false},
       3:{color:'#BDBD9D', visibleInLegend :false},
       4:{color:'#BDBD9D', visibleInLegend :false},
       5:{color:'#3333FF', visibleInLegend :true}}"))

We can construct the visualization using the plot() function in R:

plot(line)

How it works…

The first three arguments of the gvisLineChart() function are the data and the name of columns to be plotted on the x-axis and y-axis. The options argument lists the chart API options to add and modify elements of a chart.

For the purpose of this recipe, we will use part of the dataset. Hence, while we assign the series to be plotted under yvar = c(), we will specify the column names that we would like to be plotted in our chart. Note that the series starts at 0, and hence Australia, which is the first column, is in fact series 0 and not 1.

For the purpose of this exercise, let us assume that we would like to highlight the mean fertility rate among all OECD economies for our audience. We can achieve this using the series option under options = list(). The series option allows us to customize a specific series in our dataset. Under the gvisLineChart() function, we instruct the Google API to draw the OECD series and Australia (series 0) in colors different from the other series and to show a legend entry only for OECD. Series indices follow the order of the columns listed in yvar, so OECD34 is series 34 only when all 35 columns are plotted.
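Because series numbering is zero-based and depends on which columns are plotted, it can be safer to compute the index than to type it. The helper series_index() below is a hypothetical convenience function, not part of googleVis.

```r
# Columns plotted on the y-axis, in order
yvar <- c("Australia", "Austria", "Belgium", "Canada", "Chile", "OECD34")

# Hypothetical helper: zero-based position of a column among the plotted series
series_index <- function(name, columns) match(name, columns) - 1L

aus  <- series_index("Australia", yvar)
oecd <- series_index("OECD34", yvar)

# Build the series option from the computed indices
series <- sprintf(
  "{%d:{color:'black'}, %d:{color:'#3333FF', visibleInLegend: true}}",
  aus, oecd)

cat(series, "\n")
```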

It would be best to display all the legends but we use this to show the flexibility that comes with Google Chart API. Finally, we can use the plot() function to plot the chart in a browser. The following screenshot displays a part of the data. The dim() function gives us a general idea about the dimensions of the fertility data:

New York Times visualizations often combine line plots with bar charts and pie charts. Readers should try constructing such a visualization. We can use the gvisMerge() function to merge plots. The function merges just two plots at a time, and hence readers would have to nest multiple gvisMerge() calls to create a similar visualization. The same can also be constructed in base R, but we will lose the interactive element.
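Nesting gvisMerge() calls could look like the following sketch. The data frame is hypothetical, a table is used in place of a pie chart to keep the example short, and the charts are only built when googleVis is installed.

```r
# Hypothetical data: one x column and two series
toy <- data.frame(Year = c("2010", "2011", "2012"),
                  A = c(1.9, 1.8, 1.7),
                  B = c(2.1, 2.0, 2.0))

if (requireNamespace("googleVis", quietly = TRUE)) {
  line_chart <- googleVis::gvisLineChart(toy, xvar = "Year", yvar = c("A", "B"))
  bar_chart  <- googleVis::gvisBarChart(toy, xvar = "Year", yvar = c("A", "B"))
  tbl        <- googleVis::gvisTable(toy)

  # gvisMerge() joins two objects at a time, so the calls are nested
  top  <- googleVis::gvisMerge(line_chart, bar_chart, horizontal = TRUE)
  page <- googleVis::gvisMerge(top, tbl, horizontal = FALSE)

  if (interactive()) plot(page)   # opens the merged page in a browser
}
```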

Line plot to tell an effective story

In the previous recipe, we learned how to plot a very basic line plot and use some of the options. In this recipe, we will go a step further and make use of specific visual cues such as color and line width for easy interpretation.

Line charts are a great tool to visualize time series data. The fertility data is discrete, but connecting the points over time gives our audience a sense of direction. The visualization shows the amazing progress countries such as Mexico and Turkey have achieved in reducing their fertility rates.

OECD defines fertility rate as "Refers to the number of children that would be born per woman, assuming no female mortality at child-bearing ages and the age-specific fertility rates of a specified country and reference period".

Line plots have been widely used by the New York Times to create very interesting infographics. This recipe is inspired by one of the New York Times visualizations. It is important to understand that many of the infographics created by professionals are built using D3.js or Processing. We will not go into the details of these tools, but it is good to know how they work and how they are used to create visualizations.

Getting ready

We need to install and load the googleVis package to construct a line chart.

How to do it…

To generate an interactive plot, we will load the fertility data in R using the read.csv() function. To generate a line chart that plots the entire dataset, we will use the gvisLineChart() function:

# Plot every country column; series are numbered from 0 in this order
line = gvisLineChart(frt, xvar = "Year",
  yvar = c("Australia", "Austria", "Belgium", "Canada", "Chile",
    "Czech.Republic", "Denmark", "Estonia", "Finland", "France",
    "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Israel",
    "Italy", "Japan", "Korea", "Luxembourg", "Mexico", "Netherlands",
    "New.Zealand", "Norway", "Poland", "Portugal", "Slovakia",
    "Slovenia", "Spain", "Sweden", "Switzerland", "Turkey",
    "United.Kingdom", "United.States", "OECD34"),
  options = list(width = 1200, backgroundColor = "#ADAD85",
    title = "Fertility Rate in OECD countries",
    vAxis = "{gridlines:{color:'#DEDECE',count : 3}, ticks : [0,1,2,3,4]}",
    # 20 = Mexico, 31 = Turkey, 34 = OECD34
    series = "{0:{color:'BDBD9D', visibleInLegend :false},
      20:{color:'009933', visibleInLegend :true},
      31:{color:'996600', visibleInLegend :true},
      34:{color:'3333FF', visibleInLegend :true}}"))

To display our visualization in a browser, we use the generic R plot() function:

plot(line)

How it works…

The arguments passed to the gvisLineChart() function in the previous section are exactly the same as those discussed under the simple line plot, with some minor changes. We would like to plot the entire dataset for this exercise, and hence we have to state all the column names in yvar = c().

Also, we would like to color all the series with the same color but highlight Mexico, Turkey, and the OECD average. We have achieved this in the previous code using series, where we customize the color and legend visibility of specific series.

In this particular plot, we have used the same color for all the economies but have highlighted Mexico and Turkey to signify the development and growth that took place in the 5-year period. It is also effective to let our audience compare Mexico and Turkey with the OECD average, which provides a benchmark.

If we plot all the legends, the plot becomes crowded, and 34 legend entries do not make a very attractive plot. We avoid this by making only specific legend entries visible, which makes our plot tidier.
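Writing the series string by hand becomes tedious when every series needs an entry. The following is a minimal sketch of building it from the column names instead; the helper name build_series, the shortened yvar vector, and the use of a leading # in the color codes are assumptions made for illustration, not part of the recipe:

```r
# Column names in the order passed to yvar (shortened for illustration).
yvar <- c("Australia", "Mexico", "Turkey", "OECD34")

# Highlight colours keyed by column name; every other series gets base_col.
highlight <- c(Mexico = "#009933", Turkey = "#996600", OECD34 = "#3333FF")

build_series <- function(yvar, highlight, base_col = "#BDBD9D") {
  idx   <- seq_along(yvar) - 1L            # chart series are numbered from 0
  is_hl <- yvar %in% names(highlight)      # TRUE for highlighted columns
  cols  <- ifelse(is_hl, highlight[yvar], base_col)
  entries <- sprintf("%d:{color:'%s', visibleInLegend:%s}",
                     idx, cols, ifelse(is_hl, "true", "false"))
  paste0("{", paste(entries, collapse = ", "), "}")
}

series_opt <- build_series(yvar, highlight)
cat(series_opt, "\n")
```

The resulting string could then be passed as options = list(series = series_opt), so that only the highlighted series appear in the legend.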

Generating an interactive Gantt/timeline chart in R

According to Wikipedia, Gantt charts "illustrate the start and finish dates of the terminal elements and summary elements of a project". These charts are used to track the progress of a project against time. The first known chart of this type was developed by Karol Adamiecki in 1896.

Even though the most important application of Gantt charts is in project management, they have been applied in visualization to represent the following:

  • Historical eras of artists
  • Periods during which baseball players are disabled
  • Vanishing Wall Street firms

Getting ready

To generate a timeline plot, we need to install and load the googleVis package in R.

How to do it…

We import our data in R using the read.csv() function. The gvisTimeline() function requires the dates to be in a date format in R. Hence, we use the as.POSIXct() and as.character() functions to define a new data frame:

base = read.csv("disable.csv")
data = data.frame(position = as.character(base$position), player = 
  as.character(base$player), start = as.POSIXct(base$start), end = 
    as.POSIXct(base$end))
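To see what the conversion does without the CSV file, here is a small sketch on a made-up row. The values, the assumption that dates are stored as year-month-day strings, and the explicit tz = "UTC" are all illustrative choices, not details taken from disable.csv:

```r
# One made-up row in the same layout as the data frame built above.
raw <- data.frame(
  position = "Pitcher",
  player   = "Player A",
  start    = "2013-04-01",
  end      = "2013-04-16",
  stringsAsFactors = FALSE
)

# Convert the date strings to POSIXct; tz = "UTC" is set here only to
# make the result independent of the local time zone.
tl <- data.frame(
  position = as.character(raw$position),
  player   = as.character(raw$player),
  start    = as.POSIXct(raw$start, tz = "UTC"),
  end      = as.POSIXct(raw$end, tz = "UTC")
)

str(tl$start)

# Once the columns are date-times, durations can be computed directly.
days_out <- as.numeric(difftime(tl$end, tl$start, units = "days"))
print(days_out)
```

If the columns were left as plain strings, date arithmetic such as difftime() would not be possible without converting them first.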

The data was collected by me and is available online. We use the gvisTimeline() function to generate an object, which is displayed using the plot() function:

baseball = gvisTimeline(data = data, rowlabel = "position",
  start = "start", end = "end", barlabel = "player",
  options = list(width = 1000, height = 900,
    timeline = "{singleColor :'#002A3E'}"))
plot(baseball)

Many of the arguments used in the gvisTimeline() function are self-explanatory. The start and end arguments refer to the start and end dates, respectively; in the case of the baseball data, they mark the period for which a player is on the disabled list. We have passed the singleColor option under options to draw all the bars in the same color.

Merging histograms

Histograms help in studying the underlying distribution of data. They are even more useful when we compare more than one histogram side by side; this gives us greater insight into the skewness and the overall distribution.

In this recipe, we will study how to plot a histogram using the googleVis package and how we merge more than one histogram on the same page. We will only merge two plots but we can merge more plots and try to adjust the width of each plot. This makes it easier to compare all the plots on the same page. The following plot shows two merged histograms:

[Figure: two merged histograms]

How to do it…

In order to generate a histogram, we will install the googleVis package and load it in R:

install.packages("googleVis")
library(googleVis)

We have downloaded the prices of three different stocks and calculated their daily returns over the entire period. We can load the data in R using the read.csv() function. Our main aim in this recipe is to plot two different histograms side by side in a browser. Hence, we need to divide our data into three different data frames, one per stock. For the purpose of this recipe, we will plot the aapl and msft data frames:

stk = read.csv("stock_cor.csv", header = TRUE, sep = ",")
aapl = data.frame(stk$AAPL)
msft = data.frame(stk$MSFT)
googl = data.frame(stk$GOOGL)
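The recipe starts from returns that have already been calculated. As a hedged sketch of that earlier step, the following computes simple daily returns in percent from a made-up price series; the prices, the helper name daily_return, and the choice of simple percentage returns are assumptions, since the article does not state how the returns were derived:

```r
# Hypothetical closing prices for one stock on consecutive trading days.
price <- c(100, 102, 99.96, 104.958)

# Simple daily return in percent: (today - yesterday) / yesterday * 100.
daily_return <- function(p) diff(p) / head(p, -1) * 100

ret <- daily_return(price)
print(round(ret, 2))

# Wrap the returns in a one-column data frame, the shape gvisHistogram() expects.
ret_df <- data.frame(ret)
str(ret_df)
```

A series of n prices yields n - 1 returns, because the first day has no previous price to compare against.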

To generate the histograms, we implement the gvisHistogram() function:

al = gvisHistogram(aapl, options = list(
  histogram = "{bucketSize :1}", legend = "none",
  title = 'Distribution of AAPL Returns', width = 500,
  hAxis = "{showTextEvery: 5,title: 'Returns'}",
  vAxis = "{gridlines : {count:4}, title : 'Frequency'}"))
mft = gvisHistogram(msft, options = list(
  histogram = "{bucketSize :1}", legend = "none",
  title = 'Distribution of MSFT Returns', width = 500,
  hAxis = "{showTextEvery: 5,title: 'Returns'}",
  vAxis = "{gridlines : {count:4}, title : 'Frequency'}"))

We combine the two gvis objects in one browser using the gvisMerge() function:

mrg = gvisMerge(al,mft, horizontal = TRUE)
plot(mrg)

How it works…

The data.frame() function is used to construct a data frame in R. We perform this step because we do not want to plot all three histograms on the same plot. Note the use of the $ notation in the data.frame() function to select a single column.

The first argument in the gvisHistogram() function is our data stored as a data frame. We can display the individual histograms using plot(al) and plot(mft), but in this recipe, we will plot only the final merged output.

We observe that most of the options of the histogram function are the same as discussed in previous recipes. The histogram functionality uses an algorithm to create buckets, but we can control this using bucketSize, as in histogram = "{bucketSize :1}".

Try using different bucket sizes and observe how the buckets in the histograms change. More options related to histograms can also be found in the following link under the Controlling Buckets section:

https://developers.google.com/chart/interactive/docs/gallery/histogram#Buckets.
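To get a feel for what a bucket size means before opening the browser, the counts can be approximated in base R. This is only a rough sketch with made-up returns and a hypothetical helper, bucket_counts; the exact bucket boundaries chosen by the Google Chart API may differ:

```r
# Made-up daily returns in percent (illustrative only).
returns <- c(-2.4, -0.7, -0.2, 0.1, 0.4, 0.9, 1.3, 2.8)

# Count observations per bucket of a given width, using hist() without plotting.
bucket_counts <- function(x, size) {
  lower  <- floor(min(x) / size) * size
  upper  <- ceiling(max(x) / size) * size
  breaks <- seq(lower, upper, by = size)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

counts_1 <- bucket_counts(returns, 1)  # narrow buckets
counts_2 <- bucket_counts(returns, 2)  # wider buckets, fewer bars
print(counts_1)
print(counts_2)
```

Doubling the bucket size roughly halves the number of bars while the total count stays the same, which is the trade-off to keep in mind when changing bucketSize.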

We have also used showTextEvery. This option controls how many horizontal axis labels are shown; with a value of 5, every fifth label is displayed, which makes the histogram more compact. Our main objective is to observe the distribution, and the plot serves our purpose. Finally, we make use of plot() to display the chart in our favorite browser.

We follow the same steps to plot the return distribution of Microsoft (MSFT). Now, we would like to place both plots side by side and view the differences in the distributions. We use the gvisMerge() function to do so.

In our recipe, we have two plots, for AAPL and MSFT. By default, gvisMerge() stacks the charts vertically, but we can specify horizontal = TRUE to place them side by side.

Making an interactive bubble plot

My first encounter with a bubble plot was while watching a TED video by Hans Rosling. The video led me to search for ways of creating bubble plots in R; a very good introduction is available on the Flowing Data website. The advantage of a bubble plot is that it allows us to visualize a third variable, which in our case is the size of the bubble.

In this recipe, I have made use of the googleVis package to plot a bubble plot, but you can also achieve this in base R. The advantage of the Google Chart API is its interactivity and the ease with which charts can be attached to a web page. Also note that we could use squares instead of circles, but this is not implemented in the Google Chart API yet.

In order to implement a bubble plot, I have downloaded the crime dataset by state. The details regarding the link and definition of crime data are available in the crime.txt file and are shown in the following screenshot:

[Screenshot: contents of the crime.txt file]

How to do it…

As with all the plots in this article, we will install and load the googleVis package. We will also import our data file in R using the read.csv() function:

crm = read.csv("crimeusa.csv", header = TRUE, sep =",")

We can construct our bubble chart using the gvisBubbleChart() function in R:

bub1 = gvisBubbleChart(crm, idvar = "States", xvar = "Robbery",
  yvar = "Burglary", sizevar = "Population", colorvar = "Year",
  options = list(legend = "none", width = 900, height = 600,
    title = " Crime per State in 2012",
    sizeAxis = "{maxSize : 40, minSize :0.5}",
    vAxis = "{title : 'Burglary'}", hAxis = "{title : 'Robbery'}"))

bub2 = gvisBubbleChart(crm, idvar = "States", xvar = "Robbery",
  yvar = "Burglary", sizevar = "Population",
  options = list(legend = "none", width = 900, height = 600,
    title = " Crime per State in 2012",
    sizeAxis = "{maxSize : 40, minSize :0.5}",
    vAxis = "{title : 'Burglary'}", hAxis = "{title : 'Robbery'}"))

The bub2 object does not size the bubbles but shades them, and the scale of shading is automatically displayed at the top of the chart. To view the visualization, readers can type plot(bub2) in the R console. To view both bubble plots side by side, readers can use the gvisMerge() function in R:

bub3 = gvisMerge(bub1,bub2, horizontal = TRUE)
plot(bub3)

How it works…

The gvisBubbleChart() function uses six arguments to create a bubble chart, which are as follows:

  • data: This is the data defined as a data frame, in our example, crm
  • idvar: This is the column that is used to assign IDs to the bubbles, in our example, States
  • xvar: This is the column in the data to plot on x-axis, in our example, Robbery
  • yvar: This is the column in the data to plot on y-axis, in our example, Burglary
  • sizevar: This is the column used to define the size of the bubble, in our example, Population
  • colorvar: This is the column used to define the color, in our example, Year

We can define the minimum and maximum sizes of the bubbles using minSize and maxSize, respectively, under sizeAxis in options. Note that we have used gvisMerge() to portray the differences between the two bubble plots. In the plot on the right, we have not made use of colorvar, and hence all the bubbles are of the same size.
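Each of these arguments simply names a column, so a misspelled column name is a common source of errors. The sketch below uses a tiny data frame of made-up numbers (not the real crime data) and checks the column names before building the chart; the object names crm_demo, bubble_args, and bub_demo are hypothetical, and the chart is built only if googleVis is installed:

```r
# Made-up numbers, for illustration only (not the real crime data).
crm_demo <- data.frame(
  States     = c("State A", "State B", "State C"),
  Robbery    = c(120, 80, 45),
  Burglary   = c(650, 540, 300),
  Population = c(8.5, 4.2, 1.1),
  Year       = c(2012, 2012, 2012)
)

# Each argument names one column of the data frame.
bubble_args <- list(idvar = "States", xvar = "Robbery", yvar = "Burglary",
                    sizevar = "Population", colorvar = "Year")

# Check that every referenced column exists before building the chart.
missing_cols <- setdiff(unlist(bubble_args), names(crm_demo))
stopifnot(length(missing_cols) == 0)

if (requireNamespace("googleVis", quietly = TRUE)) {
  bub_demo <- do.call(googleVis::gvisBubbleChart,
                      c(list(data = crm_demo), bubble_args))
  # plot(bub_demo)  # uncomment to open the chart in a browser
}
```

Dropping colorvar from bubble_args reproduces the setup of bub2 in the recipe.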

There's more…

The Google Chart API makes it easy for us to plot a bubble chart, but the same can be achieved using the base R plotting functions. We can make use of the symbols() function to create the plot. The symbols need not be circles; they can be squares as well. By this time, you should have watched Hans Rosling's TED lecture and may be wondering how you could create a motion chart with bubbles floating around. The Google Charts API has the ability to create motion charts, and readers can use the googleVis reference manual to learn about this.
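A static bubble plot in base R might look like the following sketch. The numbers are made up, and scaling the radius by the square root of the population is a design choice of this example (so that bubble area tracks the value) rather than something the article prescribes:

```r
# Made-up values for three states (illustrative only).
robbery    <- c(120, 80, 45)
burglary   <- c(650, 540, 300)
population <- c(8.5, 4.2, 1.1)

# Scale the radius by the square root so that bubble AREA, not radius,
# is proportional to population.
radius <- sqrt(population / pi)

# inches = 0.4 sets the size of the largest bubble on the page.
symbols(robbery, burglary, circles = radius, inches = 0.4,
        fg = "white", bg = "steelblue",
        xlab = "Robbery", ylab = "Burglary",
        main = "Bubble plot with base R")
text(robbery, burglary, labels = c("A", "B", "C"), cex = 0.8)

# Squares instead of circles: pass side lengths via the squares argument.
# symbols(robbery, burglary, squares = 2 * radius, inches = 0.4)
```

Unlike the googleVis version, this plot is drawn on the regular R graphics device and has no tooltips or other interactive elements.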

Constructing a waterfall plot in R

Waterfall plots, or staircase plots, are observed mostly in financial reports. I have not come across nonfinancial applications of these plots. The first and last columns of the plot are usually total columns, and the floating columns indicate the incremental changes. In this recipe, we generate some fake monthly sales figures.

Getting ready

In order to generate a waterfall plot, we need to install and load the plotrix package in R.

How to do it…

The data is imported in R using the read.csv() function. If we view the data, it begins with the total of last year's sales, followed by gains and losses in sales, which are indicated by positive and negative values. The last row represents the current year's sales, which is the sum of all the preceding rows:

sales = read.csv("waterf.csv")

The waterfall plot is constructed in R using the staircase.plot() function:

staircase.plot(sales$value, totals= sales$logic, labels = 
  sales$labels, total.col = c("lightgreen"),inc.col = c("blue",
  "red","red","blue","blue","blue","red","red","blue","blue",
      "red","blue"),main ="Waterfall Plot showing financial data")

The first argument in the staircase.plot() function refers to the heights of the columns. The second argument is a logical vector in which TRUE marks the total columns in our data and FALSE marks the incremental changes. The labels argument is used to apply labels to our plot. The total.col argument is used to color the total columns (in our case, the first and last columns).

We specify the colors for the incremental change columns under the inc.col argument, one color per incremental column. We have chosen the colors so that negative changes are drawn as red columns and positive changes as blue columns.
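Rather than typing the colors by hand, they can be derived from the sign of each incremental value. The following is a minimal sketch on a made-up data frame that mirrors the value, logic, and labels columns used in the recipe; the numbers and the name sales_demo are hypothetical:

```r
# Made-up sales data: two totals with three incremental changes in between.
sales_demo <- data.frame(
  value  = c(100, 20, -30, 10, 100),
  logic  = c(TRUE, FALSE, FALSE, FALSE, TRUE),   # TRUE marks a total row
  labels = c("Last year", "Jan", "Feb", "Mar", "This year"),
  stringsAsFactors = FALSE
)

# One colour per incremental row: red for a loss, blue for a gain.
increments <- sales_demo$value[!sales_demo$logic]
inc_col    <- ifelse(increments < 0, "red", "blue")
print(inc_col)

# Sanity check: the opening total plus all changes gives the closing total.
closing <- sales_demo$value[1] + sum(increments)
print(closing)

# With plotrix loaded, the colours could then be passed on like this:
# staircase.plot(sales_demo$value, totals = sales_demo$logic,
#                labels = sales_demo$labels, total.col = "lightgreen",
#                inc.col = inc_col, main = "Waterfall plot (made-up data)")
```

This keeps the colors in step with the data if a month's figure changes sign.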
