This chapter will cover basic concepts of ggplot2 and the Grammar of Graphics, using illustrative examples. You will generate solutions to problems of increasing complexity throughout the book. Lastly, you will master advanced plotting techniques, which will enable you to add more detail and increase the quality of your graphics.
In order to use ggplot2, you will first need to install R and RStudio. R is a programming language that is widely used for advanced modeling, statistical computing, and graphic production. R is the underlying language, while RStudio is an integrated development environment (IDE) built around R. Visualization is a very important aspect of data analysis, and it has its own underlying grammar (similar to the English language). So, before we go further, let's discuss visualization in more detail.
By the end of this chapter, you will be able to:
ggplot2 is a visualization package in R. It was developed in 2005 and it uses the concept of the Grammar of Graphics to build a plot in layers and scales. The Grammar of Graphics is the syntax used for the different components (aesthetics) of a geometric object, together with the grammatical rules for creating a visualization.
ggplot2 has grown in popularity over the years. It's a very powerful package, and its impressive scope has been enabled by the underlying grammar, which gives the user a very fine level of control, making it perfect for a range of scenarios. Another great feature of ggplot2 is that it is programmatic; hence, its visuals are reproducible. The ggplot2 package is open source, and its use is rapidly growing across various industries. Its visuals are flexible, professional, and can be created very quickly.
Read more about the top companies using R at https://www.listendata.com/2016/12/companies-using-r.html. You can find out more about the role of a data scientist at https://www.innoarchitech.com/what-is-data-science-does-data-scientist-do/.
Other visualization packages exist, such as matplotlib (in Python) and Tableau. The matplotlib and ggplot2 packages are equally popular, and they have similar features. Both are open source and widely used. Which one you use may be a matter of preference. However, although both are programmatic and easy to use, since R was built with statisticians in mind, ggplot2 is considered to have more powerful graphics. More discussion on this topic can be found later in the chapter. Tableau is also very powerful, but it is limited in terms of statistical summaries and advanced data analytics. Tableau is not programmatic, and it is more memory intensive because it is completely interactive.
Excel has also been used for data analysis in the past, but it is not useful for processing the large amounts of data encountered in modern technology. It is interactive and not programmatic; hence, charts and graphs have to be made with interactivity and need to be updated every time more data is added. Packages such as ggplot2 are more powerful in that once the code is written, ggplot is independent of increases in the data, as long as the data structure is maintained. Also, ggplot2 provides a greater number of advanced plots that are not available in Excel.
Read more about Excel versus R at https://www.jessesadler.com/post/excel-vs-r/. Read more about matplotlib versus R at http://pbpython.com/visualization-tools-1.html. Read more about matplotlib versus ggplot at https://shiring.github.io/r_vs_python/2017/01/22/R_vs_Py_post.html.
Our first task is to load a dataset. To do so, we need to load certain packages in RStudio. Take a look at the screenshot of a typical RStudio layout, as follows:
In this section, we'll load and explore a dataset using R functions. Before starting with the implementation, check your R version by typing version in the console and reviewing the details, as follows:
Let's begin by following these steps:
1. Install the required packages:
install.packages("ggplot2")
install.packages("tibble")
install.packages("dplyr")
install.packages("Lock5Data")
2. Check the current working directory with the getwd() command:
[1] "C:/Users/admin/Documents/GitHub/Applied-DataVisualization-with-ggplot2-and-R"
3. Change the working directory to Chapter 1 by using the following command:
setwd("C:/Users/admin/Documents/GitHub/Applied-DataVisualization-with-ggplot2-and-R/Lesson1")
4. Open the template_Lesson1.R file, which has the necessary libraries.
5. Load the humidity data:
df_hum <- read.csv("data/historical-hourly-weather-data/humidity.csv")
When we used read.csv, a structure called a data frame was created in R; most R users will already be familiar with it. Let's type some commands to get an overall impression of our data.
Let's retrieve some parameters of the dataset (such as the number of rows and columns) and display the different variables and their data types.
The following libraries have now been loaded:
require("ggplot2")
require("tibble")
require("dplyr")
Reference: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html.
require("Lock5Data")
Reference: https://cran.r-project.org/web/packages/Lock5Data/Lock5Data.pdf.
Use the following commands to determine the data frame details:
# Display the column names
colnames(df_hum)
Take a look at the output screenshot, as shown here:
Use the following command:
# Number of rows and columns
dim(df_hum)
A summary of the data frame can be seen with the following code:
str(df_hum)
Take a look at the output screenshot, as shown here:
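The inspection commands above can be tried without the CSV file. The following is a minimal sketch that uses a small hand-made data frame as a stand-in for the humidity data; the name df_demo and all of its values are invented for illustration, and only colnames, dim, and str come from the steps above.

```r
# A tiny stand-in for the humidity data; the values are invented for illustration.
df_demo <- data.frame(
  datetime  = c("2012-10-01 13:00:00", "2012-10-01 14:00:00", "2012-10-01 15:00:00"),
  Vancouver = c(76, 76, 75),
  Seattle   = c(81, 80, NA),
  stringsAsFactors = FALSE
)

# Display the column names
colnames(df_demo)

# Number of rows and columns (dim returns rows first, then columns)
dim(df_demo)

# Compact summary: one line per column, with its type and first values
str(df_demo)
```

Note that dim reports rows before columns, which is easy to mix up when a dataset has many of both.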
ggplot2 is based on two main concepts: geometric objects and the Grammar of Graphics. The geometric objects in ggplot2 are the different visual structures that are used to visualize data. We will be going over them one by one. The Grammar of Graphics is the syntax that we use for the different aesthetics of a graph, such as the coordinate scale, the fonts, the color themes, and so on. ggplot2 uses a layered Grammar of Graphics concept, which allows us to build a plot in layers. We will work on some aspects of the Grammar of Graphics in this chapter, and will go into further details in the next chapter.
Variables can be of different types and, sometimes, different software uses different names for the same variables. So, let's get familiar with the different kinds of variables that we will work with:
You can read more about variables at http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+what+are+variables.
The following table lists variables and the names that R uses for them; make sure to familiarize yourself with both nomenclatures.
In this section, we will use the built-in datasets to investigate the relationships between continuous variables, such as temperature and airquality. We'll explore and understand the datasets available in R.
Let's begin by executing the following steps:
1. Type data() in the command line to list the datasets available in R. You should see something like the following:
2. Choose a few of the datasets, such as mtcars, airquality, rock, and sleep.
3. Examine the variables in each dataset with the str command (for example, str(airquality)).
Take a look at the following output screenshot:
The outcome should be a completed table, similar to the following:
More details about variables can be found at http://www.statisticshowto.com/types-variables/.
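As a rough illustration of how R labels variable types, the sketch below inspects the built-in airquality dataset. Treating Month as a categorical variable is an assumption made here for demonstration; the copy named aq is introduced only for this example.

```r
# Class of every column in the built-in airquality dataset
sapply(airquality, class)

# Month is stored as an integer, but it behaves like a category.
# Converting it to a factor (on a copy) makes that explicit.
aq <- airquality
aq$Month <- factor(aq$Month)

class(aq$Wind)    # a continuous (numeric) variable
class(aq$Month)   # a categorical variable (factor)
levels(aq$Month)  # the distinct categories
```

Whether a numeric column is best treated as continuous or categorical depends on the question being asked, so the conversion is a modeling choice rather than a fixed rule.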
Suppose that we want to visualize some of the variables in the built-in datasets. A dataset can contain different kinds of variables, as discussed previously. Here, the climate data includes numerical data, such as the temperature, and categorical data, such as hot or cold. In order to visualize and correlate different kinds of data, we need to understand the nomenclature of the dataset. We'll load a data file and understand the structure of the dataset and its variables by using qplot and the base R graphics package. Let's begin by executing the following steps:
Plot a histogram of the temperature variable from the airquality dataset, with hist(airquality$Temp).
Take a look at the following output screenshot:
The first plot was made in the built-in graphics package in R, while the second one was made using qplot, which is a plotting command in ggplot2. We can see that the two plots look very different. Both plots are histograms of the temperature.
We will discuss geometric objects later in this chapter, in order to understand the different types of histograms.
The built-in graphics package in R does not have a lot of features, so ggplot2 has become the package of choice. For the next exercises, we will continue to investigate making plots using ggplot2.
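The comparison between base graphics and ggplot2 can be sketched as follows. This version uses ggplot with geom_histogram rather than qplot, since recent ggplot2 releases mark qplot as deprecated; the object name p_temp and the choice of 30 bins are assumptions for illustration.

```r
library(ggplot2)

# Base R graphics: the plot is drawn immediately
hist(airquality$Temp)

# ggplot2: build a plot object from data, aesthetics, and a geom
p_temp <- ggplot(airquality, aes(x = Temp)) +
  geom_histogram(bins = 30)

# Printing the object renders it
print(p_temp)
```

Because the ggplot2 version is an object, it can be stored, modified by adding layers, and printed again later.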
In your mathematics class, you likely studied geometry, examining different shapes and the characteristics of those shapes, such as area, perimeter, and other factors. The geometric objects in ggplot2 are visual structures that are used to visualize data. They can be lines, bars, points, and so on.
Geometric objects are constructed from datasets. Before we construct some geometric objects, let's examine some datasets to understand the different kinds of variables.
We all love to talk about the weather. So, let's work with some weather-related datasets. The datasets contain approximately five years' worth of high-temporal resolution (hourly measurements) data for various weather attributes, such as temperature, humidity, air pressure, and so on. We'll analyze and compare the humidity and weather datasets.
Read more about weather datasets at: https://www.kaggle.com/selfishgene/historical-hourly-weather-data.
Let's begin by implementing the following steps:
df_hum <- read.csv("data/historical-hourly-weather-data/humidity.csv")
df_desc <- read.csv("data/historical-hourly-weather-data/weather_description.csv")
Examine both data frames with the str command.
The outcome will be the humidity levels of different cities, as follows:
The weather descriptions of different cities are shown as follows:
The different geometric objects that we will be working with in this chapter are as follows:
One-dimensional objects are used to understand and visualize the characteristics of a single variable, as follows:
Two-dimensional objects are used to visualize the relationship between two variables, as follows:
Although geometric objects are also used in base R, they don't follow the structure of the Grammar of Graphics and have different naming conventions, as compared to ggplot2. This is an important distinction, which we will look at in detail later.
Histograms are used to group and represent numerical (continuous) variables. For example, you may want to know the distribution of voters' ages in an election. A histogram is often confused with a bar chart; however, a bar chart is more general, and we will cover those later. In a histogram, a continuous variable is grouped into bins of specific sizes and the bins have a range that covers the maximum and minimum of the variable in question.
Histograms can be classified as follows:
Let's take a look at another image:
It's important to study the shapes of distributions, as they can reveal a lot about the nature of data. We will see some real-world examples of histograms in the datasets that we will explore.
To learn more about bar charts and histograms, visit https://www2.le.ac.uk/offices/ld/resources/numericaldata/histograms. You can read more about the shapes of histograms at https://www.moresteam.com/toolbox/histogram.cfm and https://www.siyavula.com/read/maths/grade-11/statistics/11-statistics-05. Find out more about normal distributions at http://onlinestatbook.com/2/normal_distribution/history_normal.html. You will find more real-world examples at https://stats.stackexchange.com/questions/33776/real-life-examples-of-common-distributions.
We discussed the different kinds of geometric objects that we will be working on, and we created our first plot using two different methods (qplot and hist). Now, let's use another command: ggplot. We will use the humidity data that we loaded previously.
As seen in the preceding section, we can create a default histogram by using one of the commands in the base R package: hist. This is seen in the following command:
hist(df_hum$Vancouver)
The default histogram that will be created is as follows:
In this section, we want to visualize the humidity distribution for the city of Vancouver. We'll create a histogram for humidity data using qplot and ggplot.
Let's begin by implementing the following steps:
1. Create the plot with qplot:
qplot(df_hum$Vancouver)
2. Now use ggplot:
ggplot(df_hum,aes(x=Vancouver))
On its own, this command draws only an empty set of axes; ggplot2 also requires the geometric object that we wish to make. To make a histogram, we have to specify the geom type (in other words, a histogram). aes stands for aesthetics, or the quantities that get plotted on the x- and y-axes, and their qualities. We will work on changing the aesthetics later, in order to visualize the plot more effectively.
Notice that there are some warning messages, as follows:
'stat_bin()' using 'bins = 30'. Pick better value with 'binwidth'.
Warning message:
Removed 1826 rows containing non-finite values (stat_bin).
You can ignore these messages; ggplot automatically detects and removes null or NA values.
ggplot (df_hum, aes(x=Vancouver)) + geom_histogram()
You'll see the following output:
Here's the output code:
require("ggplot2")
require("tibble")
#Load a data file - Read the Humidity Data
df_hum <- read.csv("data/historical-hourly-weather-data/humidity.csv")
#Display the summary
str(df_hum)
qplot(df_hum$Vancouver)
ggplot(df_hum, aes(x=Vancouver)) + geom_histogram()
Refer to the complete code at https://goo.gl/tu7t4y. In order for ggplot to work, you will need to specify the geometric object. Note that the column name should not be enclosed in quotes.
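The warning about removed non-finite values can be explored with a small sketch. It uses the Ozone column of the built-in airquality dataset, which contains missing values, in place of the humidity data; the object names and the choice of 20 bins are assumptions for illustration.

```r
library(ggplot2)

# Ozone in the built-in airquality data has missing values
n_missing <- sum(is.na(airquality$Ozone))
n_missing

# Option 1: let ggplot2 drop the missing values silently with na.rm = TRUE
p1 <- ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(bins = 20, na.rm = TRUE)

# Option 2: filter explicitly, so the plotted rows are known in advance
aq_complete <- airquality[!is.na(airquality$Ozone), ]
p2 <- ggplot(aq_complete, aes(x = Ozone)) +
  geom_histogram(bins = 20)

print(p1)
print(p2)
```

Counting the missing values first is a useful habit: it confirms that the number in the warning matches what you expect before you decide to ignore it.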
Scenario
Histograms are useful when you want to find the peak and spread in a distribution. For example, suppose that a company wants to see what its client age distribution looks like. A two-dimensional distribution can show relationships; for example, one can create a scatter plot of the incomes and ages of credit card holders.
Aim
To create and analyze histograms for the given dataset.
Prerequisites
You should be able to use ggplot2 to create a histogram.
This is an empty code file in which the libraries are already loaded. You will be writing your code here.
Steps for Completion
Outcome
Two histograms should be created and compared. The complete code is as follows:
df_t <- read.csv("data/historical-hourly-weather-data/temperature.csv")
ggplot(df_t,aes(x=Vancouver))+geom_histogram()
ggplot(df_t,aes(x=Miami))+geom_histogram()
Refer to the complete code at https://goo.gl/tu7t4y.
Take a look at the following output histogram:
From the preceding plot, we can determine the following information:
Take a look at the following output histogram:
From the preceding plot, we can determine the following information:
Differences
Bar charts are more general than histograms, and they can represent both discrete and continuous data. They can even be used to represent categorical variables. A bar chart uses a horizontal or vertical rectangular bar that levels off at the appropriate value. A bar chart can be used to represent various quantities, such as frequency counts and percentages.
We will use the weather description data to create a bar chart. To create a bar chart, the geometric object used is geom_bar().
The syntax is as follows:
ggplot(….) + geom_bar(…)
If we use the glimpse or str command to view the weather data, we will get the following results:
Use the ggplot(df_vanc,aes(x=Vancouver)) + geom_bar() command to obtain the following chart:
Observations
Vancouver has clear weather, for the most part. It rained about 10,000 times for the dataset provided. Snowy periods are much less frequent.
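To see what geom_bar computes by default, the following sketch compares it with a plain frequency table. The weather data frame and its descriptions are invented for illustration and are not taken from the weather dataset.

```r
library(ggplot2)

# Hypothetical weather descriptions, invented for illustration
weather <- data.frame(
  description = c("sky is clear", "light rain", "sky is clear",
                  "snow", "light rain", "sky is clear")
)

# The frequency of each category
table(weather$description)

# geom_bar() draws one bar per category, with height equal to the row count
p_bar <- ggplot(weather, aes(x = description)) +
  geom_bar()
print(p_bar)

# The counts computed by the bar layer
layer_data(p_bar)$count
```

The bar heights match the frequency table because, by default, geom_bar counts the rows in each category rather than reading a y value from the data.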
We will now perform two exercises, creating a one-dimensional bar chart and a two-dimensional bar chart. A one-dimensional bar chart can give us the counts or frequency of a given variable. A two-dimensional bar chart can give us the relationship between the variables.
In this section, we'll count the number of times each type of weather occurs in Seattle and compare it to Vancouver.
Let's begin by following these steps:
1. Use ggplot and geom_bar in conjunction to create the bar chart:
ggplot(df_vanc,aes(x=Seattle)) + geom_bar()
You should see the following output:
Refer to the complete code at https://goo.gl/tu7t4y.
Answers
It rained on approximately 40% of the days.
A two-dimensional bar chart can be used to plot the sum of a continuous variable versus a categorical or discrete variable. For example, you might want to plot the total amount of rainfall in different weather conditions, or the total amount of sales in different months.
In this section, we'll create a two-dimensional bar chart for the total sales of a company in different months.
Let's begin by following these steps:
1. Load the Lock5Data library by adding require(Lock5Data) into your code. You should have installed this package previously.
2. Review the data with the glimpse(RetailSales) command.
3. Plot a graph of Sales versus Month.
4. Use ggplot + geom_bar(..) to plot this data, as follows:
ggplot(RetailSales,aes(x=Month,y=Sales)) + geom_bar(stat="identity")
A screenshot of the expected outcome is as follows:
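The role of stat="identity" can be sketched with a small example. The sales data frame below is hypothetical and stands in for RetailSales, so the example runs without Lock5Data; geom_col is shown as the ggplot2 shorthand for the same plot.

```r
library(ggplot2)

# Hypothetical monthly sales figures, invented for illustration
sales <- data.frame(
  month = factor(c("Jan", "Feb", "Mar"), levels = c("Jan", "Feb", "Mar")),
  sales = c(120, 95, 140)
)

# stat = "identity" plots the y values as given, instead of counting rows
p_identity <- ggplot(sales, aes(x = month, y = sales)) +
  geom_bar(stat = "identity")

# geom_col() is shorthand for geom_bar(stat = "identity")
p_col <- ggplot(sales, aes(x = month, y = sales)) +
  geom_col()

print(p_identity)
print(p_col)
```

Setting the factor levels explicitly keeps the months in calendar order; otherwise, they would be sorted alphabetically on the x-axis.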
A boxplot (also known as a box and whisker diagram) is a standard way of displaying the distribution of data based on a five-number summary: minimum, first quartile, median, third quartile, and maximum. Boxplots can represent how a continuous variable is distributed for different categories; one of the axes will be a categorical variable, while the other will be a continuous variable. In the default boxplot, the central rectangle spans the first quartile to the third quartile (called the interquartile range, or IQR). A segment inside of the rectangle shows the median, and the lines (whiskers) above and below the box indicate the locations of the minimum and maximum (excluding outliers), as shown in the following diagram:
In ggplot2, each whisker extends from its hinge (the edge of the box) to the most extreme value that lies no further than 1.5 * IQR from that hinge. Here, we can see the humidity data as a function of the month. Data beyond the end of the whiskers are called outliers, and are represented as circles, as seen in the following chart:
You'll get the preceding chart by using the following code:
ggplot(df_hum,aes(x=month,y=Vancouver)) + geom_boxplot()
Read more about boxplots at: http://ggplot2.tidyverse.org/reference/geom_boxplot.html.
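The quantities that a boxplot displays can be computed by hand. The following sketch uses the Temp column of the built-in airquality dataset rather than the humidity data; the variable names are assumptions for illustration, and the hinges from fivenum can differ slightly from the quartiles returned by quantile.

```r
temp <- airquality$Temp

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
fivenum(temp)

# Interquartile range: the height of the box
iqr <- IQR(temp)

# Fences used to decide which points count as outliers
q <- quantile(temp, c(0.25, 0.75))
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Whisker ends: the most extreme observations still inside the fences
whisker_low  <- min(temp[temp >= lower_fence])
whisker_high <- max(temp[temp <= upper_fence])

# Points beyond the whiskers would be drawn individually
outliers <- temp[temp < lower_fence | temp > upper_fence]
outliers
```

When the outliers vector is empty, the whiskers simply reach the minimum and maximum of the data.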
In this section, we'll create a boxplot for monthly temperature data for Seattle and San Francisco, and compare the two (given two points).
Let's begin by implementing the following steps:
Refer to the complete code at https://goo.gl/tu7t4y.
The following observations can be noted:
The humidity is more uniform for San Francisco:
The median humidity for San Francisco is about 75:
Compare this to the humidity data for Seattle and San Francisco on the following websites (scroll down and look for the humidity plots). You should see a similar trend:
https://weather-and-climate.com/average-monthly-Rainfall-Temperature-Sunshine,Seattle,United-States-of-America
https://weather-and-climate.com/average-monthly-Rainfall-Temperature-Sunshine,San-Francisco,United-States-of-America
A scatter plot shows the relationship between two continuous variables. Let's create a scatter plot of distance versus time for a car that is accelerating and traveling with an initial velocity. We will generate some random time points according to a function. The relationship between distance and time for a speeding car is as follows:
distance = v0 * time + 0.5 * a * time^2, where v0 is the initial velocity and a is the acceleration.
We can draw a scatter plot to show the relationship between distance and time with the following code:
ggplot(df,aes(x=time,y=distance)) + geom_point()
We can see a positive correlation, meaning that as time increases, distance increases. Take a look at the following code:
a=3.4
v0=27
time <- runif(50, min=0, max=200)
distance <- sapply(time, function(x) v0*x + 0.5*a*x^2)
df <- data.frame(time,distance)
ggplot(df,aes(x=time,y=distance)) + geom_point()
The outcome is a positive correlation: as time increases, distance increases:
The correlation can also be zero (for no relationship) or negative (as x increases, y decreases).
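The three cases (positive, negative, and no correlation) can be illustrated numerically with the base R cor function. The data below is simulated purely for illustration; the slopes, noise level, and seed are arbitrary assumptions.

```r
set.seed(1)  # make the random draws repeatable

x <- runif(50, min = 0, max = 10)
noise <- rnorm(50)

y_pos  <- 2 * x + noise    # rises as x increases
y_neg  <- -2 * x + noise   # falls as x increases
y_none <- rnorm(50)        # generated independently of x

cor(x, y_pos)   # close to +1
cor(x, y_neg)   # close to -1
cor(x, y_none)  # typically near 0
```

Note that cor measures linear association, so a curved relationship, such as the distance versus time example, can still show a strong positive value without being a straight line.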
A line chart shows the relationship between two variables; it is similar to a scatter plot, but the points are connected by line segments. One difference between the usage of a scatter plot and a line chart is that, typically, it's more meaningful to use the line chart if the variable being plotted on the x-axis has a one-to-one relationship with the variable being plotted on the y-axis. A line chart should be used when you have enough data points, so that a smooth line is meaningful to see a functional dependence:
We could have also used a line chart for the previous plot. The advantage of using a line chart is that the discrete nature goes away and you can see trends more easily, while the functional form is more effectively visualized.
If there is more than one y value for a given x, the data needs to be grouped by the x value; then, one can show the features of interest from the grouped data, such as the mean, median, maximum, minimum, and so on. We will use grouping in the next section.
In this section, we'll create a line chart to plot the mean humidity against the month. Let's begin by implementing the following steps:
df_hum$monthn <- as.numeric(df_hum$month)
gp1 <- group_by(df_hum,monthn)
Use the geom_line() command to plot the line chart (refer to the code).
The following plots are obtained:
Refer to the complete code at https://goo.gl/tu7t4y.
Take a look at the output line chart:
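The group-then-summarize pattern behind this line chart can be sketched with the built-in airquality dataset, which already has a numeric Month column. This is not the chapter's humidity code; the names monthly, mean_temp, and p_line are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Group the rows by month, then reduce each group to its mean temperature
monthly <- airquality %>%
  group_by(Month) %>%
  summarise(mean_temp = mean(Temp))

monthly

# One y value per x value, so a line chart is meaningful
p_line <- ggplot(monthly, aes(x = Month, y = mean_temp)) +
  geom_line() +
  geom_point()
print(p_line)
```

Adding geom_point on top of geom_line keeps the individual monthly means visible, which helps when there are only a few x values.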
Scenario
Suppose that we are in a company, and we have been given an unknown dataset and would like to create similar plots. For example, we have some educational data, and we would like to know what courses are the most popular, or the gender distribution among students, or how satisfied the parents/students are with the courses. We will use the new dataset, along with our own knowledge, to get some information on the preceding points.
Aim
To create one- and two-dimensional visualizations for the new dataset and the given variables.
Steps for Completion
Outcome
Three one-dimensional plots and three two-dimensional plots should be created, with the following axes (count versus topic) and observations. (Note that the students may provide different observations, so the instructor should verify the answers. The following observations are just examples.)
Refer to the complete code at https://goo.gl/tu7t4y.
This visual was chosen because Topic is a categorical variable, and I wanted to see the frequency of each topic:
Observation
You can see that IT is the most popular subject:
gender is a categorical variable; you can choose a bar chart because you want to see the frequency of each category.
Observation
You can observe that more males are registered in this institute from the following histogram:
VisitedResources is numerical, so you can choose a histogram to visualize it.
Observation
It's a bimodal histogram with two peaks, around 12 and 85.
Take a look at the following 2D plots:
Plot 1:
Plot 2:
Plot 3:
Observations
It is also possible to plot using three-dimensional vectors. This creates a three-dimensional plot, which provides enhanced visualization for applications (for example, displaying three-dimensional spaces). Essentially, it is a graph of two functions, embedded into a three-dimensional environment.
Read more about three-dimensional plots at: https://octave.org/doc/v4.2.0/Three_002dDimensional-Plots.html.
The Grammar of Graphics is the language used to describe the various components of a graphic that represent the data in a visualization. Here, we will explore a few aspects of the Grammar of Graphics, building upon some of the features in the graphics that we created in the previous topic. For example, a typical histogram has various components, as follows:
All of these aspects are part of the Grammar of Graphics, and we will change these aspects to provide better visualization. In this chapter, we will work with some of the aspects; we will explore them further in the next chapter.
Read more about the Grammar of Graphics at https://cfss.uchicago.edu/dataviz_grammar_of_graphics.html.
In a histogram, data is grouped into intervals, or ranges of values, called bins. ggplot has a certain number of bins by default, but the default may not be the best choice every time. Having too many bins in a histogram might not reveal the shape of the distribution, while having too few bins might distort the distribution. It is sometimes necessary to rebin a histogram, in order to get a smooth distribution.
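The effect of the bin count can be sketched as follows, using the Temp column of the built-in airquality dataset. The specific bin counts and the 5-unit bin width are arbitrary choices for illustration.

```r
library(ggplot2)

base_plot <- ggplot(airquality, aes(x = Temp))

# Many narrow bins: spiky, hard to see the overall shape
p_many <- base_plot + geom_histogram(bins = 60)

# Few wide bins: smoother, but detail is lost
p_few <- base_plot + geom_histogram(bins = 8)

# binwidth fixes the width of each bin instead of the number of bins
p_width <- base_plot + geom_histogram(binwidth = 5)

print(p_many)
print(p_few)
print(p_width)

# Every observation lands in exactly one bin, whatever the binning
sum(layer_data(p_many)$count)
sum(layer_data(p_few)$count)
```

Rebinning changes how the counts are distributed across bars, not the total, so comparing a few settings side by side is a cheap way to check that an apparent peak is not an artifact of the bin edges.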
Let's use the humidity data and the first plot that we created. It looks like the humidity values are discrete, which is why you can see discrete peaks in the data. In this section, we'll analyze the differences between unbinned and binned histograms.
Let's begin by implementing the following steps:
ggplot(df_hum,aes(x=Vancouver))+geom_histogram(bins=15)
You'll get the following output. Graph 1:
Graph 2:
Choosing a different type of binning can make the distribution more continuous, and one can then better understand the distribution shape. We will now build upon the graph, changing some features and adding more layers.
ggplot(df_hum,aes(x=Vancouver))+geom_histogram(bins=15,fill="white",color=1) +
ggtitle("Humidity for Vancouver city") +
xlab("Humidity")+theme(axis.text.x=element_text(size = 12),axis.text.y=element_text(size=12))
You should see the following output:
The full command should look as follows:
ggplot(df_hum,aes(x=Vancouver))+geom_histogram(bins=15,fill="white",color=1)+ggtitle("Humidity for Vancouver city")+xlab("Humidity")+theme(axis.text.x=element_text(size= 12),axis.text.y=element_text(size=12))
We can see that the second plot is a visual improvement, due to the following factors:
To see what else can be changed, type ?theme.
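As a sketch of a few more theme settings, the following restyles a histogram of the built-in airquality temperatures. The title text, sizes, and the object name p_styled are arbitrary choices for illustration; the theme element names (plot.title, axis.title.x, and so on) are standard ggplot2 arguments.

```r
library(ggplot2)

p_styled <- ggplot(airquality, aes(x = Temp)) +
  geom_histogram(bins = 15, fill = "white", color = "black") +
  ggtitle("Temperature in the airquality data") +
  xlab("Temperature") +
  ylab("Count") +
  theme(
    plot.title   = element_text(size = 16, face = "bold"),  # title text
    axis.title.x = element_text(size = 14),                 # x-axis label
    axis.title.y = element_text(size = 14),                 # y-axis label
    axis.text.x  = element_text(size = 12),                 # x tick labels
    axis.text.y  = element_text(size = 12)                  # y tick labels
  )
print(p_styled)
```

Each layer is added with +, so individual settings can be removed or changed without touching the rest of the plot.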
In this section, we'll use the Grammar of Graphics to change defaults and create a better visualization.
Let's begin by implementing the following steps:
1. Type ?geom_boxplot in the command line, then look for the aesthetics, including the color and the fill color.
2. Type ?theme to find out how to change the label size to 15. Change the x- and y-axis titles to size 15 and the color to red.
The outcome will be the complete code and the graphic with the correct changes:
Refer to the complete code at https://goo.gl/tu7t4y.
Scenario
In the previous activity, you made a judicious choice of a geometric object (bar chart or histogram) for a given variable. In this activity, you will see how to improve a visualization. If you are producing plots to look at privately, you might be okay using the default settings. However, when you are creating plots for publication or giving a presentation, or if your company requires a certain theme, you will need to produce more professional plots that adhere to certain visualization rules and guidelines. This activity will help you to improve visuals and create a more professional plot.
Aim
To create improved visualizations by using the Grammar of Graphics.
Steps for Completion
Refer to the complete code at https://goo.gl/tu7t4y.
Take a look at the following output, histogram 1:
Histogram 2:
In this chapter, we covered the basics of ggplot2, distinguishing between different types of variables and introducing the best practices for visualizing them. You created basic one- and two-dimensional plots, then analyzed the differences between them. You used the Grammar of Graphics to change a basic visual into a better, more professional-looking visual.
In the next chapter, we will build upon these skills, uncovering correlations between variables and using statistical summaries to create more advanced plots.