You're reading from  The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st Edition
Author (1)
Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.

Data Visualization with ggplot2

The previous chapter covered intermediate data processing techniques, focusing on dealing with string data. When the raw data has been transformed and processed into a clean and structured shape, we can take the analysis to the next level by visualizing the clean data in a graph, which we aim to accomplish in this chapter.

By the end of this chapter, you will be able to plot standard graphs using the ggplot2 package and add customizations to present excellent visuals.

In this chapter, we will cover the following topics:

  • Introducing ggplot2
  • Understanding the grammar of graphics
  • Geometries in graphics
  • Controlling themes in graphics

Technical requirements

To complete the exercises in this chapter, you will need the following packages:

  • The ggplot2 package, version 3.3.6. Alternatively, install the tidyverse package and load ggplot2 directly.
  • The ggthemes package, version 4.2.4.

The versions listed above were the latest available at the time of writing.

All the code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/tree/main/Chapter_4.
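As a quick sanity check before starting, the following sketch reports whether the two packages are installed and at least as new as the versions listed above. The variable name `required` is hypothetical and used only for illustration; `requireNamespace()` and `packageVersion()` are base R functions.

```r
# Packages and minimum versions taken from the requirements list above.
# The name `required` is an illustrative choice, not from the book.
required <- c(ggplot2 = "3.3.6", ggthemes = "4.2.4")

for (pkg in names(required)) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Package is missing: suggest the standard installation call.
    message(pkg, " is not installed; try install.packages(\"", pkg, "\")")
  } else if (packageVersion(pkg) < required[[pkg]]) {
    # Installed, but older than the version the chapter was written with.
    message(pkg, " ", as.character(packageVersion(pkg)),
            " is older than ", required[[pkg]])
  } else {
    message(pkg, " ", as.character(packageVersion(pkg)), " is ready to use")
  }
}
```

Newer versions than those listed will usually work as well, although minor differences in default appearance are possible.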

Introducing ggplot2

Conveying information via graphs tends to be more effective and visually appealing than tables alone. After all, humans are much quicker at processing visual information, such as recognizing a car in an image. In building machine learning (ML) models, we are often interested in the training and test loss profile in the form of a line chart that shows how the training and test set loss decrease as the model is trained for longer. Observing performance metrics helps us better diagnose whether a model is underfitting or overfitting, that is, whether the current model is too simple or overly complex. Note that the test set is used to approximate a future dataset, so a low test set error suggests that the model will generalize to new datasets. (Minimizing the average loss on the training set itself is the approach known as empirical risk minimization.) Underfitting refers to the case when the model does poorly in both training and test sets due to insufficient fitting power, while overfitting...

Understanding the grammar of graphics

The previous example contained the three essential layers that need to be specified when plotting a graph: data, aesthetics, and geometries. The primary purpose of each layer is listed as follows:

  • The data layer specifies the dataset to be plotted. This corresponds to the mtcars dataset we specified earlier.
  • The aesthetics layer specifies the scale-related items that map the variables to the visual properties of the plot. Examples include the variables to be shown for the x axis and y axis, the size and color, and other plot aesthetics. This corresponds to the cyl and mpg variables we specified earlier.
  • The geometry layer specifies the visual elements used for the data, such as presenting the data via points, lines, or other forms. The geom_point() command we set in the previous example tells the plot to be shown as a scatter plot.
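The three layers described above can be sketched as follows. This is a minimal reconstruction that assumes, based on the bullets, that the earlier example plotted mpg against cyl from mtcars; the exact code of that example is not shown here, and the object name `p` is illustrative.

```r
library(ggplot2)

# Assumed reconstruction of the earlier example; `p` is an illustrative name.
p <- ggplot(data = mtcars,                    # data layer: the dataset to plot
            mapping = aes(x = cyl, y = mpg)) + # aesthetics layer: variables mapped to axes
  geom_point()                                 # geometry layer: draw the data as points

print(p)
```

Each layer is added with the + operator, so swapping geom_point() for another geometry changes how the same data and mappings are drawn.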

Other layers, such as the theme layer, also help beautify the plot, which we will cover later...

Geometries in graphics

The previous section mostly covered scatter plots. In this section, we will go over two additional common types of plots: bar charts and line plots. We will discuss different ways to construct these plots, focusing on the geometries that can be used to control layer-specific visual properties of the graph.
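As a rough preview of these two plot types, the sketch below draws a bar chart and a line plot. The choice of datasets is an assumption made for illustration: mtcars ships with R and economics ships with ggplot2, but the chapter's own examples may use different data.

```r
library(ggplot2)

# Bar chart: by default, geom_bar() counts the rows in each category.
# Converting cyl to a factor treats it as a categorical variable.
bar_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Number of cylinders", y = "Count")

# Line plot: geom_line() connects observations in order of the x variable.
# `economics` is a monthly US time series bundled with ggplot2.
line_plot <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = "Date", y = "Unemployed (thousands)")

print(bar_plot)
print(line_plot)
```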

Understanding geometry in scatter plots

Let us revisit the scatter plot and zoom in on the geometry layer. The geometry layer determines how the plot actually looks, which makes it essential to visual communication. At the time of writing, there are over 50 geometries we can choose from, all of which start with the geom_ keyword.

Some overall guidelines apply when deciding which type of geometry to use. For example, the following list contains the possible kinds of applicable geometries for a typical scatter plot:

  • Point, which visualizes the data as points
  • Jitter, which adds positional jittering to a scatter plot
  • Abline, which...
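The geometries named in this list correspond to the ggplot2 functions geom_point(), geom_jitter(), and geom_abline(). The sketch below applies each to the mtcars variables used earlier; the jitter width and the use of a fitted regression line for geom_abline() are illustrative assumptions, not choices taken from the chapter.

```r
library(ggplot2)

# Shared data and aesthetics; each plot below adds a different geometry.
base <- ggplot(mtcars, aes(x = cyl, y = mpg))

# Point: one point per observation.
points_plot <- base + geom_point()

# Jitter: add small random horizontal noise so overlapping points separate.
# width = 0.2 is an arbitrary illustrative value; height = 0 leaves mpg unchanged.
jitter_plot <- base + geom_jitter(width = 0.2, height = 0)

# Abline: draw a straight line from an intercept and a slope.
# Here the line comes from a simple linear regression (an assumed use case).
fit <- lm(mpg ~ cyl, data = mtcars)
abline_plot <- base +
  geom_point() +
  geom_abline(intercept = coef(fit)[["(Intercept)"]],
              slope = coef(fit)[["cyl"]])

print(abline_plot)
```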

Controlling themes in graphics

The theme layer specifies all non-data-related properties of the plot, such as the background, legend, and axis labels. Proper control of the theme can aid visual communication by highlighting critical information and directing users’ attention to the message we intend to convey.

There are three types of visual elements controlled by the theme layer, as follows:

  • Text, used to specify the textual display (for example, color) of the axis label
  • Line, used to specify the visual properties of the axes such as color and line type
  • Rectangle, used to control the borders and backgrounds of the plot

All three types are specified using functions that start with element_, including examples such as element_text() and element_line(). We will go over these functions in the following section.
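A minimal sketch of the three element types is shown below, one per type. The specific theme arguments (axis.title, axis.line, panel.background) and the colors are illustrative assumptions; the chapter's own examples may adjust different elements.

```r
library(ggplot2)

themed <- ggplot(mtcars, aes(x = cyl, y = mpg)) +
  geom_point() +
  theme(
    # Text element: color of the axis titles.
    axis.title = element_text(color = "blue"),
    # Line element: color and line type of the axis lines.
    axis.line = element_line(color = "black", linetype = "dashed"),
    # Rectangle element: fill and border of the plotting panel.
    panel.background = element_rect(fill = "grey95", color = "black")
  )

print(themed)
```

Because theme() is added with + like any other layer, these settings can be appended to an existing plot object without rebuilding it.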

Adjusting themes

The theme layer can be easily applied as an additional layer on the existing graph. Let...

Summary

In this chapter, we introduced essential graphics techniques based on the ggplot2 package. We started by going over the basic scatter plot and learned the grammar of developing layers in a plot. To build, edit, and improve a plot, we need to specify three essential layers: data, aesthetics, and geometries. For example, the geom_point() function used to build a scatter plot allows us to control the size, shape, and color of the points on a graph. We can also display the observations as text labels, rather than only as points, using the geom_text() function.

We also covered the layer-specific control provided by the geometry layer and showed examples using bar charts and line plots. A bar chart can represent the frequency distribution of a categorical variable or, in the form of a histogram, the distribution of a continuous variable. A line chart suits time series data and can help identify trends and patterns if appropriately plotted.

Finally, we also covered the theme layer, which allows us to control all non...
