You're reading from  Applied Data Visualization with R and ggplot2

Product type: Book
Published in: Sep 2018
Reading level: Intermediate
ISBN-13: 9781789612158
Edition: 1st Edition
Author (1)
Dr. Tania Moulik

Tania Moulik has a PhD in particle physics. She has worked at CERN, the European Organization for Nuclear Research, and on the Tevatron at Fermi National Accelerator Laboratory in IL, USA. She has years of programming experience in C++, Python, and R. She has also worked in the field of big data and has worked with technologies such as grid computing. She has a passion for data analysis and would like to share her passion with others who would like to delve into the world of data analytics. She especially likes R and ggplot2 as a powerful analytics package.

The Grammar of Graphics


The Grammar of Graphics is the language used to describe the various components of a graphic that represent the data in a visualization. Here, we will explore a few aspects of the Grammar of Graphics, building upon some of the features in the graphics that we created in the previous topic. For example, a typical histogram has various components, as follows:

  • The data itself (x)
  • Bars representing the frequency of x at different values of x
  • The scaling of the data (linear)
  • The coordinate system (Cartesian)

All of these aspects are part of the Grammar of Graphics, and we will change these aspects to provide better visualization. In this chapter, we will work with some of the aspects; we will explore them further in the next chapter.
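As a rough illustration, the histogram components listed above can be written out explicitly as layers. This is a minimal sketch on simulated data: the data frame `df_demo` and its column `x` are hypothetical names, and the explicit scale and coordinate calls only restate what ggplot2 already applies by default.

```r
library(ggplot2)

# Hypothetical data: 500 simulated values standing in for a real variable
set.seed(1)
df_demo <- data.frame(x = rnorm(500))

p_demo <- ggplot(df_demo, aes(x = x)) +   # the data and the mapping of x
  geom_histogram(bins = 30) +             # bars showing the count in each bin
  scale_x_continuous() +                  # linear scaling of the data
  coord_cartesian()                       # Cartesian coordinate system

print(p_demo)
```

Changing one component, such as the number of bins, leaves the other components untouched.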

Note

Read more about the Grammar of Graphics at https://cfss.uchicago.edu/dataviz_grammar_of_graphics.html.

Rebinning

In a histogram, data is grouped into intervals, or ranges of values, called bins. ggplot2 uses 30 bins by default, but the default may not be the best choice every time. Having too many bins in a histogram might not reveal the shape of the distribution, while having too few bins might distort the distribution. It is sometimes necessary to rebin a histogram in order to get a smooth distribution.
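To see the effect of the bin count, the sketch below draws the same variable with too many, too few, and a moderate number of bins. It uses simulated, rounded values as a stand-in for the humidity data, so `df_sim` is a hypothetical data frame rather than the `df_hum` used in the book. The `binwidth` argument is shown as an alternative way to control the binning.

```r
library(ggplot2)

# Hypothetical stand-in for df_hum: rounded values clipped to the 0-100 range
set.seed(42)
hum <- round(rnorm(1000, mean = 75, sd = 12))
df_sim <- data.frame(Vancouver = pmin(pmax(hum, 0), 100))

p_base <- ggplot(df_sim, aes(x = Vancouver))

p_many <- p_base + geom_histogram(bins = 200)    # more bins than distinct values: gaps appear
p_few  <- p_base + geom_histogram(bins = 3)      # too coarse: the shape is lost
p_mid  <- p_base + geom_histogram(bins = 15)     # a moderate choice: smoother shape
p_bw   <- p_base + geom_histogram(binwidth = 5)  # set the bin width instead of the count

print(p_mid)
```

Printing `p_many`, `p_few`, and `p_bw` in turn shows how the same data can look quite different under each choice.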

Analyzing Various Histograms

Let's use the humidity data and the first plot that we created. It looks like the humidity values are discrete, which is why you can see discrete peaks in the data. In this section, we'll analyze the differences between unbinned and binned histograms.

Let's begin by implementing the following steps:

  1. Choosing a different type of binning can make the distribution more continuous; use the following code:
ggplot(df_hum, aes(x = Vancouver)) + geom_histogram(bins = 15)

You'll get the following output (Graph 1 and Graph 2):

Note

Choosing a different type of binning can make the distribution more continuous, and one can then better understand the distribution shape. We will now build upon the graph, changing some features and adding more layers.

  2. Change the fill color to white and the outline color to black (color=1) by using the following command:
ggplot(df_hum, aes(x = Vancouver)) + geom_histogram(bins = 15, fill = "white", color = 1)
  3. Add a title to the histogram by appending the following to the command:
+ ggtitle("Humidity for Vancouver city")
  4. Change the x-axis label and the axis text sizes by appending the following:
+ xlab("Humidity") + theme(axis.text.x = element_text(size = 12), axis.text.y = element_text(size = 12))

You should see the following output:

Note

The full command should look as follows:
ggplot(df_hum, aes(x = Vancouver)) +
  geom_histogram(bins = 15, fill = "white", color = 1) +
  ggtitle("Humidity for Vancouver city") +
  xlab("Humidity") +
  theme(axis.text.x = element_text(size = 12), axis.text.y = element_text(size = 12))
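Because each `+` adds a layer to an existing plot object, the same result can also be built up step by step by storing the plot in a variable. The sketch below assumes simulated data in place of `df_hum`; the names `df_sim` and `p` are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for df_hum (simulated, rounded, capped at 100)
set.seed(42)
df_sim <- data.frame(Vancouver = pmin(round(rnorm(1000, 75, 12)), 100))

# Base histogram with 15 bins, white fill, and black outline
p <- ggplot(df_sim, aes(x = Vancouver)) +
  geom_histogram(bins = 15, fill = "white", color = "black")

# Add a title to the stored plot
p <- p + ggtitle("Humidity for Vancouver city")

# Relabel the x-axis and enlarge the axis tick labels
p <- p + xlab("Humidity") +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12))

print(p)
```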

We can see that the second plot is a visual improvement, due to the following factors:

  • There is a title
  • The axis text is large enough to read
  • The histogram looks more professional in white

To see what else can be changed, type ?theme.
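The help page lists many theme elements. As one hedged illustration, the sketch below adjusts the plot title and the axis titles in addition to the axis text. The data frame `df_sim` is simulated and hypothetical, and the specific sizes are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in for df_hum (simulated, rounded, capped at 100)
set.seed(42)
df_sim <- data.frame(Vancouver = pmin(round(rnorm(1000, 75, 12)), 100))

p_theme <- ggplot(df_sim, aes(x = Vancouver)) +
  geom_histogram(bins = 15, fill = "white", color = "black") +
  ggtitle("Humidity for Vancouver city") +
  xlab("Humidity") +
  theme(
    plot.title = element_text(size = 16, hjust = 0.5),  # larger, centered title
    axis.title = element_text(size = 14),               # both axis titles
    axis.text  = element_text(size = 12)                # both sets of tick labels
  )

print(p_theme)
```

Setting `axis.title` or `axis.text` applies to both axes at once, while the `.x` and `.y` variants used earlier target a single axis.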

Changing Boxplot Defaults Using the Grammar of Graphics

In this section, we'll use the Grammar of Graphics to change defaults and create a better visualization.

Let's begin by implementing the following steps:

  1. Use the humidity data to create the same boxplot seen in the previous section, for plotting monthly data.
  2. Change the x- and y-axis labels appropriately (the x-axis is the month and the y-axis is the humidity).
  3. Type ?geom_boxplot in the command line, then look for the aesthetics, including the color and the fill color.
  4. Change the color to black and the fill color to green (try numbers from 1-6).
  5. Type ?theme to find out how to change the label size to 15. Change the x- and y-axis titles to size 15 and the color to red.

The outcome will be the complete code and the graphic with the correct changes:

Note

Refer to the complete code at https://goo.gl/tu7t4y.
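The linked code is the reference solution. Independently of it, a minimal sketch of the steps above might look as follows. It runs on simulated data, so the data frame `df_month` and its `month` column are assumptions, and the real `df_hum` may be organized differently.

```r
library(ggplot2)

# Hypothetical monthly humidity data: 50 simulated readings per month
set.seed(7)
df_month <- data.frame(
  month = factor(rep(month.abb, each = 50), levels = month.abb),
  Vancouver = pmin(round(rnorm(600, mean = 75, sd = 12)), 100)
)

p_box <- ggplot(df_month, aes(x = month, y = Vancouver)) +
  geom_boxplot(color = "black", fill = "green") +   # outline and fill colors
  xlab("Month") + ylab("Humidity") +                # axis labels
  theme(axis.title.x = element_text(size = 15, color = "red"),
        axis.title.y = element_text(size = 15, color = "red"))

print(p_box)
```

Color names are used here for readability; small integers such as 1 are also accepted and refer to positions in R's color palette.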

Activity: Improving the Default Visualization

Scenario

In the previous activity, you made a judicious choice of a geometric object (bar chart or histogram) for a given variable. In this activity, you will see how to improve a visualization. If you are producing plots to look at privately, you might be okay using the default settings. However, when you are creating plots for publication or giving a presentation, or if your company requires a certain theme, you will need to produce more professional plots that adhere to certain visualization rules and guidelines. This activity will help you to improve visuals and create a more professional plot.

Aim

To create improved visualizations by using the Grammar of Graphics.

Steps for Completion

  1. Create two of the plots from the previous activity.
  2. Use the Grammar of Graphics to improve your graphics by layering upon the base graphic.

Note

Refer to the complete code at https://goo.gl/tu7t4y.
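When several plots must follow the same guidelines, one possible approach is to collect the styling in a list and add it to each base graphic. The sketch below is an assumption about how this could be organized, not the activity's solution: `house_style`, `df_act`, and the plot names are hypothetical, and the data is simulated.

```r
library(ggplot2)

# Hypothetical shared style: a list of components that can be added to any plot
house_style <- list(
  theme_bw(),
  theme(plot.title = element_text(size = 16),
        axis.title = element_text(size = 14),
        axis.text  = element_text(size = 12))
)

# Simulated data with one continuous and one categorical variable
set.seed(3)
df_act <- data.frame(
  value = rnorm(300),
  group = sample(c("A", "B", "C"), 300, replace = TRUE)
)

# Histogram for the continuous variable, layered on the base graphic
p_hist <- ggplot(df_act, aes(x = value)) +
  geom_histogram(bins = 20, fill = "white", color = "black") +
  ggtitle("Distribution of value") +
  house_style

# Bar chart for the categorical variable, using the same style
p_bar <- ggplot(df_act, aes(x = group)) +
  geom_bar(fill = "white", color = "black") +
  ggtitle("Count by group") +
  house_style

print(p_hist)
print(p_bar)
```

Keeping the style in one object means a later change to the guidelines only needs to be made in one place.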

Take a look at the following output, histogram 1:

Histogram 2:
