You're reading from  R Bioinformatics Cookbook - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781837634279
Edition: 2nd Edition
Author: Dan MacLean

Professor Dan MacLean has a PhD in molecular biology from the University of Cambridge and gained postdoctoral experience in genomics and bioinformatics at Stanford University in California. Dan is now an honorary professor at the School of Computing Sciences at the University of East Anglia. He has worked in bioinformatics and plant pathogenomics, specializing in R and Bioconductor, and has developed analytical workflows in bioinformatics, genomics, genetics, image analysis, and proteomics at the Sainsbury Laboratory since 2006. Dan has developed and published software packages in R, Ruby, and Python, with over 100,000 downloads combined.

ggplot2 and Extensions for Publication Quality Plots

Clear and informative data visualizations are among the most important tools that bioinformaticians have for communicating complex data and findings to other scientists in the field. They allow large and complex datasets to be explored and understood efficiently. Creating a good visualization is a highly iterative process, and many drafts are discarded before a final one is settled on, so it is important that we have plotting tools that make plot creation and customization quick and easy.

ggplot2 is a popular data visualization library in R that provides an elegant solution for bioinformaticians. It is based on the Grammar of Graphics, a principle that allows users to easily create complex and customizable visualizations by breaking them down into small, modular components, defined by a consistent interface. These make ggplot2 highly flexible and allow for the creation of a wide variety...

Technical requirements

We will use renv to manage packages in a project-specific way. To use renv to install packages, you will first need to install the renv package itself. Here’s how to install renv and then use it to install packages:

  1. Run the following command in your R console:
    install.packages("renv")
  2. Next, you will need to create a new renv environment for your project by running the following command:
    renv::init()
  3. This will create a new directory called renv in your current project directory, along with an renv.lock lockfile and a project-level .Rprofile.
  4. You can then install packages with the following command:
    install.packages("<package name>")
  5. You can also use the renv package manager to install Bioconductor packages by running the following command:
    renv::install("bioc::<package name>")
  6. For example, to install the Biobase package, you would run the following command:
    renv::install("bioc::Biobase")
  7. You can use renv to install development packages from GitHub...
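Taken together, the steps above might look like the following in a fresh project. This is a minimal sketch rather than a prescribed workflow: the `<user>/<repo>` GitHub specification is a placeholder, and the renv::snapshot() and renv::restore() calls are not among the steps listed here; they are shown as an assumption about how you would normally record and reproduce the installed versions.

```r
# One-off: install renv itself into the user library
install.packages("renv")

# Initialise a project-local library in the current working directory
renv::init()

# Install a CRAN package into the project library
renv::install("ggplot2")

# Install a Bioconductor package using the bioc:: prefix
renv::install("bioc::Biobase")

# Install a development version from GitHub
# (placeholder user and repository names, so left commented out)
# renv::install("<user>/<repo>")

# Record the exact package versions in renv.lock
renv::snapshot()

# A collaborator can later rebuild the same library from renv.lock
# renv::restore()
```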

Combining many plot types in ggplot2

The layer model of ggplot2 is a key feature of the library that allows users to create complex visualizations by building up layers of data, aesthetics, and geoms. Each layer represents a different aspect of the plot, and they are added on top of each other to create the final visualization. In this recipe, we’ll use the layer model to create a complex plot of data in the palmerpenguins package. It may be helpful to inspect the data in R directly by printing it to the screen. Also, the package is well documented at https://allisonhorst.github.io/palmerpenguins/, should you wish to look more into how it was generated.

Getting ready

Install the ggplot2 and palmerpenguins packages.

How to do it…

We can use the layer system to combine multiple plot types as follows:

  1. Create the base for the plot:
    library(ggplot2)
    library(palmerpenguins)
    p <- ggplot(data = penguins) +
      aes(x = bill_length_mm, y = bill_depth_mm...
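To make the layer idea concrete, here is a minimal sketch that starts from a base plot like the one in step 1 and adds geoms one at a time. The choice of geoms (points, a per-species linear fit, and a rug) and the colour mapping are our own assumptions for illustration and are not necessarily those used in the rest of the recipe.

```r
library(ggplot2)
library(palmerpenguins)

# Base plot: data and aesthetic mappings only, no geoms yet
# (the colour mapping is an assumption made for this sketch)
p <- ggplot(data = penguins) +
  aes(x = bill_length_mm, y = bill_depth_mm, colour = species)

# Each `+` adds one layer, drawn on top of the previous ones
p_layers <- p +
  # layer 1: the raw observations
  geom_point(alpha = 0.6, na.rm = TRUE) +
  # layer 2: a linear trend line per species
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  # layer 3: marginal ticks along the x axis
  geom_rug(sides = "b", alpha = 0.3, na.rm = TRUE)

p_layers
```

Because every layer inherits the data and aesthetics of the base plot, swapping or removing a geom does not require touching the others.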

Comparing changes in distributions with ggridges

Ridge plots, also known as joyplots, allow multiple distributions to be compared clearly in a single plot. The ggridges R package provides an easy-to-use implementation that draws the distribution of a single variable for each group or category as partially overlapping density curves stacked along a shared axis. The package also allows for easy customization of plot features such as color, fill, and theme. In this recipe, we will look at implementing some useful ridge plots.

Getting ready

We will need the ggplot2, ggridges, and palmerpenguins packages.

How to do it…

We can look at the changes in distributions using the following steps:

  1. Plot overlapping distributions:
    library(ggplot2)
    library(ggridges)
    library(palmerpenguins)
    ggplot...
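As a rough illustration of what an overlapping ridge plot can look like, the following sketch plots flipper length per species. The variables chosen and the scale and alpha values are assumptions made for this example, not necessarily the ones the recipe uses.

```r
library(ggplot2)
library(ggridges)
library(palmerpenguins)

ridge_plot <- ggplot(penguins) +
  aes(x = flipper_length_mm, y = species, fill = species) +
  # scale > 1 lets neighbouring ridges overlap;
  # alpha keeps the ridges behind visible
  geom_density_ridges(scale = 1.5, alpha = 0.6, na.rm = TRUE) +
  # ggridges ships a theme suited to ridge plots
  theme_ridges() +
  # the y axis already names the groups, so drop the legend
  theme(legend.position = "none")

ridge_plot
```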

Customizing plots with ggeasy

Customizing plots in ggplot2 can be a little unintuitive. The key tool is the theme() function, which allows users to customize the non-data elements of the plot’s overall appearance. Although the theme() function is powerful, it does require the user to manually specify each element of the plot, such as axis text, titles, and legends. The ggeasy package, built on top of ggplot2, aims to make plot customization more accessible by providing a simpler, more intuitive syntax for many common customization tasks. ggeasy provides a set of simple wrapper functions around theme() that make the important things a lot easier to remember. With this recipe, we’ll look at customizing labels, legends, and axes in a plot created initially in ggplot2.

Getting ready

We’ll need the ggplot2, ggeasy, and palmerpenguins packages.

How to do it…

We can customize a plot as follows.

Make a base plot...
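The following sketch shows the general pattern: build a plot in ggplot2, then add ggeasy helpers instead of hand-written theme() calls. The base plot and the three helpers chosen here are our own assumptions for illustration; the recipe itself may use a different plot and different helpers.

```r
library(ggplot2)
library(ggeasy)
library(palmerpenguins)

# A plain ggplot2 plot to customize (an assumed example plot)
base_plot <- ggplot(penguins) +
  aes(x = species, y = body_mass_g, fill = island) +
  geom_boxplot(na.rm = TRUE) +
  labs(title = "Penguin body mass by species")

# Each ggeasy helper wraps a theme() setting behind a descriptive name
easy_plot <- base_plot +
  easy_rotate_x_labels(angle = 45, side = "right") +  # tilt x axis labels
  easy_move_legend(to = "bottom") +                   # reposition the legend
  easy_center_title()                                 # centre the plot title

easy_plot
```

The helpers are added with `+` like any other ggplot2 component, so they can be mixed freely with ordinary theme() calls.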

Highlighting selected values in busy plots with gghighlight

Bioinformatics datasets often comprise measurements of many items. The genomes we analyze have thousands of genes, but usually, we’re only interested in the few that respond to particular changes in the experiment we have designed. So, it’s of great use to be able to highlight those few in our plots. In this recipe, we’ll look at the gghighlight package, which can make that very easy.

Getting ready

We’ll need the gghighlight, ggplot2, and rbioinfcookbook packages for the main functions. We’ll also use dplyr briefly. The datasets for this recipe are fission yeast wt versus mutant gene expression data and an Arabidopsis treatment timecourse. The columns in the fission yeast data hold the log2 ratio of gene expression in mutant versus wt and the p-value from a statistical test.

How to do it…

We can highlight selected values in a plot such as a gene expression plot using the following steps...
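As a sketch of the idea, the example below highlights the few points that pass a threshold and greys out the rest. It uses simulated data rather than the rbioinfcookbook datasets; the data frame `expr`, its column names, and the thresholds are all hypothetical.

```r
library(ggplot2)
library(gghighlight)

# Simulated stand-in for a gene expression table (hypothetical columns)
set.seed(1)
expr <- data.frame(
  gene = paste0("gene", seq_len(500)),
  log2_ratio = rnorm(500),
  p_value = runif(500)
)

highlight_plot <- ggplot(expr) +
  aes(x = log2_ratio, y = -log10(p_value)) +
  geom_point() +
  # Points matching the predicate keep their colour; all others are greyed out.
  # use_direct_label = FALSE turns off automatic text labels.
  gghighlight(abs(log2_ratio) > 1.5 & p_value < 0.05,
              use_direct_label = FALSE)

highlight_plot
```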

Plotting variability and confidence intervals better with ggdist

Confidence intervals are used to make inferences about a population based on a sample of data. They capture the uncertainty in an estimate by providing a range of plausible values for some parameter, rather than a single point estimate. The width of the interval reflects how precisely the parameter has been estimated, and the confidence level (for example, 95%) describes how often intervals constructed in this way would contain the true population parameter. It is common to show distributions and annotate them with range markers or confidence intervals. With this recipe, we will look at how to use the ggdist extension to ggplot2 to make informative and great-looking plots of distributions.

Getting ready

For this recipe, we need the ggdist, ggplot2, and palmerpenguins packages.

How to do it…

We can create plots with confidence intervals as follows:

  1. Create a raincloud plot:
    library(ggplot2)
    library(ggdist)
    library(palmerpenguins)
    ggplot(penguins) +
      aes(x = flipper_length_mm, y = island) +
      geom_dots...
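A raincloud plot combines a density ("cloud"), the individual observations ("rain"), and an interval summary. The sketch below is one way to assemble those pieces with ggdist and is an assumption on our part, not necessarily the recipe's exact code. Note that stat_pointinterval() summarizes the data with a median and quantile intervals by default, which describe the spread of the observations and are not confidence intervals for a parameter.

```r
library(ggplot2)
library(ggdist)
library(palmerpenguins)

raincloud <- ggplot(penguins) +
  aes(x = flipper_length_mm, y = island) +
  # the "cloud": a density slab drawn above each baseline
  stat_slab(scale = 0.6, na.rm = TRUE) +
  # median point with 66% and 95% quantile intervals of the data
  stat_pointinterval(.width = c(0.66, 0.95), na.rm = TRUE) +
  # the "rain": one dot per observation, drawn below each baseline
  geom_dots(side = "bottom", scale = 0.4, na.rm = TRUE)

raincloud
```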

Making interactive plots with plotly

Interactive plots are great tools for data exploration, allowing users to interactively explore large datasets to gain insights and identify patterns in data. They are useful for programmers wishing to create dashboards for visualizing real-time data and for interactive presentations that communicate complex data relationships in an engaging manner. plotly is a data visualization library for creating interactive plots in Python, R, and JavaScript. In R, the ggplotly() function in the plotly package allows you to convert static ggplot2 visualizations to interactive plots through a high-level interface. In this recipe, we’ll create a fairly involved ggplot2 visualization of mutation sites on a genome and then convert it to plotly to get a great first-level interaction layer.

Getting ready

We’ll need the ggplot2, plotly, and rbioinfcookbook packages...
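The conversion itself is a single call to ggplotly(). The sketch below uses simulated mutation data in place of the rbioinfcookbook dataset; the data frame `mutations` and its columns are hypothetical, and the plot is much simpler than the one the recipe builds.

```r
library(ggplot2)
library(plotly)

# Simulated stand-in for mutation sites on a genome (hypothetical columns)
set.seed(42)
mutations <- data.frame(
  position = sort(sample.int(1e6, 200)),
  allele_freq = runif(200),
  type = sample(c("SNP", "indel"), 200, replace = TRUE)
)

# An ordinary static ggplot2 plot
static_plot <- ggplot(mutations) +
  aes(x = position, y = allele_freq, colour = type) +
  geom_point()

# Convert it to an interactive widget; tooltip selects the
# aesthetics shown when hovering over a point
interactive_plot <- ggplotly(static_plot, tooltip = c("x", "y", "colour"))

# Print interactive_plot in an interactive session to open the widget
```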

Clarifying label placement with ggrepel

Bioinformatics datasets often have many thousands of data points. These can be genomic positions or genes within a genome, and as part of our data analysis, we will frequently want to label positions or genes so that the reader can identify them. A problem arises in that the labels can easily overlap or clash in the plots. The ggrepel package provides geoms for ggplot2 that allow for labels to be positioned much more clearly, incorporating label layout algorithms that make labels and connecting lines repel intelligently. In this recipe, we’ll look at the most important options for applying that to a genomics dataset.

Getting ready

We’ll need the ggplot2 and ggrepel packages and the fission yeast gene expression dataset in the rbioinfcookbook data package. This data frame contains one column each for the yeast gene ID, the log2 fold change in expression for that gene, and the p-value from a statistical test.

How to do it…...
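A minimal sketch of the approach is shown below, using simulated data in place of the fission yeast dataset. The data frame `expr`, its column names, and the choice to label the ten smallest p-values are assumptions made for illustration.

```r
library(ggplot2)
library(ggrepel)

# Simulated stand-in for a gene expression table (hypothetical columns)
set.seed(7)
expr <- data.frame(
  gene = paste0("gene", seq_len(300)),
  log2_fc = rnorm(300),
  p_value = runif(300)
)

# Label only a handful of genes: here, the ten smallest p-values
to_label <- expr[order(expr$p_value)[1:10], ]

repel_plot <- ggplot(expr) +
  aes(x = log2_fc, y = -log10(p_value)) +
  geom_point(colour = "grey60") +
  # Labels are pushed away from each other and from the points;
  # max.overlaps = Inf keeps every label, and
  # min.segment.length = 0 always draws the connecting line
  geom_text_repel(
    data = to_label,
    aes(label = gene),
    max.overlaps = Inf,
    min.segment.length = 0
  )

repel_plot
```

Passing a filtered data frame to the label layer is a common way to avoid labelling every point in a dataset with thousands of rows.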

Zooming and making callouts from selected plot sections with facetzoom

We’ve already seen in these recipes how bioinformatics datasets can encompass very large scales. Genomes can be thousands of millions of bases long and contain tens of thousands of genes, taxa can have thousands of members, and biomes can have billions of individuals living in areas of a wide range of sizes. Contextual information is therefore often important in analysis and visualization; we may want to see a detail of some subset of data in its original broader context. We can do that by using plots with callout-style subplots—zoomed-in areas drawn alongside the wider data. In this recipe, we will look at using the facet zoom functionality in the ggforce package to look at an area of interest in a ggplot.

Getting ready

We’ll use the ggplot2, ggforce, palmerpenguins, and rbioinfcookbook packages for the main part of this recipe. The allele_freq and penguins datasets will be the basis...
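For orientation, here is a minimal sketch of facet_zoom() on the penguins data only. The choice to zoom on the Gentoo penguins and the removal of incomplete rows beforehand are our own assumptions for this example; the recipe itself may zoom on a different region or dataset.

```r
library(ggplot2)
library(ggforce)
library(palmerpenguins)

# Drop rows with missing measurements so the zoom range is well defined
penguins_complete <- na.omit(
  penguins[, c("species", "bill_length_mm", "bill_depth_mm")]
)

zoom_plot <- ggplot(penguins_complete) +
  aes(x = bill_length_mm, y = bill_depth_mm, colour = species) +
  geom_point() +
  # Draw the full plot plus a zoomed panel whose x range
  # covers only the rows matching the condition
  facet_zoom(x = species == "Gentoo")

zoom_plot
```

facet_zoom() also accepts explicit limits through its xlim and ylim arguments when the region of interest is a coordinate range rather than a subset of rows.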
