R Bioinformatics Cookbook

By Dan MacLean
  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Performing Quantitative RNAseq

About this book

Handling biological data effectively requires an in-depth knowledge of machine learning techniques and computational skills, along with an understanding of how to use tools such as edgeR and DESeq. With the R Bioinformatics Cookbook, you’ll explore all this and more, tackling common and not-so-common challenges in the bioinformatics domain using real-world examples.

This book will use a recipe-based approach to show you how to perform practical research and analysis in computational biology with R. You will learn how to effectively analyze your data with the latest tools in Bioconductor, ggplot, and tidyverse. The book will guide you through the essential tools in Bioconductor to help you understand and carry out protocols in RNAseq, phylogenetics, genomics, and sequence analysis. As you progress, you will get up to speed with how machine learning techniques can be used in the bioinformatics domain. You will gradually develop key computational skills such as creating reusable workflows in R Markdown and packages for code reuse.

By the end of this book, you’ll have gained a solid understanding of the most important and widely used techniques in bioinformatic analysis and the tools you need to work with real biological data.

Publication date:
October 2019


Performing Quantitative RNAseq

The technology of RNAseq has revolutionized the study of transcript abundances, bringing high-sensitivity detection and high-throughput analysis. Bioinformatic analysis pipelines using RNAseq data typically start with a read quality control step followed by either alignment to a reference or the assembly of sequence reads into longer transcripts de novo. After that, transcript abundances are estimated with read counting and statistical models and differential expression between samples is assessed. Naturally, there are many technologies available for all steps of this pipeline. The quality control and read alignment steps will usually take place outside of R, so analysis in R will begin with a file of transcript or gene annotations (such as GFF and BED files) and a file of aligned reads (such as BAM files). 

The tools in R for performing analysis are powerful and flexible. Many of them are part of the Bioconductor suite and, as such, integrate together very nicely. The key question researchers wish to answer with RNAseq is usually: Which transcripts are differentially expressed? In this chapter, we'll look at some recipes for that in standard cases where we already know the genomic positions of genes we're interested in, and in cases where we need to find unannotated transcripts. We'll also look at other important recipes that help answer the questions How many replicates are enough? and Which allele is expressed more?

In this chapter, we will cover the following recipes:

  • Estimating differential expression with edgeR
  • Estimating differential expression with DESeq2
  • Power analysis with powsimR
  • Finding unannotated transcribed regions with GRanges objects
  • Finding regions showing high expression ab initio with bumphunter
  • Differential peak analysis
  • Estimating batch effects using SVA
  • Finding allele-specific expression with AllelicImbalance
  • Plotting and presenting RNAseq data

Technical requirements

The sample data you'll need is available from this book's GitHub repository: https://github.com/PacktPublishing/R-Bioinformatics_Cookbook. If you want to use the code examples as they are written, then you will need to make sure that this data is in a sub-directory of whatever your working directory is.

Here are the R packages that you'll need. Most of these will install with install.packages()others are a little more complicated:

  • Bioconductor
    • AllelicImbalance
    • bumphunter 
    • csaw
    • DESeq
    • edgeR
    • IRanges
    • Rsamtools
    • rtracklayer
    • sva
    • SummarizedExperiment
    • VariantAnnotation
  • dplyr
  • extRemes
  • forcats
  • magrittr
  • powsimR
  • readr 

Bioconductor is huge and has its own installation manager. You can install it with the following code:

if (!requireNamespace("BiocManager"))
 Further information is available at https://www.bioconductor.org/install/.

Normally, in R, a user will load a library and use the functions directly by name. This is great in interactive sessions but it can cause confusion when many packages are loaded. To clarify which package and function I'm using at a given moment, I will occasionally use the packageName::functionName() convention. 

Sometimes, in the middle of a recipe, I'll interrupt the code so you can see some intermediate output or the structure of an object it's important to understand. Whenever that happens, you'll see a code block where each line begins with ## (double hash symbols). Consider the following command:


This will give us output as follows:

## a b c d e

Note that the output lines are prefixed with ##.


Estimating differential expression with edgeR

edgeR is a widely used and powerful package that implements negative binomial models suitable for sparse count data such as RNAseq data in a general linear model framework, which are powerful for describing and understanding count relationships and exact tests for multi-group experiments. It uses a weighted style normalization called TMM, which is the weighted mean of log ratio between sample and control, after removal of genes with high counts and outlying log ratios. The TMM value should be close to one, but can be used as a correction factor to be applied to overall library sizes

In this recipe, we'll look at some options from preparing read counts for annotated regions in some object to identifying the differentially expressed features in a genome. Usually, there is an upstream step requiring us to take high-throughput sequence reads, align them to a reference and produce files describing those alignments, such as .bam files. With those files prepared, we'd fire up R and start to analyze. So that we can concentrate on the differential expression analysis part of the process, we'll use a prepared dataset for which all of the data is ready. Chapter 8, Working with Databases and Remote Data Sources, shows you how to go from raw data to this stage if you're looking for how to do that step.

As there are many different tools and methods for getting those alignments of reads, we will look at starting the process with two common input object types. We'll use a count table, like that we would have if we were loading from a text file and we'll use an ExpressionSet (eset) object, which is an object type common in Bioconductor.

Our prepared dataset will be the modencodefly data from the NHGRI encyclopedia of DNA elements project for the model organism, Drosophila melanogaster. You can read about this project at www.modencode.org. The dataset contains 147 different samples for D. melanogaster, a fruit fly with an approximately 110 Mbp genome, annotated with about 15,000 gene features. 

Getting ready

The data is provided as both a count matrix and an ExpressionSet object and you can see the Appendix at the end of this book for further information on these object types. The data is in this book's code and data repository at https://github.com/PacktPublishing/R_Bioinformatics_Cookbook under datasets/ch1/modencodefly_eset.RDatadatasets/ch1/modencodefly_count_table.txt, and datasets/ch1/modencodelfy_phenodata.txtWe'll also use the edgeR (from Bioconductor), readr, and magrittr libraries.

How to do it...

We will see two ways of estimating differential expressions with edgeR.

Using edgeR from a count table

For estimating differential expressions with edgeR from a count table (for example, in a text file), we will use the following steps:

  1. Load the count data:
count_dataframe <- readr::read_tsv(file.path(getwd(), "datasets", "ch1", "modencodefly_count_table.txt" ))
genes <- count_dataframe[['gene']]
count_dataframe[['gene']] <- NULL
count_matrix <- as.matrix(count_dataframe)
rownames(count_matrix) <- genes
pheno_data <- readr::read_table2(file.path(getwd(), "datasets", "ch1", "modencodefly_phenodata.txt"))
  1. Specify experiments of interest:
experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which( pheno_data[['stage']] %in% experiments_of_interest )
  1. Form the grouping factor:
grouping <- pheno_data[['stage']][columns_of_interest] %>%
  1. Form the subset of count data:
counts_of_interest <-  count_matrix[,columns_of_interest]
  1. Create the DGE object:
count_dge <- edgeR::DGEList(counts = counts_of_interest, group = grouping)
  1. Perform differential expression analysis:
design <- model.matrix(~ grouping)
eset_dge <- edgeR::estimateDisp(eset_dge, design)
fit <- edgeR::glmQLFit(eset_dge, design)
result <- edgeR::glmQLFTest(fit, coef=2)

Using edgeR from an ExpressionSet object

Estimating using edgeR from our prepared eset object can be done using the following steps:

  1. Load the eset data:
load(file.path(getwd(), "datasets/ch1/modencodefly_eset.RData"))
  1. Specify experiments of interest:
experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which( phenoData(modencodefly.eset)[['stage']] %in% experiments_of_interest )
  1. Form the grouping factor:
grouping <- droplevels(phenoData(modencodefly.eset)[['stage']][columns_of_interest] )
  1. Form the subset of count data:
counts_of_interest <- exprs(modencodefly.eset)[, columns_of_interest]
  1. Create the DGE object:
eset_dge <- edgeR::DGEList(
counts = counts_of_interest,
group = grouping
  1. Perform differential expression analysis:
design <- model.matrix(~ grouping)
eset_dge <- edgeR::estimateDisp(eset_dge, design)

fit <- edgeR::glmQLFit(eset_dge, design)
result <- edgeR::glmQLFTest(fit, coef=2)

How it works...

We saw two ways of estimating differential expression with edgeR. In the first half of this recipe, we used edgeR starting with our data in a text file.

Using edgeR from a count table

In step 1, we use the read_tsv() function in the readr package to load the tab delimited text file of counts into a dataframe called count_dataframe. Then, from that, we extract the 'gene' column to a new variable, genes, and erase it from count_dataframe, by assigning NULL. This is all done so we can easily convert into the count_matrix matrix with the base as.matrix() function and add the gene information back as rownames. Finally, we load the phenotype data we'll need from file using the readr read_table2() function. 

Step 2 is concerned with working out which columns in count_matrix we want to use. We define a variable, experiments_of_interest, which holds the column names we want and then use the %in% operator and which() functions to create a binary vector that matches the number of columns. If, say, the third column of the columns_of_interest vector is TRUE it indicates the name was in the experiments_of interest variable. 

Step 3 begins with loading the magrittr package to get the %>% operator, which will allow piping. We then use R indexing with the binary columns_of_interest factor to select the names of columns we want and send it to the forcats as_factor() function to get a factor object for our grouping variable. Sample grouping information is basically a factor that tells us which samples are replications of the same thing and it's important for the experimental design description. We need to create a grouping vector, each index of which refers to a column in the counts table. So, in the following example, the first three columns in the data would be replicates of one sample, the second three columns in the counts table would be replicates of a different replicate, and so on. We can use any symbols in the grouping vector to represent the groups. The more complicated the grouping vector, the more complicated the experiment design can be. In the recipe here, we'll use a simple test/control design:

numeric_groups <- c(1,1,1,2,2,2)
letter_groups <- c("A","A","A", "B","B","B")

A simple vector like this will do, but you can also use a factor object. The factor is R's categorical data type and is implemented as a vector of integers that have associated name labels, called levels. When a factor is displayed, the name labels are taken instead of the integers. The factor object has a memory of sorts, and even when a subset of levels is used, all of the levels that could have been used are retained so that when, for example, the levels are used as categories, empty levels can still be displayed.

In Step 4, we use indexing to extract the columns of data we want to actually analyze.

By Step 5, our preparatory work is done and we can build the DGEList object we need to do differential analysis. To start, we load the edgeR library and use the DGEList() function on counts_of_interest and our grouping object.

In Step 6, with DGEList, we can go through the edgeR process. First, we create the experimental design descriptor design object with the base model.matrix() function. A model design is required to tell the functions how to compare samples; this is a common thing in R and so has a base function. We use the grouping variable we created. We must estimate the dispersions of each gene with the estimateDisp() function, then we can use that measure of variability in tests. Finally, a generalized linear model is fit and the quasi-likelihood F-test is applied with the two uses of glmQLFTest(), first with the dispersal estimates, eset_dge, then with the resulting fit object.

We can use the topTags() function to see the details of differentially expressed genes. We get the following output:

 ## Coefficient: groupingL2Larvae
## logFC logCPM F PValue FDR
## FBgn0027527 6.318665 11.14876 42854.72 1.132951e-41 1.684584e-37
## [ reached 'max' / getOption("max.print") -- omitted 9 rows ]

The columns show the gene name, the logFC value of the gene, the F value, the P value and the False Detection Rate (FDR). Usually, the column we want to make statistical conclusions from is FDR.

Using edgeR from an ExpressionSet object

In Step 1, we are looking at using edgeR from our prepared eset object. We first load that in, using the base R function as it is stored in a standard Rdata format file.

In Step 2, we prepare the vector of experiments of interest. This works as in step 2, except that we don't need to look at the pheno_data object we created from a file; instead, we can use the eset function, phenoData(), to extract the phenotype data straight from the eset object (note that this is one of the major differences between eset and the count matrix—see this book's Appendix for further information).

In Step 3, we create the grouping factor. Again, this can be done by using the phenoData() extraction function, but, as it returns a factor, we need to drop the levels that aren't selected using the droplevels() function (see the How it works... section in the Estimating differential expression with edgeR recipe, step 3 from the previous method, for a brief discussion of factor objects).

In step 4, we extract the data for the columns we are interested in into a standard matrix object. Again, we have a dedicated function, exprs(), for extracting the expression values from eset, and we can subset that using column indexing with column_names.

In Step 5, we use the DGEList() constructor function to build the data structure for edgeR and in step 6, carry out the analysis. This step is identical to Step 6 of the first method.


Estimating differential expression with DESeq2

The DESeq2 package is a method for differential analysis of count data, so it is ideal for RNAseq (and other count-style data such as ChIPSeq). It uses dispersion estimates and relative expression changes to strengthen estimates and modeling with an emphasis on improving gene ranking in results tables. DESeq2 differs from edgeR in that it uses a geometric style normalization in which the per lane scaling factor is computed as the median of the ratios of the gene count over its geometric mean ratio, whereas edgeR uses the weighted one. The two normalization strategies are not mutually exclusive and both make different assumptions about the data. As with any RNAseq or large scale experiment, there is never an "out-of-the-box" best answer. You'll end up testing these methods and maybe others and closely examining results from control genes and cross-validation experiments to see which performs best. The performance will depend greatly on the particular dataset at hand, so the flexible approach we learn here will give you a good idea of how to test the different solutions for yourself.

The process we'll look at in this recipe is somewhat similar to that for edgeR in the preceding Recipe 1. We can use both ExpressionSets and count tables as input to DESeq2 and, when we've prepared them, we have a different set of functions to use to get our data into a DESeqDataSet, not the DGEList as with edgeR.

Getting ready

As in Recipe 1, the data is provided as both a count matrix and an ExpressionSet object and you can see the Appendix at the end of this book for further information on these object types. The data is in this book's code and data repository at https://github.com/PacktPublishing/R_Bioinformatics_Cookbook under datasets/ch1/modencodefly_eset.RData , datasets/ch1/modencodefly_count_table.txt, and datasets/ch1/modencodelfy_phenodata.txt. Again, we'll use readr and magrittr and, from Bioconductor, SummarizedExperiement, and DESeq2.

How to do it...

Estimating differential expressions with DESeq2 can be done in two ways, as shown in the following section.

Using DESeq2 from a count matrix

Estimating differential expressions with DESeq2 from a count table (for example, in a text file), we will use the following steps:

  1. Load count data:
count_dataframe <- readr::read_tsv(file.path(getwd(), "datasets", "ch1", "modencodefly_count_table.txt" ))
genes <- count_dataframe[['gene']]
count_dataframe[['gene']] <- NULL
count_matrix <- as.matrix(count_dataframe)
rownames(count_matrix) <- genes
pheno_data <- readr::read_table2(file.path(getwd(), "datasets", "ch1", "modencodefly_phenodata.txt"))
  1. Specify experiments of interest:
experiments_of_interest <- c("L1Larvae", "L2Larvae")
columns_of_interest <- which( pheno_data[['stage']] %in% experiments_of_interest )
  1. Form the grouping factor:
grouping <- pheno_data[['stage']][columns_of_interest] %>%
  1. Form the subset of count data:
counts_of_interest <- count_matrix[,columns_of_interest]
  1. Build the DESeq object:
dds <- DESeqDataSetFromMatrix(countData = counts_of_interest,
colData = grouping,
design = ~ stage)
  1. Carry out the analysis:
dds <- DESeq(dds)
  1. Extract the results:
res <- results(dds, contrast=c("stage","L2Larvae","L1Larvae"))

Using DESeq2 from an ExpressionSet object

To estimate differential expressions with DESeq2 from an ExpressionSet object, we will use the following steps:

  1. Load the eset data and convert into DESeqDataSet():
load(file.path(getwd(), "datasets/ch1/modencodefly_eset.RData"))
summ_exp <- makeSummarizedExperimentFromExpressionSet(modencodefly.eset)
ddsSE <- DESeqDataSet(summ_exp, design= ~ stage)
  1. Carry out analysis and extract results:
ddsSE <- DESeq(ddsSE)
resSE <- results(ddsSE, contrast=c("stage","L2Larvae","L1Larvae"))

How it works...

In the first section of this recipe, we used DESeq1 starting with our data in a text file; as you'll notice steps 1 to 4 are identical to those in the previous section.

Using DESeq2 from a count matrix

In Step 1, we use the readr package's read_tsv() function to load the tab-delimited text file of counts into a dataframe called count_dataframe. Then, from that, we extract the 'gene' column to a new variable, genesand erase it from count_dataframe, by assigning NULL. This is all done so we can easily convert into the count_matrix matrix with the base as.matrix() function and add the gene information back as rownames. Finally, we load the phenotype data we'll need from the file using the readr read_table2() function. 

Step 2 is concerned with working out which columns in count_matrix we want to use. We define a variable, experiments_of_interestthat holds the column names we want and then use the %in% operator and which() functions to create a binary vector that matches the number of columns. If, say the third column of the columns_of_interest vector is 'TRUE', it indicates the name was in the experiments_of interest variable. 

Step 3 begins with loading the magrittr package to get the %>% operator, which will allow piping. We then use R indexing with the binary columns_of_interest factor to select the names of columns we want and send it to the forcats as_factor() function to get a factor object for our grouping variable. Sample grouping information is basically a factor that tells us which samples are replications of the same thing and it's important for the experimental design description. You can see an expanded description of these grouping/factor objects in step 3 in Recipe 1.

In Step 4, we use indexing to extract the columns of data we want to actually analyze.

By Step 5, we are into the actual analysis section. First, we convert our matrix of counts into a DESeqDataSet object; this can be done with the conversion function, DESeqDataSetFromMatrix(), passing in the counts, the groups, and a design. The design is in the form of an R formula, hence, the ~ stage annotation.

In Step 6, we perform the actual analysis using the DESeq() function on the dds DESeqDataSet object and in Step 7, we get the results into the res variable using the results() function. The output has the following six columns:

baseMean log2FoldChange lfcSE stat pvalue padj

This shows the mean counts, the log2 fold change between samples for a gene, the standard error of the log2 fold change, the Wald statistic, and the raw and adjusted P value. The padj column for adjusted P values is the one most commonly used for concluding about significance.

Using DESeq2 from an ExpressionSet object

Steps 1 and 2 show how to do the same procedure starting from the eset object. It only takes two short steps because DESeq2 is set up to work a lot more nicely with Bioconductor objects than edgeR is. In step 8, we load the eset data with the load() function. Then we use the makeSummarizedExperimentFromExpressionSet() function from the SummarizedExperiment Bioconductor package to convert eset into SummarizedExperiment, which can be used directly in the DESeq() function in step 9. This step works exactly as steps 6 and 7


Power analysis with powsimR

An important preliminary to any experiment is assessing the power of the experimental design to optimize statistical sensitivity. In essence, a power analysis can tell us the number of replicates required to determine an effect size of a given magnitude for a given amount of experimental variability.

We'll use the powsimR package, which is not part of Bioconductor, to perform two types of power analysis. Both of these will be with a small real dataset, but first, we'll do it with two treatments—a test and control—then, we'll do it with just one. With each, we'll estimate the number of replicates we need to spot differences in gene expression of a particular magnitude—if they're present. powsimR takes a simulation-based approach, effectively generating many datasets and evaluating the detection power in each to create a distribution of detection power. The first step, then, is to estimate some parameters for these simulations—for this, we'll need some sample or preliminary data. After that, we can run simulations and assess power.

Getting ready

The dataset for this recipe will be a test or control RNAseq experiment from Arabidopsis with three replicates each. These are available as a prepared count matrix in datasets/ch1/arabidopsis.RDS in this book's data repository. In this section, we'll use a set of counts in a simple test or control experiment from Arabidopsis thaliana. The matrix has six columns (three mock treatments and three hrcc treatments) and 26,222 rows, each a gene feature. We'll need the dplyr, extRemes, and powsimR packages for this code.

Our package of interest, powsimR, isn't on CRAN; it's hosted as a source on GitHub at https://github.com/bvieth/powsimRYou'll need to use devtools to install it, which can be done using the following code:


If you do this, there is a chance that this package will still fail to install. It has a lot of dependencies and you might need to install those manually; there is further information on the package GitHub repository and you should check that for the latest information. At the time of writing, you'll need to do the following two big steps. First, create the ipak function outlined here, then run the three different package installation steps with the ipak function:

ipak <- function(pkg, repository = c("CRAN", "Bioconductor", "github")) {
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    # new.pkg <- pkg
    if (length(new.pkg)) {
        if (repository == "CRAN") {
            install.packages(new.pkg, dependencies = TRUE)
        if (repository == "Bioconductor") {
            if (strsplit(version[["version.string"]], " ")[[1]][3] > "3.5.0") {
                if (!requireNamespace("BiocManager")) {
                BiocManager::install(new.pkg, dependencies = TRUE, ask = FALSE)
            if (strsplit(version[["version.string"]], " ")[[1]][3] < "3.5.0") {
                biocLite(new.pkg, dependencies = TRUE, ask = FALSE)
        if (repository == "github") {
            devtools::install_github(new.pkg, build_vignettes = FALSE, force = FALSE, 
                dependencies = TRUE)

# CRAN PACKAGES cranpackages <- c("broom", "cobs", "cowplot", "data.table", "devtools", "doParallel", "dplyr", "drc", "DrImpute", "fastICA", "fitdistrplus", "foreach", "gamlss.dist", "ggExtra", "ggplot2", "ggthemes", "grDevices", "glmnet", "grid", "gtools", "Hmisc", "kernlab", "MASS", "MBESS", "matrixStats", "mclust", "methods", "minpack.lm", "moments", "msir", "NBPSeq", "nonnest2", "parallel", "penalized", "plyr", "pscl", "reshape2", "Rmagic", "rsvd", "Rtsne", "scales", "Seurat", "snow", "stats", "tibble", "tidyr", "VGAM", "ZIM")
ipak(cranpackages, repository = "CRAN") # BIOCONDUCTOR biocpackages <- c("AnnotationDbi", "bayNorm", "baySeq", "Biobase", "BiocGenerics", "BiocParallel", "DEDS", "DESeq2", "EBSeq", "edgeR", "IHW", "iCOBRA", "limma", "Linnorm", "MAST", "monocle", "NOISeq", "qvalue", "ROTS", "RUVSeq", "S4Vectors", "scater", "scDD", "scde", "scone", "scran", "SCnorm", "SingleCellExperiment", "SummarizedExperiment", "zinbwave") ipak(biocpackages, repository = "Bioconductor") # GITHUB githubpackages <- c("nghiavtr/BPSC", "cz-ye/DECENT", "mohuangx/SAVER", "statOmics/zingeR") ipak(githubpackages, repository = "github")

When this is done, you should be able to install the package we're after with this code:

devtools::install_github("bvieth/powsimR", build_vignettes = TRUE, dependencies = FALSE)
At the moment, for this to work, you also need to manually load dplyr

How to do it...

We will do the power analysis using the following steps:

  1. Estimate simulation parameter values:
arab_data <- readRDS(file.path(getwd(), "datasets", "ch1", "arabidopsis.RDS" ))
means_mock <- rowMeans(arab_data[, c("mock1", "mock2", "mock3")])
means_hrcc <- rowMeans(arab_data[, c("hrcc1", "hrcc2", "hrcc3")])
log2fc <- log2(means_hrcc / means_mock)
prop_de <- sum(abs(log2fc) > 2) / length(log2fc)
  1. Examine the distribution of the log2 fold change ratios:
finite_log2fc <-log2fc[is.finite(log2fc)]
extRemes::qqnorm(finite_log2fc )
  1. Set up parameter values for the simulation run:

params <- estimateParam(
countData = arab_data,
Distribution = "NB",
RNAseq = "bulk",
normalization = "TMM" # edgeR method, can be others

de_opts <- DESetup(ngenes=1000,
p.DE = prop_de,
pLFC= finite_log2fc,
sim.seed = 58673

sim_opts <- SimSetup(
desetup = de_opts,
params = params

num_replicates <- c(2, 3, 5, 8, 12,15)
  1. Run the simulation:
 simDE <- simulateDE(n1 = num_replicates,
n2 = num_replicates,
sim.settings = sim_opts,
DEmethod = "edgeR-LRT",
normalization = "TMM",
verbose = FALSE)
  1. Run the evaluation of the simulation:
 evalDE <- evaluateDE(simRes = simDE,
alpha.type = 'adjusted',
MTC = 'BH',
alpha.nominal = 0.1,
stratify.by = 'mean',
filter.by = 'none',
strata.filtered = 1,
target.by = 'lfc',
delta = 0)
  1. Plot the evaluation:
 plotEvalDE(evalRes = evalDE,
quick=FALSE, annot=TRUE)

How it works...

Power analysis in powsimR requires us to do some pre-analysis so that we have estimates for some important parameters. To perform a simulation-based power analysis, we need to estimate the distribution of log fold changes between treatments and the proportion of features that are differentially expressed.

In step 1, we'll get the mean counts for each feature in the two treatments. After loading the expression data using the readRDS() function, we use the rowMeans() function on certain columns to get the mean expression counts of each gene in both the mock and hrcc1 treatments. We can then get the log2 ratio of those (by simply dividing the two vectors and, in the last line, use standard arithmetical operators to work out those that have a log2 fold change greater than 2). Inspecting the final prop_de variable gives the following output:

## [1] 0.2001754

So, a proportion of about 0.2 of the features have counts changing by log2 twofold.

Step 2 looks at the distribution of the gene expression ratios. We first remove the non-finite ratios from the log2fc variable. We must do this because, when calculating ratios, we generate Inf values in R; this occurs when the denominator (the mock sample) has zero mean counts. We can remove them using indexing on the vector with the binary vector that comes from is.finite() function. With the Inf values removed, we can plot. First, we do a normal density plot using the density() function, which shows the distribution of ratios. Then, we use the qqnorm() function in the extRemes package, which plots the data against data sampled from an idealized normal distribution with the same mean. A strong, linear correlation indicates a normal distribution in the original data. We can see the output in the following screenshot:

They look pretty log-normally distributed, so we can assume a log-normal distribution. 

The longest step here, step 3, is actually only four lines. We are basically setting up the parameters for the simulation, which requires us to specify a lot of values. The first set, params, which we create with the estimateParam() function needs the data source (countData), the distribution to use (we set Distribution = "NB", which selects the negative binomial); the type of RNAseq experiment—ours is a bulk RNAseq experiment (RNAseq = "bulk"), and normalization strategy—we use the edgeR style TMM (normalization = "TMM"). The second set, desetup, is created with the DESetup() function; in this, we choose the parameters relating to the number of genes for which to simulate differential expression. We set up 1,000 total gene simulations (ngenes) and 25 simulation runs (nsims). We set the proportion to be differentially expressed to that estimated in step 1 in prop_de. We use the vector of fold changes, finite_log2fc, as input for the pLFC parameter. Setting sim.seed is not necessary but will ensure reproducibility between runs. The third line uses the SimSetup() function to combine params and desetup into a single object, sim_opts. Finally, we create a num_replicates vector specifying the number of biological replicates (RNA samples) to simulate.

Step 4 is relatively straightforward: we run the differential expression simulation using the sim_opts parameters created in the previous steps, choosing "edgeR-LRT" as the differential expression method and "TMM" as the normalization. The simulation data is stored in the simDE variable. 

In Step 5, we create an evaluation of the simulation—this analyzes and extracts various statistics. We pass the simDE simulation data to the evaluateDE() function along with values for things pertaining to grouping, filtering, and significance.

Finally, in Step 6, we can plot the evalDE object from Step 5 and see the results of the simulation. We get the following plot in which we can see the different powers at different replicate numbers. Note the x-axis indicates the number of replicate RNA samples used, and the metrics include FDR, False Negative/Positive Rate (FNR/FPR), and TNR/TPR (True Negative/Positive Rate):

There's more...

When we have only one sample (or maybe even just one replicate), we have a hard time estimating the log2 fold change distribution and the number of differentially expressed genes. In place of estimates, we can use a callback function to generate numbers as needed. The body of the function just needs to return numbers from a specified distribution with parameters you decide. Here, we'll build a function that returns numbers with a normal distribution of mean 0 and standard deviation 2. This reflects that we think the log fold change distribution is normal with these parameters. When we've built the function, it gets used in the DESetup() function in place of the vector of log2 fold changes. For the proportion of genes differentially expressed, we just have to guess or take an estimate from something we already know about the experimental system:

log2fc_func <- function(x){ rnorm(x, 0, 2)} 
prop_de = 0.1
de_opts <- DESetup(ngenes=1000,
p.DE = prop_de,
pLFC= log2fc_func,
sim.seed = 58673

Finding unannotated transcribed regions

A common challenge is to find and count reads that have aligned outside of annotated regions. In an RNAseq experiment, these reads can represent non-annotated genes and novel transcripts. Essentially, we have some genes we know about and can see that they are transcribed as they have aligned read coverage, but other transcribed regions do not fall in any annotations and we want to know the locations of the alignments of the reads representing them. In this recipe, we'll look at a deceptively straightforward technique for finding such regions.

Getting ready

Our dataset will be a synthetic one that has a small 6,000 bp genome region and two gene features with reads and a third unannotated region with aligning reads, as shown in the following screenshot:

We'll need the Bioconductor csaw, IRangesSummarizedExperiment, and rtracklayer libraries and some functions from other packages that are part of base Bioconductor. The data is in this book's data repository under datasets/ch1/windows.bam and datasets/ch1/genes.gff

How to do it...

Power analysis with powsimR can be done in the following steps:

  1. Set up a loading function:
get_annotated_regions_from_gff <- function(file_name) { 
gff <- rtracklayer::import.gff(file_name)
as(gff, "GRanges")
  1. Get counts in windows across the whole genome:
whole_genome <- csaw::windowCounts( 
    file.path(getwd(), "datasets", "ch1", "windows.bam"),
    bin = TRUE,
    filter = 0,
    width = 500,
    param = csaw::readParam(
        minq = 20,
        dedup = TRUE,
        pe = "both"
colnames(whole_genome) <- c("small_data")

annotated_regions <- get_annotated_regions_from_gff(file.path(getwd(), "datasets", "ch1", "genes.gff"))
  1. Find overlaps between annotations and our windows, and subset the windows:
windows_in_genes <-IRanges::overlapsAny( SummarizedExperiment::rowRanges(whole_genome), annotated_regions )
  1. Subset the windows into those in annotated and non-annotated regions:
annotated_window_counts <- whole_genome[windows_in_genes,] 
non_annotated_window_counts <- whole_genome[ ! windows_in_genes,]
  1. Get the data out to a count matrix:

How it works...

In step 1, we create a function that will load gene region information in a GFF file (see this book's Appendix for a description of GFF) and convert it into a Bioconductor GRanges object using the rtracklayer package. This recipe works because GRanges objects can be subset, just like a regular R matrix or dataframe. They're an object that is "matrix-like" in that respect and although GRanges is much more complicated than a matrix, it behaves much the same. This allows for some easy manipulations and extractions. We use GRanges extensively throughout this recipe, along with the related class, RangedSummarizedExperiment.

In step 2, we use the csaw windowCounts() function to get counts across the whole genome in 500 bp windows. The width parameter defines the window size, and the param parameter determines what constitutes a passing read; here, we set minimum read quality (minq) to a PHRED score of 20, remove PCR duplicates (dedup = TRUE), and require that both of the pairs of a read are aligned (pe="both"). The returned whole_genome object is RangedSummarizedExperiment. We set the name of the single data column in whole_genome to small_data. Finally, we use the custom function, get_annotated_regions_from_gff(), to make a GRanges object, annotated_regions, of the genes represented in our GFF file.

With Step 3, we use the IRanges overlapsAny() function to check whether the window locations overlap at all with the gene regions. This function requires GRanges objects, so we extract that from the whole_genome variable using the SummarizedExperiment rowRanges() function and pass that along with the existing GRanges object's annotated_regions to overlapsAny(). This returns a binary vector that we can use to do subsetting.

In step 4, we simply use the binary vector, windows_in_genes, to subset the whole_genome object, thereby extracting the annotated windows (into annotated_window_counts) as a GRanges object. Then, we can get the non-annotated windows with the same code but by logically inverting the binary vector using the ! operator. This gives us non_annotated_window_counts.

Finally, in step 5, we can extract the actual counts from the GRanges object using the assay() function.

There's more...

We may need to get annotated regions from other file formats than GFF. rtracklayer supports various formats—here's a function for working with BED files:

get_annotated_regions_from_bed <- function(file_name){ 
bed <- rtracklayer::import.bed(file_name)
as(bed, "GRanges")

Finding regions showing high expression ab initio with bumphunter

Finding regions of read alignments that all come from the same, potentially unannotated, genomic feature is a common task. The aim here is to group read alignments together in such a way that we will be able to mark regions that have significant coverage and then go on to compare samples for differences in expression levels. 

Getting ready...

We'll use the same windows dataset that had one experiment with three peaks into the function that we used in Recipe 4—so we know we're looking for three bumps. The data is in this book's data repository under datasets/ch1/windows.bam. We'll need the Rsamtools and bumphunter libraries.

How to do it...

  1. Load data and get per-position coverage:
pileup_df <- Rsamtools::pileup(file.path(getwd(), "datasets", "ch1", "windows.bam"))
  1. Find preliminary clusters:
clusters <- bumphunter::clusterMaker(pileup_df$seqnames, pileup_df$pos, maxGap = 100) 
  1. Find the bumps with a minimum cutoff:
bumphunter::regionFinder(pileup_df$count, pileup_df$seqnames, pileup_df$pos, clusters, cutoff=1)

How it works...

In step 1, we use Rsamtools pileup() function with default settings to get a per-base coverage dataframe. Each row represents a single nucleotide in the reference and the count column gives the depth of coverage at that point. The result is stored in the pileup_df dataframe.

In step 2, we use the bumphunter clusterMaker() function on pileup_df, which simply groups reads within a certain distance of each other into clusters. We give it the sequence names, positions, and a maximum distance parameter (maxGap). The function returns a vector of cluster numbers of equal length to the dataframe, indicating the cluster membership of each row in the dataframe. If we tabulate with table, we can see the cluster sizes (number of rows) in each cluster:

## clusters ## 1 2 3 ## 1486 1552 1520

In step 3, we refine our approach; we use regionFinder(), which applies a read depth cutoff to ensure a minimum read depth for the clusters. We pass it similar data as in step 2, adding the cluster membership vector clusters and a minimum read cutoff—here, we set to 1 for use with this very small dataset. The result of step 3 is the regions that are clustered together, but in a useful table:

##    chr start  end     value  area cluster indexStart indexEnd    L
## 3 Chr1  4503 5500 10.401974 15811       3       3039     4558 1520
## 1 Chr1   502 1500  9.985868 14839       1          1     1486 1486
## 2 Chr1  2501 3500  8.657216 13436       2       1487     3038 1552

In these region predictions, we can clearly see the three regions containing reads that are in that data, give or take a nucleotide or two.

There's more...

If you have multiple experiments to analyze, try the bumphunter() function. This will operate over multiple data columns in a matrix and perform linear modeling to assess uncertainty about the position and existence from the replicates; it is very similar to regionFinder() in operation.


Differential peak analysis

When you've discovered unannotated transcripts you may want to see whether they are differentially expressed between experiments. We've already looked at how we might do that with edgeR and DESeq, but one problem is going from an object such as a RangedSummarizedExperiment, comprised of the data and a GRanges object that describes the peak regions, to the internal DESeq object. In this recipe, we'll look at how we can summarise the data in those objects and get them into the correct format.

Getting ready

For this recipe, you'll need the RangedSummarizedExperiment version of the Arabidopsis thaliana RNAseq in datasets/ch1/arabidopsis_rse.RDS in this book's repository. We'll use the DESeq and SummarizedExperiment Bioconductor packages we used earlier too. 

How to do it...

  1. Load data and set up a function that creates region tags:
arab_rse <- readRDS(file.path(getwd(), "datasets", "ch1", "arabidopsis_rse.RDS") )

make_tag <- function(grange_obj){
[email protected],
[email protected]@start,
([email protected]@start + [email protected]@width)
  1. Extract data and annotate rows:
counts <- assay(arab_rse)

if ( ! is.null(names(rowRanges(arab_rse))) ){
  rownames(counts) <- names(rowRanges(arab_rse))
} else {
  rownames(counts) <- make_tag(rowRanges(arab_rse))

How it works...

Step 1 starts by loading in our pre-prepared RangedSummarized experiment; note that the names slot of the GRanges object in there is not populated. We next create a custom function, make_tag(), which works by pasting together seqnamesstarts and the computed end (start + width) from a passed GRanges object. Note the @ sign syntax: this is used because GRange is an S4 object and the slots are accessed with @ rather than the more familiar $

In step 2, the code pulls out the actual data from RangedSummarizedExperiment using the assay() function. The matrix returned has no row names, which is unuseful, so we use the if clause to check the names slot—we use that as row names if it's available; if it, isn't we make a row name tag using the position information in the GRanges object in the make_tag() function we have created. This will give the following outputa count matrix that has the location tag as the row name that can be used in DESeq and edgeR as described in Recipes 1 and 2 in this chapter:

## mock1 mock2 mock3 hrcc1 hrcc2 hrcc3 ## Chr1:3631-5900 35 77 40 46 64 60 ## Chr1:5928-8738 43 45 32 43 39 49 ## Chr1:11649-13715 16 24 26 27 35 20 ## Chr1:23146-31228 72 43 64 66 25 90 ## Chr1:31170-33154 49 78 90 67 45 60 ## Chr1:33379-37872 0 15 2 0 21 8

Estimating batch effects using SVA

High throughput data such as RNAseq is often complicated by technical errors that are not explicitly modeled in the experimental design and can confound the detection of differential expression. Differences in counts from samples run on different days or different locations or on different machines are an example of a technical error that is very often present and which we should try to model in our experimental design. An approach to this is to build a surrogate variable into our experimental design that explains the batch effect and take it into account in the modeling and differential expression analysis stages. We'll use the SVA package to do this.

Getting ready

We'll need the SVA package from Bioconductor and our Arabidopsis count data in datasets/ch1/arabidopsis.RDS.

How to do it...

For estimating batch effects using SVA, perform the following steps:

  1. Load the libraries and data:
arab <- readRDS(file.path(getwd(), "datasets", "ch1", "arabidopsis.RDS"))
  1. Filter out rows with too few counts in some experiments:
keep <- apply(arab, 1, function(x) { length(x[x>3])>=2 } )
arab_filtered <- arab[keep,]
  1. Create the initial design:
groups <- as.factor(rep(c("mock", "hrcc"), each=3))
  1. Set up the test and null models and run SVA:
test_model <- model.matrix(~groups)
null_model <- test_model[,1]
svar <- svaseq(arab_filtered, test_model, null_model, n.sv=1)
  1. Extract the surrogate variables to a new design for downstream use:
design <- cbind(test_model, svar$sv)

How it works...

In step 1, the script begins by loading in a count matrix of the Arabidopsis RNAseq data, which you will recall is a simple experiment with three replicates of mock and three of hrcc treatment.

In step 2, we create a vector of row indices that we wish to retain, which we do by testing whether the row has at least two columns with a count of over 3this is done by using apply() and an anonymous function over the rows of the count matrix.

With step 3, we create a groups factor describing the experiment sample types. 

Step 4 is the one that does the work; we use the groups factor in model.matrix() to create the model design we want to use. We also need a null model, which, in this experimental design, is equivalent to the first column, so we extract that from the test_model design object. We can then use the key svaseq() function to estimate the surrogate variable to add to our design. We add in test_model and null_model and select the number of surrogate variables to use with n.sv, which should be one for a simple design like this.

The final bit, step 5, is to add the surrogate variable to the design model, which we do by binding test_model and the sv column of svar (svsar$sv) together. The final design object can then be used in packages such as edgeR and DESeq2 as any other and those methods will use the surrogate variable to deal with batch effects.


Finding allele-specific expressions with AllelicImbalance

An allele-specific expression is a situation that occurs when there is a differential abundance of different allelic variants of a transcript. RNAseq can help to provide quantitative estimates of allele-specific expression for genes with transcribed polymorphismsthat is, variants in the transcript that may result in different proteins. In this recipe, we'll take a look at a method for determining which of the variants of a transcript may have preferential expressions in different samples. The reads will come from different .bam files and the variants must already be known. This implies that you have already carried out a read alignment and a variant call step and have per sample .bam and .vcf files. We'll use the AllelicImbalance and VariantAnnotation packages for this recipe.

Getting ready

You'll need AllelicImbalance and VariantAnnotation from Bioconductor. The AllelicImbalance package provides a small but informative dataset of three SNPs on Chromosome 17 of the hg19 build of the human genome. The files have been extracted into this book's data repository in datasets/ch1/allele_expression 

How to do it...

  1. Load libraries and set up an import folder:

region_of_interest <- GRanges(seqnames = c("17"), ranges = IRanges(79478301, 79478361)) 
bam_folder <- file.path(getwd(), "datasets", "ch1", "allele_expression")
  1. Load reads and variants in regions of interest:
reads <- impBamGAL(bam_folder, region_of_interest, verbose = FALSE)

vcf_file <-file.path( getwd(), "datasets", "ch1", "allele_expression","ERP000101.vcf" )
variant_positions <- granges(VariantAnnotation::readVcf(vcf_file), "hg19" )

allele_counts <- getAlleleCounts(reads, variant_positions, verbose=FALSE)
  1. Build the ASEset object:
ase.vcf <- ASEsetFromCountList(rowRanges = variant_positions, allele_counts)

reference_sequence <- file.path(getwd(), "datasets", "ch1", "allele_expression", "hg19.chr17.subset.fa")

ref(ase.vcf) <- refAllele(ase.vcf,fasta=reference_sequence)
alt(ase.vcf) <- inferAltAllele(ase.vcf)
  1. Run the test on all variants:
binom.test(ase.vcf, n="*")

How it works...

In step 1, the script begins by creating the familar GRanges object describing our region of interest and the folder holding the .bam files of reads.

In step 2, the impBamGAL() function loads in reads in the region of interest. The variant information is loaded into variant_positions—another GRanges object and the reads and variants are used to make allele counts with getAlleleCounts()

With this done, in step 3, we can build the ASESet object, ase.vcf (a class that inherits from RangedSummarizedExperiment), using the constructor function, ASEsetFromCountList(); we then use the setter functions, ref() and alt(), to apply the reference and alternative base identities.

Finally, in step 4, we can apply tests. binom.test() carries out binomial per position per sample (.bam file) tests for deviations from equality in counts in reference and alternative alleles. The parameter n tells the test which strand to consider—in this example, we haven't set up per-strand information, so we use "*" to ignore strandedness. 

This will give the following output:

##               chr17_79478287 chr17_79478331 chr17_79478334
## ERR009113.bam          0.500   1.000000e+00   1.000000e+00
## ERR009115.bam         0.125   6.103516e-05   3.051758e-05

There's more...

The preceding analysis can be extended to carry out per strand and per phenotype tests if required. The script would need amending to introduce strand information in the ASESet object construction step. Doing so usually requires that the RNAseq experiment and alignment steps were performed with strandedness in mind and the bioinformatics pipeline up to here configured accordingly. Phenotype information can be added in the construction step using the colData parameter and a vector of phenotype or sample types for columns in the ASESet object.


Plotting and presenting RNAseq data

Plotting the RNAseq data en masse or for individual genes or features is an important step in QC and understanding. In this recipe, we'll see how to make gene count plots in samples of interest, how to create an MA plot that plots counts against fold change and allows us to spot expression-related sample bias, and how to create a volcano plot that plots significance against fold change and allows us to spot the most meaningful changes easily.

Getting ready

In this recipe, we'll use the DESeq2 package, the ggplot2 package, magrittr, and dplyr. We'll use the DESeqDataSet object we created for the modencodefly data in Recipe 2a saved version is in the datasets/ch1/modencode_dds.RDS file in this book's data repository.

How to do it...

  1. Load libraries and create a dataframe of RNAseq results:

dds <- readRDS("~/Desktop/r_book/datasets/ch1/modencode_dds.RDS")
  1. Create a boxplot of counts for a single gene, conditioned on "stage":
plotCounts(dds, gene="FBgn0000014", intgroup = "stage", returnData = TRUE) %>%
  ggplot() + aes(stage, count) + geom_boxplot(aes(fill=stage)) + scale_y_log10() + theme_bw()
  1. Create an MA plot with coloring conditioned on significance:
result_df <- results(dds, contrast=c("stage","L2Larvae","L1Larvae"), tidy= TRUE) %>%

ggplot(result_df) + aes(baseMean, log2FoldChange) + geom_point(aes(colour=is_significant)) + scale_x_log10() + theme_bw()
  1. Create a volcano plot with coloring conditioned on significance:
ggplot(result_df) + aes(log2FoldChange, -1 * log10(pvalue))  + geom_point(aes(colour=is_significant)) + theme_bw()

How it works...

Step 1 is brief and loads the dataset and libraries we'll need.

In Step 2, we take advantage of a couple of useful parameters in the plotCounts() and results() functions from DESeq2. The returnData flag in plotCounts() will optionally return a tidy dataframe of count information for a given gene in a given condition, hence allowing us to send the data through ggplot() to make a boxplot for an individual gene. The magrittr %>% operator allows us to send the return value of plotCounts() straight to the first positional argument of ggplot() without saving in an intermediate variable.

In Step 3, we use the results() function from DESeq2 to get the results dataframe, which we pipe to dplyr mutate() in order to add a new column called is_significant containing TRUE if the value of the padj column is lower than 0.05. We then use the returned result_df dataframe in a ggplot() command to make a scatter plot of baseMean (count) against log2 fold change, with points colored by the is_significant variable, effectively colored by whether the P value is lower than 0.05 or not.

In Step 4, we use the same result_df dataframe to plot log2fold change against the negative log10 of the 'pvalue' to give a 'volcano' plot of the relationship between P and differential expression level:

The preceding three plots are the combined resultant output of these three ggplot() commands.

About the Author

  • Dan MacLean

    Professor Dan MacLean has a Ph.D. in molecular biology from the University of Cambridge and gained postdoctoral experience in genomics and bioinformatics at Stanford University in California. Dan is now a Honorary Professor in the School of Computing Sciences at the University of East Anglia. He has worked in bioinformatics and plant pathogenomics, specializing in R and Bioconductor and developing analytical workflows in bioinformatics, genomics, genetics, image analysis, and proteomics at The Sainsbury Laboratory since 2006. Dan has developed and published software packages in R, Ruby, and Python with over 100,000 downloads combined.

    Browse publications by this author

Recommended For You

Python Machine Learning - Third Edition

Applied machine learning with a solid foundation in theory. Revised and expanded for TensorFlow 2, GANs, and reinforcement learning.

By Sebastian Raschka and 1 more
Advanced Deep Learning with Python

Gain expertise in advanced deep learning domains such as neural networks, meta-learning, graph neural networks, and memory augmented neural networks using the Python ecosystem

By Ivan Vasilev
Learning Geospatial Analysis with Python - Third Edition

Learn the core concepts of geospatial data analysis for building actionable and insightful GIS applications

By Joel Lawhead
Python GUI Programming Cookbook - Third Edition

Over 90 recipes to help you develop widgets, forms, layouts, charts, and much more using the latest features of Python 3

By Burkhard Meier