You're reading from  R Bioinformatics Cookbook - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781837634279
Edition: 2nd Edition
Author: Dan MacLean

Professor Dan MacLean has a PhD in molecular biology from the University of Cambridge and gained postdoctoral experience in genomics and bioinformatics at Stanford University in California. Dan is now an honorary professor at the School of Computing Sciences at the University of East Anglia. He has worked in bioinformatics and plant pathogenomics, specializing in R and Bioconductor, and has developed analytical workflows in bioinformatics, genomics, genetics, image analysis, and proteomics at the Sainsbury Laboratory since 2006. Dan has developed and published software packages in R, Ruby, and Python, with over 100,000 downloads combined.

Phylogenetic Analysis and Visualization

Phylogenetics is the study of the evolutionary relationships among species or other groups of organisms. It involves the use of molecular and computational techniques to construct phylogenetic trees, which depict the evolutionary history of the organisms under study.

In bioinformatics, phylogenetics is studied using various computational tools and methods, including sequence alignment, distance-based methods, maximum likelihood, and Bayesian inference. These methods allow researchers to compare DNA or protein sequences from different organisms and infer their evolutionary relationships based on similarities and differences in their genetic makeup. Phylogenetics has many applications in biology: it is used to help understand the evolutionary history of species, to trace the origins and spread of infectious diseases, and to inform conservation efforts by identifying...

Technical requirements

We will use renv to manage packages in a project-specific way. To use renv to install packages, you will first need to install the renv package. You can do this by running the following commands in your R console:

  1. Install renv:
    install.packages("renv")
  2. Create a new renv environment:
    renv::init()

    This will create a new directory called renv in your current project directory.

  3. You can then install packages with the following:
    renv::install("package name")
  4. You can also use the renv package manager to install Bioconductor packages by running the following command:
    renv::install("bioc::package name")
  5. For example, to install the Biobase package, you would run this:
    renv::install("bioc::Biobase")
  6. You can use renv to install development packages from GitHub like this:
    renv::install("user name/repo name")
  7. For example, to install the rbioinfcookbook package from the danmaclean GitHub account, you would run this:
    renv::install("danmaclean...

Reading and writing varied tree formats with ape and treeio

Phylogenetic analysis is a cornerstone of biology and bioinformatics. The programs are diverse and complex, the computations are long-running, and the datasets are often large. Many programs are standalone and many have proprietary input and output formats. This has created a very complex ecosystem that we must navigate when dealing with phylogenetic data, and often the simplest strategy is to use combinations of tools to load, convert, and save the results of analyses so that they can be used in different packages. In this recipe, we’ll look at dealing with phylogenetic tree data in R. To date, R support for the wide range of tree formats is limited, but a few key packages provide standardized objects, so workflows can focus on a few types and conversion to those types is streamlined. We’ll look at using the ape and treeio packages to get tree data into and out of R.
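As a minimal sketch of the load, convert, and save round trip described above, the following assumes only the ape package. The Newick string, its tip names, and the temporary file paths are invented for illustration and are not part of the recipe.

```r
library(ape)

# Illustrative tree only; tip names and branch lengths are invented
newick <- "((A:1,B:1):1,(C:1,D:1):1);"
tree <- read.tree(text = newick)

# Write the same tree in two formats to temporary files
nwk_file <- tempfile(fileext = ".nwk")
nex_file <- tempfile(fileext = ".nex")
write.tree(tree, file = nwk_file)
write.nexus(tree, file = nex_file)

# Read the Nexus copy back and check that the tips survived the round trip
tree_from_nexus <- read.nexus(nex_file)
same_tips <- setequal(tree$tip.label, tree_from_nexus$tip.label)
```

Here both formats pass through the same phylo object, which is the kind of standardized type the paragraph refers to.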

Getting...

Visualizing trees of many genes quickly with ggtree

Once you have computed a tree, the first thing you will want to do with it is take a look. That’s possible in many programs, but R has an extremely powerful, flexible, and fast system in the form of the ggtree package. In this recipe, we’ll learn how to get trees into ggtree and re-layout, highlight, and annotate tree images in just a few commands.
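A rough sketch of what re-layout, annotation, and highlighting can look like is given here, assuming the ape and ggtree packages are installed and working. The five-tip tree and the choice of clade are made up for illustration; the recipe itself works with a much larger tree.

```r
library(ape)
library(ggtree)

# Illustrative five-tip tree, invented for this sketch
tree <- read.tree(text = "(((A:1,B:1):1,C:2):1,(D:1,E:1):2);")

# Node number of the most recent common ancestor of tips D and E
clade_node <- getMRCA(tree, c("D", "E"))

# Circular layout, tip labels, and one highlighted clade
p <- ggtree(tree, layout = "circular") +
  geom_tiplab(size = 3) +
  geom_hilight(node = clade_node, fill = "steelblue", alpha = 0.3)
```

Printing p draws the figure. Each layer is added with +, as in ggplot2, so layouts and annotations can be swapped independently.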

Getting ready

You’ll need the ggplot2, ggtree, and ape packages. You’ll also require the itol.nwk file from the rbioinfcookbook package. The file is a Newick tree of 191 species from the Interactive Tree of Life online tool’s public dataset. At the time of writing, there is an issue with an upstream dependency that causes this code to fail, though it is correct. We hope this will have been resolved by the time you read this. If it hasn’t, a workaround is to install the source version of ggtree using BiocManager, like this:

BiocManager::install("...

Quantifying and estimating the differences between trees with treespace

Comparing trees to differentiate or group them can help researchers see patterns of evolution. Multiple trees of a single gene tracked across species or strains can reveal differences in how that gene is changing across species. At the core of these approaches are metrics of distance between trees. In this recipe, we’ll calculate one such metric to find pairwise differences between the trees of 20 different genes in 15 different species, so each tree has the same 15 tips with identical names. Trees must share their tips in this way for distances between them to be computed, and we can’t do an analysis like this unless that condition is met.
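To make the idea of a pairwise tree distance concrete, here is a small sketch that swaps in the topological distance from ape's dist.topo() rather than the treespace metric used in the recipe. The three trees and their names (gene1 to gene3) are invented, and they deliberately share the same five tip labels.

```r
library(ape)

# Three illustrative trees sharing the same five tip names
t1 <- read.tree(text = "(((A,B),C),(D,E));")
t2 <- read.tree(text = "(((A,C),B),(D,E));")
t3 <- read.tree(text = "(((A,B),C),(D,E));")  # same topology as t1

# The comparison only makes sense if every tree carries the same tips
same_tips <- all(vapply(
  list(t1, t2, t3),
  function(tr) setequal(tr$tip.label, t1$tip.label),
  logical(1)
))

trees <- c(t1, t2, t3)
names(trees) <- c("gene1", "gene2", "gene3")

# Pairwise topological distances (Penny and Hendy 1985 method)
d <- as.matrix(dist.topo(trees, method = "PH85"))
```

The resulting matrix has one row and column per tree. Identical topologies are at distance zero, and a matrix like this is the usual starting point for clustering or ordination.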

Getting ready

For this recipe, we’ll use the treespace package to compute distances and clusters. We’ll use ape and adegraphics for accessory loading and visualization functions. The input data will be 20 files of Newick format trees, each of which represents a...

Extracting and working with subtrees using ape

A common but often frustrating task is cropping trees to look at a section in a new, clearer context, or combining them with another tree in order to present two distant clades more clearly. In this short recipe, we’ll look at how easy it can be to manipulate trees, specifically how to pull out a subtree as a new object and how to combine trees into other trees. We’ll use the ape package, the phylogenetic workhorse in R, which gives us the functionality to complete those tasks easily.
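The two operations named above, pulling out a subtree and combining trees, might be sketched as follows with ape. The small tree here is invented for illustration and is not the recipe's example file.

```r
library(ape)

# Illustrative tree, invented for this sketch
tree <- read.tree(
  text = "(((human:1,chimp:1):1,gorilla:2):2,(mouse:1,rat:1):3);"
)

# Pull out a clade as a new tree via its most recent common ancestor
ape_node <- getMRCA(tree, c("human", "gorilla"))
apes <- extract.clade(tree, ape_node)

rodent_node <- getMRCA(tree, c("mouse", "rat"))
rodents <- extract.clade(tree, rodent_node)

# Crop a tree by dropping a tip instead
no_gorilla <- drop.tip(tree, "gorilla")

# Graft one tree onto the root of another
combined <- bind.tree(apes, rodents, where = "root")
```

extract.clade() works from an internal node number, which is why getMRCA() is used to look that number up from tip names.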

Getting ready

We’ll need a single example tree – the mammal_tree.nwk file in the rbioinfcookbook package will be fine. All the functions we require can be found in the ape package.

How to do it…

Extracting and working with subtrees in ape can be executed using the following steps:

  1. Load the library and tree:
    library(ape)
    tree_file <- fs::path_package(
      "extdata",
      "...

Creating dot plots for alignment visualizations

Dot plots of pairs of aligned sequences are possibly the oldest alignment visualization. In these plots, the positions of two sequences are plotted on the x axis and y axis, and for every coordinate in that space, a point is drawn if the letters (nucleotides or amino acids) correspond at that (x,y) coordinate. Since the plot can show regions that match that aren’t generally in the same region of the two sequences (as lines away from the diagonal), the plot is a good way to visually spot insertions and deletions and structural rearrangements in the two sequences. In this recipe, we’ll look at a speedy method for constructing a dot plot using the dotplot package and a bit of code for getting a grid plot of all pairwise dot plots for sequences in a file.
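The underlying idea can be illustrated in a few lines of base R. This is a plain sketch of the matching rule described above, not the dotplot package's implementation, and the two short sequences are made up.

```r
# Two short illustrative sequences (made up for this sketch)
seq_x <- strsplit("MKVLAAGIV", "")[[1]]
seq_y <- strsplit("MKVAAGLIV", "")[[1]]

# Logical matrix: TRUE wherever the letter at x[i] equals the letter at y[j]
matches <- outer(seq_x, seq_y, "==")

# Coordinates of every matching (x, y) pair, ready for plotting
dots <- which(matches, arr.ind = TRUE)
colnames(dots) <- c("x", "y")

plot(dots, pch = 15,
     xlab = "Sequence x position", ylab = "Sequence y position")
```

Runs of matches along the main diagonal show regions in the same place in both sequences. Matches off the diagonal, such as the L at position 4 of one sequence and position 7 of the other, show shifted regions.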

Getting ready

We’ll need the bhlh.fa file, which contains three basic helix-loop-helix (bHLH) transcription factor sequences from pea, soy, and lotus. The file...

Reconstructing trees from alignments using phangorn

So far in this chapter, we’ve assumed that trees are already available and ready to use. Of course, there are many ways to make a phylogenetic tree and, in this recipe, we’ll take a look at some of the different methods available.
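As a small illustration of how different methods can be applied to the same data, the sketch below builds two distance-based trees from one distance matrix. It substitutes ape's nj() and base R's hclust() for the phangorn functions used in the recipe, and the distance values and sequence names (s1 to s4) are hypothetical.

```r
library(ape)

# Hypothetical pairwise distances between four sequences
labels <- c("s1", "s2", "s3", "s4")
d <- as.dist(matrix(
  c(0, 2, 6, 6,
    2, 0, 6, 6,
    6, 6, 0, 3,
    6, 6, 3, 0),
  nrow = 4,
  dimnames = list(labels, labels)
))

# Neighbor-joining tree
nj_tree <- nj(d)

# UPGMA-style tree: average-linkage clustering converted to a phylo object
upgma_tree <- as.phylo(hclust(d, method = "average"))
```

Both results are phylo objects, so they can be plotted or compared with the same tools used elsewhere in the chapter.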

Getting ready

For this recipe, we’ll use the abc.fa file of yeast ABC transporter sequences, the Bioconductor Biostrings and msa packages, and the CRAN phangorn package.

How to do it…

Constructing trees using phangorn can be done like this:

  1. Load in the libraries and sequences and make an alignment:
    library(Biostrings)
    library(msa)
    library(phangorn)
    seqfile <- fs::path_package(
      "extdata",
      "abc.fa",
      package = "rbioinfcookbook"
    )
    seqs <- readAAStringSet(seqfile)
    aln <- msa::msa(seqs, method = c("ClustalOmega"))
  2. Convert the alignment:
    aln <- as.phyDat(aln, type = "AA")
  3. Make...

Finding orthologue candidates using reciprocal BLASTs

In genomics, orthology refers to the relationship between genes from different species that evolved from a common ancestral gene through speciation. Orthologous genes typically have the same function and structure and play similar roles in different organisms, even if they have diverged over time.

Orthology has many important uses in bioinformatics. Orthology can be used to infer the function of a gene in a newly sequenced genome based on its similarity to known genes in other species. This can be especially useful for identifying genes that are involved in specific biological processes or pathways. Orthologous genes can be used to compare the genomes of different organisms and study the evolution of gene families. By identifying which genes are conserved across different species, researchers can gain insights into the evolutionary history of those genes and the organisms that carry them.

Orthology can be inferred using various...
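The reciprocal best hit logic named in this recipe's title can be sketched in base R as follows. The two hit tables, their column names, and the scores are hypothetical stand-ins for parsed BLAST output, and best_hits() is a helper invented for this sketch rather than a function from any package.

```r
# Hypothetical BLAST hits of species A genes against species B, and vice versa
a_vs_b <- data.frame(
  query    = c("a1", "a1", "a2", "a3"),
  subject  = c("b1", "b2", "b2", "b3"),
  bitscore = c(200, 150, 180, 90),
  stringsAsFactors = FALSE
)
b_vs_a <- data.frame(
  query    = c("b1", "b2", "b3", "b3"),
  subject  = c("a1", "a2", "a1", "a3"),
  bitscore = c(210, 175, 95, 60),
  stringsAsFactors = FALSE
)

# Keep only the highest-scoring subject for each query
best_hits <- function(hits) {
  ordered <- hits[order(hits$query, -hits$bitscore), ]
  ordered[!duplicated(ordered$query), c("query", "subject")]
}

fwd <- best_hits(a_vs_b)
bwd <- best_hits(b_vs_a)

# Flip the reverse search so both tables read "A gene, B gene"
names(bwd) <- c("subject", "query")

# A pair is a candidate orthologue only if each gene is the other's best hit
rbh <- merge(fwd, bwd, by = c("query", "subject"))
```

In this toy data, a3 picks b3 as its best hit but b3 prefers a1, so that pair is not reciprocal and is left out of rbh.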
