You're reading from R Bioinformatics Cookbook - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781837634279
Edition: 2nd Edition
Author: Dan MacLean

Professor Dan MacLean has a PhD in molecular biology from the University of Cambridge and gained postdoctoral experience in genomics and bioinformatics at Stanford University in California. Dan is now an honorary professor at the School of Computing Sciences at the University of East Anglia. He has worked in bioinformatics and plant pathogenomics, specializing in R and Bioconductor, and has developed analytical workflows in bioinformatics, genomics, genetics, image analysis, and proteomics at the Sainsbury Laboratory since 2006. Dan has developed and published software packages in R, Ruby, and Python, with over 100,000 downloads combined.

Using dplyr to summarize data in large tables

Split-apply-combine is a technique used in data science to analyze and manipulate large datasets by breaking them down into smaller, more manageable pieces, applying a function or operation to each piece, and then combining the results. It’s a powerful method for working with data because it allows you to process and analyze data in a way that is both efficient and interpretable. The process can be repeated multiple times to gain deeper insights into the data.
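To make the three stages concrete, here is a minimal sketch in base R, before any dplyr is involved. The small table of gene lengths and the names pieces, totals, and result are illustrative assumptions rather than part of the recipe.

```r
# Illustrative table: one row per gene, with its chromosome and length
genes <- data.frame(
  chromosome_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  length = c(2000, 1500, 3000, 2500, 2000, 1000, 2000, 3000)
)

# Split: one vector of gene lengths per chromosome
pieces <- split(genes$length, genes$chromosome_id)

# Apply: compute the same summary (here, the sum) on every piece
totals <- sapply(pieces, sum)

# Combine: gather the per-group results back into a single data frame
result <- data.frame(
  chromosome_id = as.numeric(names(totals)),
  total_length = unname(totals)
)
result
```

The result has one row per chromosome. dplyr expresses the same idea with group_by() for the split and summarise() for the apply and combine stages.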

In the tidyverse, the dplyr package provides a set of tools for implementing the split-apply-combine technique; we’ll look at those in this recipe.

Getting ready

We will need the dplyr and tidyr packages for this recipe.

How to do it…

The functionality of the dplyr package for split-apply-combine techniques is shown in the following steps:

  1. Create the initial data frame:
    chromosome_id <- c(1,1,1,2,2,3,3,3)
    gene_id <- c("A1","A2","A3","B1","B2","C1","C2","C3")
    strand <- c("forward","reverse","forward","forward",
                "reverse","forward","forward","reverse")
    length <- c(2000,1500,3000,2500,2000,1000,2000,3000)
    genes_df <- data.frame(chromosome_id,gene_id,strand,length)
  2. Group on a single column:
    library(dplyr)
    genes_df |>
      group_by(chromosome_id) |>
      summarise(total_length = sum(length))
  3. Group and summarize on multiple columns:
    genes_df |>
      group_by(chromosome_id, strand) |>
      summarise(
        num_genes = n(),
        avg_length = mean(length)
      )
  4. Work on a nested data frame:
    # Create a nested data frame
    chromosome_id <- c(1,1,1,2,2,3,3,3)
    gene_id <- c("A1","A2","A3","B1","B2","C1","C2","C3")
    strand <- c("forward","reverse","forward","forward",
                "reverse","forward","forward","reverse")
    length <- c(2000,1500,3000,2500,2000,1000,2000,3000)
    genes_df <- data.frame(chromosome_id,gene_id,strand,length)
    # Each cell of the samples list column holds a data frame of expression values
    genes_df$samples <- list(data.frame(sample_id=1:2, expression=c(2,3)),
                             data.frame(sample_id=1:3, expression=c(3,4,5)),
                             data.frame(sample_id=1:2, expression=c(4,5)),
                             data.frame(sample_id=1:3, expression=c(5,6,7)),
                             data.frame(sample_id=1:2, expression=c(6,7)),
                             data.frame(sample_id=1:2, expression=c(1,2)),
                             data.frame(sample_id=1:2, expression=c(2,3)),
                             data.frame(sample_id=1:2, expression=c(3,4))
                            )
    # Name the list column to expand; unnest() warns if it is omitted
    genes_df |>
      tidyr::unnest(samples) |>
      group_by(chromosome_id,strand) |>
      summarise(mean_expression = mean(expression))

Together, these steps give a broad set of examples of split-apply-combine in dplyr.

How it works…

Step 1 explicitly creates a data frame; we do it this way so that we can easily understand its structure.

In step 2, we use the method in its simplest form: the group_by() function is used to group the rows of a data frame based on the chromosome_id, and then we use summarise() to return a summary data frame. Step 3 is similar but shows how multiple grouping columns can be used to create more granular groups and how more than one summary function can be applied.
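One detail worth checking when grouping on several columns is how the result itself is grouped. The sketch below assumes dplyr 1.0.0 or later, where summarise() accepts a .groups argument; the names ungrouped and still_grouped are illustrative only.

```r
library(dplyr)

# Same gene table as in step 1, rebuilt so this block runs on its own
genes_df <- data.frame(
  chromosome_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  gene_id = c("A1", "A2", "A3", "B1", "B2", "C1", "C2", "C3"),
  strand = c("forward", "reverse", "forward", "forward",
             "reverse", "forward", "forward", "reverse"),
  length = c(2000, 1500, 3000, 2500, 2000, 1000, 2000, 3000)
)

# By default, summarise() removes only the last grouping column,
# so this result is still grouped by chromosome_id
still_grouped <- genes_df |>
  group_by(chromosome_id, strand) |>
  summarise(num_genes = n(), avg_length = mean(length))
group_vars(still_grouped)

# .groups = "drop" returns a plain, ungrouped summary instead
ungrouped <- genes_df |>
  group_by(chromosome_id, strand) |>
  summarise(num_genes = n(), avg_length = mean(length), .groups = "drop")
group_vars(ungrouped)
```

Both results contain one row per chromosome and strand combination. The difference only matters for later steps: further verbs applied to still_grouped operate within each chromosome, whereas ungrouped behaves like an ordinary table.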

Step 4 is more complex; the new data frame is a nested data frame, in which the samples list column holds a data frame of expression data in each cell. Before group_by() and summarise() can work on the nested values, you need to expand them with tidyr::unnest(), naming the list column to expand (here, samples); current versions of tidyr warn if the column is not specified. You can then group and summarize as usual. Note that after tidyr::unnest(), the data frame has multiple rows for each gene, one per sample, so it’s important to group by the columns of interest before summarizing.
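The effect of unnesting on the row count can be seen with a smaller table. This is a sketch on a three-gene subset chosen for illustration; nested_df, flat_df, and per_gene are hypothetical names, not objects from the recipe.

```r
library(dplyr)
library(tidyr)

# A three-gene nested table; each samples cell holds its own data frame
nested_df <- data.frame(
  chromosome_id = c(1, 1, 2),
  gene_id = c("A1", "A2", "B1"),
  strand = c("forward", "reverse", "forward")
)
nested_df$samples <- list(
  data.frame(sample_id = 1:2, expression = c(2, 3)),
  data.frame(sample_id = 1:3, expression = c(3, 4, 5)),
  data.frame(sample_id = 1:3, expression = c(5, 6, 7))
)

# Unnesting repeats the gene-level columns once per sample row
flat_df <- nested_df |> unnest(cols = samples)
nrow(nested_df)  # one row per gene
nrow(flat_df)    # one row per gene and sample

# Grouping by gene_id collapses the sample rows back to one row per gene
per_gene <- flat_df |>
  group_by(gene_id) |>
  summarise(n_samples = n(), mean_expression = mean(expression))
per_gene
```

Because genes can have different numbers of samples, a mean taken over a coarser grouping such as chromosome and strand weights every sample row equally, not every gene equally.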
