Reformatting and extracting existing data into new columns using stringr
Text manipulation is important in bioinformatics as it allows, among other things, for efficient processing and analysis of DNA and protein sequence annotation data. The R stringr package is a good choice for text manipulation because it provides a simple and consistent interface for common string operations, such as pattern matching and string replacement. stringr is built on top of the powerful stringi manipulation library, making it a fast and efficient tool for working with arbitrary strings. In this recipe, we’ll look at rationalizing data held in messy FAST-All (FASTA)-style sequence headers.
Getting ready
We’ll need the stringr package and the Arabidopsis gene names in the ath_seq_names vector, which is provided by the rbioinfcookbook package.
How to do it…
To reformat gene names using stringr, we can proceed as follows:
- Capture the ATxGxxxxx format IDs:
library(rbioinfcookbook)
library(stringr)
ids <- str_extract(ath_seq_names, "^AT\\dG.*\\.\\d")
- Separate the string into elements and extract the description:
description <- str_split(ath_seq_names, "\\|", simplify = TRUE)[,3] |> str_trim()
- Separate the string into elements and extract the gene information:
info <- str_split(ath_seq_names, "\\|", simplify = TRUE)[,4] |> str_trim()
- Match and recall the chromosome and coordinates:
chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")
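- Find the character position at which the strand information begins and use it as an index:
# "FORWARD|REVERSE" is an alternation; square brackets would instead match any single listed character
strand_pos <- str_locate(info, "FORWARD|REVERSE")
strand <- str_sub(info, start = strand_pos[, "start"], end = strand_pos[, "start"])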
- Extract the length information:
lengths <- str_match(info, "LENGTH=(\\d+)$")[,2]
- Combine all captured information into a data frame:
results <- data.frame(
  ids = ids,
  description = description,
  chromosome = as.integer(chr[,2]),
  start = as.integer(chr[,3]),
  end = as.integer(chr[,4]),
  strand = strand,
  length = as.integer(lengths)
)
And that gives us a very nice, reformatted data frame.
How it works…
The R code uses the stringr package to extract, split, and manipulate information from a vector of sequence names (ath_seq_names) and assigns the resulting information to different variables. The rbioinfcookbook package provides the initial ath_seq_names vector.
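The first step of the recipe uses the str_extract() function from stringr to extract a specific pattern of characters. The "^AT\dG.*\.\d" regex matches text at the start of a string that begins with "AT", followed by one digit, then "G", then any number of characters, then a literal dot (the backslash stops the dot from meaning "any character"), and finally one digit. In the R code, each backslash is doubled because it must also be escaped inside an R string. stringr operations are vectorized, so every element of the input vector is processed in a single call.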
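As a small illustration of the vectorized behavior, the sketch below runs the same pattern over a few made-up headers. The headers are invented for this example and are not taken from ath_seq_names. Because ".*" is greedy, the match extends to the last dot-plus-digit in the string, so the sketch also tries a stricter pattern that assumes the ID always has five digits after the "G". That stricter pattern is an assumption for illustration, not part of the recipe.

```r
library(stringr)

# Made-up headers in the layout this recipe works with (hypothetical data)
headers <- c(
  "AT1G01010.1 | Symbols: EX1 | example protein one | chr1:100-500 FORWARD LENGTH=401",
  "AT2G02020.2 | Symbols: EX2 | example protein two | chr2:2000-2600 REVERSE LENGTH=350",
  "AT3G03030.1 | Symbols: EX3 | example protein 2.5 like | chr3:300-900 FORWARD LENGTH=601",
  "not a gene header"
)

# The recipe's pattern: one call processes every element; non-matches become NA
loose <- str_extract(headers, "^AT\\dG.*\\.\\d")

# A stricter, assumed pattern: exactly five digits, a dot, then the version number
strict <- str_extract(headers, "^AT\\dG\\d{5}\\.\\d+")

print(loose)
print(strict)
```

In this sketch, the third header contains "2.5" in its description, so the greedy pattern returns more than the ID for that element, while the stricter pattern stops at the version number.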
Steps 2 and 3 are similar and use the str_split() function to split the ath_seq_names vector at the "|" character. The pattern is written as "\\|" because "|" is a regex metacharacter (alternation) and must be escaped to be matched literally. The simplify option returns a matrix of results with one row per input string and one column per substring. The third and fourth columns of that matrix are kept, and the str_trim() function removes troublesome leading and trailing whitespace from them.
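The shape of the matrix returned with simplify = TRUE can be seen in a minimal sketch using two made-up headers (hypothetical data, not from ath_seq_names). Here the split is done once and both columns are taken from the same matrix, which is a small variation on the recipe rather than the recipe's own code.

```r
library(stringr)

# Made-up headers with four "|"-separated fields each (hypothetical data)
headers <- c(
  "AT1G01010.1 | Symbols: EX1 | example protein one | chr1:100-500 FORWARD LENGTH=401",
  "AT2G02020.2 | Symbols: EX2 | example protein two | chr2:2000-2600 REVERSE LENGTH=350"
)

# simplify = TRUE returns a character matrix: one row per header, one column per field
parts <- str_split(headers, "\\|", simplify = TRUE)
print(dim(parts))

# The raw fields still carry the spaces that surrounded each "|"
untrimmed <- parts[, 3]

# str_trim() strips leading and trailing whitespace
description <- str_trim(parts[, 3])
info <- str_trim(parts[, 4])

print(description)
print(info)
```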
The following line of code uses the str_match() function to extract specific substrings from the info variable that match the "chr(\d):(\d+)-(\d+)" regex. This regex matches any stretch of text made up of "chr", followed by one digit, then ":", then one or more digits, then "-", and finally one or more digits. The '()' bracket symbols mark the pieces of text to save. str_match() returns a matrix in which the first column holds the complete match and each saved piece goes into its own subsequent column, which is why the chromosome, start, and end are later read from columns 2, 3, and 4.
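The column layout of the str_match() result is easiest to see on a tiny input. The sketch below uses made-up info strings (hypothetical data). The third entry uses a letter instead of a digit after "chr" purely to show what happens when the pattern does not match.

```r
library(stringr)

# Made-up info fields (hypothetical data); the last one will not match "chr(\\d)"
info <- c(
  "chr1:100-500 FORWARD LENGTH=401",
  "chr2:2000-2600 REVERSE LENGTH=350",
  "chrC:10-20 FORWARD LENGTH=11"
)

chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")

# Column 1 is the whole match; columns 2 to 4 are the three capture groups
print(chr)

# Captured values are character strings, so convert them before doing arithmetic
chromosome <- as.integer(chr[, 2])
start <- as.integer(chr[, 3])
end <- as.integer(chr[, 4])
```

A row of NA values marks an element where the pattern did not match, and as.integer() carries those NA values through.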
The next line of code uses the str_locate() function to find the position of the first occurrence of either FORWARD or REVERSE in each element of the info variable; it returns a matrix of start and end positions. The start position is then used to extract the character at that position using str_sub(), so the strand is recorded as F or R. The last line of code uses the str_match() function to capture the one or more digits that follow "LENGTH=" at the end of each info string; taking the second column of the result keeps only the captured digits.
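A minimal sketch of the locate-then-subset idea follows, using made-up info strings (hypothetical data). It also contrasts an alternation with a bracketed character class, since the two look similar but match different things: the alternation matches a whole word, whereas the class matches any one of the listed characters.

```r
library(stringr)

# Made-up info fields (hypothetical data)
info <- c(
  "chr1:100-500 FORWARD LENGTH=401",
  "chr2:2000-2600 REVERSE LENGTH=350"
)

# str_locate() returns an integer matrix with "start" and "end" columns
pos <- str_locate(info, "FORWARD|REVERSE")
print(pos)

# Start position only: the first letter of the strand word
first_letter <- str_sub(info, start = pos[, "start"], end = pos[, "start"])

# Start and end positions: the whole strand word
whole_word <- str_sub(info, start = pos[, "start"], end = pos[, "end"])

# Alternation versus character class on a made-up string with an early capital "D"
tricky <- "ID=E7 FORWARD"
alt_start <- str_locate(tricky, "FORWARD|REVERSE")[, "start"]
class_start <- str_locate(tricky, "[FORWARD|REVERSE]")[, "start"]

print(first_letter)
print(whole_word)
print(c(alternation = unname(alt_start), class = unname(class_start)))
```

In the last comparison, the alternation finds the word itself, while the character class stops at the first capital letter that happens to be in its set.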
Finally, the code creates the results data frame by combining the extracted and subsetted variables, converting the chromosome, start, end, and length columns from character strings to integers.
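To check that the columns end up with the intended types, the whole flow can be sketched end to end on two made-up headers (hypothetical data, not from ath_seq_names). The stringsAsFactors = FALSE argument is added here as a precaution for R versions before 4.0 and is an assumption of this sketch, not something the recipe specifies.

```r
library(stringr)

# Made-up headers (hypothetical data)
headers <- c(
  "AT1G01010.1 | Symbols: EX1 | example protein one | chr1:100-500 FORWARD LENGTH=401",
  "AT2G02020.2 | Symbols: EX2 | example protein two | chr2:2000-2600 REVERSE LENGTH=350"
)

fields <- str_split(headers, "\\|", simplify = TRUE)
info <- str_trim(fields[, 4])
chr <- str_match(info, "chr(\\d):(\\d+)-(\\d+)")
strand_pos <- str_locate(info, "FORWARD|REVERSE")

results <- data.frame(
  ids = str_extract(headers, "^AT\\dG.*\\.\\d"),
  description = str_trim(fields[, 3]),
  chromosome = as.integer(chr[, 2]),
  start = as.integer(chr[, 3]),
  end = as.integer(chr[, 4]),
  strand = str_sub(info, strand_pos[, "start"], strand_pos[, "start"]),
  length = as.integer(str_match(info, "LENGTH=(\\d+)$")[, 2]),
  stringsAsFactors = FALSE  # keep text columns as character on older R versions
)

# One class name per column, to confirm the numeric columns are integers
col_types <- vapply(results, function(x) class(x)[1], character(1))
print(col_types)
print(results)
```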