You're reading from  R Bioinformatics Cookbook - Second Edition

Product type: Book
Published in: Oct 2023
Publisher: Packt
ISBN-13: 9781837634279
Edition: 2nd Edition
Author (1)
Dan MacLean

Professor Dan MacLean has a PhD in molecular biology from the University of Cambridge and gained postdoctoral experience in genomics and bioinformatics at Stanford University in California. Dan is now an honorary professor at the School of Computing Sciences at the University of East Anglia. He has worked in bioinformatics and plant pathogenomics, specializing in R and Bioconductor, and has developed analytical workflows in bioinformatics, genomics, genetics, image analysis, and proteomics at the Sainsbury Laboratory since 2006. Dan has developed and published software packages in R, Ruby, and Python, with over 100,000 downloads combined.

Easily Performing Statistical Tests Using Linear Models

Linear models are a statistical tool used to model the relationship between a dependent variable and one or more independent variables. They are based on the assumption that the relationship between the variables is linear, meaning that the change in the dependent variable is proportional to the change in the independent variables.

Linear models are widely used in many fields, including bioinformatics. In bioinformatics, linear models can be used to analyze large datasets, such as gene expression data. For example, linear models can be used to identify differentially expressed genes between different experimental conditions or to predict the expression of genes based on other variables, such as clinical data.

Linear models are closely related to statistical tests, such as t-tests and analysis of variance (ANOVA). In fact, t-tests and ANOVA can be seen as special cases of linear models. For example, a two-sample t-test is...

Technical requirements

We will use renv to manage packages in a project-specific way. To use renv to install packages, you will first need to install the renv package. You can do this by running the following commands in your R console:

  1. Install renv:
    install.packages("renv")
  2. Create a new renv environment:
    renv::init()

    This will create a new directory called renv in your current project directory, along with an renv.lock lockfile.

  3. You can then install packages with the following:
    renv::install("package name")
  4. You can also use the renv package manager to install Bioconductor packages by running the following command:
    renv::install("bioc::package name")
  5. For example, to install the Biobase package, you would run the following:
    renv::install("bioc::Biobase")
  6. You can use renv to install development packages from GitHub with this command:
    renv::install("user name/repo name")
  7. For example, to install the user danmaclean package rbioinfcookbook, you would run the following...

Modeling data with a linear model

Linear models are a type of statistical model used to analyze the relationship between a dependent variable and one or more independent variables. In essence, they seek to fit a line that best describes the relationship between these variables, allowing us to make predictions about the dependent variable based on the values of the independent variables. The equation for a simple linear model can be written as follows:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

where $y$ is the dependent variable, $x$ is the independent variable, $\beta_0$ and $\beta_1$ are coefficients that represent the intercept and slope of the line, respectively, and $\varepsilon$ is the error term.

The output of a linear model typically includes the coefficients of the model, which describe the strength and direction of the relationship between the variables, as well as measures of the model’s goodness of fit, such as the R-squared value.
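
As a minimal sketch of the equation and output described above, the following fits a simple linear model with base R's lm(). The data are simulated, and the variable names, sample size, and true coefficients are illustrative assumptions rather than values from the book:

```r
# Simulate data from y = beta0 + beta1 * x + error
# (beta0 = 2, beta1 = 3 and n = 50 are assumptions for illustration)
set.seed(123)
x <- runif(50, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)
sim_df <- data.frame(x = x, y = y)

# Fit the simple linear model y ~ x
fit <- lm(y ~ x, data = sim_df)

# Estimated intercept (beta0) and slope (beta1)
coef(fit)

# Goodness of fit: proportion of variance in y explained by the model
summary(fit)$r.squared
```

Calling summary(fit) prints the full coefficient table, including standard errors and p-values for each coefficient.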

Linear models...

Using a linear model to compare the mean of two groups

The t-test is a statistical method used to help us decide whether there is likely to be a difference between the means of two groups. t-tests are probably the most widely used tests in bioinformatics and biology, but they are often applied without checking whether the assumptions of the test hold, and their results are often interpreted uncritically. By learning how to do the t-test by building a linear model, you will be able to check those assumptions, since a well-fitting model suggests that the assumptions are reasonable. The t-test is a special case of the linear model because it can be framed as a linear regression problem with a binary predictor variable.

In the linear model, we try to fit a linear equation that describes the relationship between a response output variable (dependent variable) and one or more predictor input variables (independent variables). In the case of a t-test, we have one binary predictor variable, which...
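
The equivalence between the t-test and a linear model with a binary predictor can be sketched as follows. The data are simulated, and the group labels, means, and sample sizes are assumptions made for illustration. The comparison uses the equal-variance form of the t-test, since that is the form a standard linear model corresponds to:

```r
# Simulated measurements for two groups (all values are assumptions)
set.seed(42)
two_groups <- data.frame(
  group = factor(rep(c("control", "treatment"), each = 20)),
  value = c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 12, sd = 2))
)

# Classic two-sample t-test, assuming equal variances
tt <- t.test(value ~ group, data = two_groups, var.equal = TRUE)

# The same comparison as a linear model with a binary predictor;
# the "grouptreatment" coefficient is the difference between group means
fit <- lm(value ~ group, data = two_groups)
lm_p <- summary(fit)$coefficients["grouptreatment", "Pr(>|t|)"]

# The two p-values agree up to floating-point tolerance
all.equal(tt$p.value, lm_p)
```

Because fit is an ordinary lm object, its residuals can be inspected, for example with plot(fit, which = 1:2), to judge whether the assumptions look reasonable.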

Using a linear model and ANOVA to compare multiple groups in a single variable

ANOVA is a statistical method used to test whether there is a significant difference between the means of two or more groups. ANOVA compares the variance within groups to the variance between groups to determine if there is a statistically significant difference in the means of the groups. ANOVA is commonly used in experiments where a response variable is measured across several groups under different experimental conditions.

ANOVA can be used to compare gene expression levels across multiple samples under different experimental conditions. In this case, the response variable is the gene expression level, and the categorical variable is the experimental condition. ANOVA can also be used in clinical trials to compare the effectiveness of different treatments or interventions for a disease or medical condition.

Linear models can be used to perform ANOVA by fitting a linear model to the data with a categorical variable that represents...
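
A minimal sketch of one-way ANOVA through a linear model might look like the following. The expression values are simulated, and the condition names and group means are assumptions for illustration:

```r
# Simulated expression of one gene under three conditions (values are assumptions)
set.seed(1)
expr_df <- data.frame(
  condition = factor(rep(c("A", "B", "C"), each = 10)),
  expression = c(rnorm(10, mean = 5), rnorm(10, mean = 5.5), rnorm(10, mean = 8))
)

# Fit a linear model with the categorical variable as the predictor
fit <- lm(expression ~ condition, data = expr_df)

# anova() on the fitted model gives the one-way ANOVA table
aov_table <- anova(fit)
aov_table
```

The table has one row for the condition term and one for the residuals. With three groups of ten samples, these carry 2 and 27 degrees of freedom, respectively.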

Using linear models and ANOVA to compare multiple groups in multiple variables

Two-way ANOVA is a statistical method used to analyze the effects of two categorical independent variables, also known as factors, on a continuous dependent variable. The two independent variables can be either fixed or random.

The main purpose of two-way ANOVA is to examine whether there is a significant interaction between the two independent variables, as well as to determine the main effects of each independent variable on the dependent variable.

The analysis involves calculating the sum of squares for each main effect, the interaction, and the residual error, and dividing each by its degrees of freedom to obtain mean squares. Each F ratio is the mean square of an effect divided by the residual mean square. The F ratios are then compared to critical values from an F-distribution to determine whether the effects are statistically significant.

Like the one-way ANOVA seen in the Using a linear model and ANOVA to compare multiple groups in a single variable recipe, the basis is...
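
The following sketch fits a two-way ANOVA as a linear model and recomputes one F ratio by hand from the sums of squares. The design is simulated and balanced, and the factor names (genotype, treatment) and effect sizes are hypothetical:

```r
# Balanced two-factor design with six replicates per cell (all values are assumptions)
set.seed(7)
two_way <- expand.grid(
  genotype = factor(c("wt", "mutant")),
  treatment = factor(c("mock", "drug")),
  replicate = 1:6
)
two_way$response <- rnorm(nrow(two_way), mean = 10) +
  ifelse(two_way$treatment == "drug", 2, 0)

# genotype * treatment fits both main effects and their interaction
fit <- lm(response ~ genotype * treatment, data = two_way)
tab <- anova(fit)
tab

# Recompute the F ratio for treatment by hand:
# mean square = sum of squares / degrees of freedom
ms <- tab[["Sum Sq"]] / tab[["Df"]]
names(ms) <- rownames(tab)
f_treatment <- ms[["treatment"]] / ms[["Residuals"]]
f_treatment
```

The hand-computed f_treatment matches the "F value" that anova() reports for the treatment row.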

Testing and accounting for interactions between variables in linear models

An interaction between variables occurs when the effect of one predictor variable on the response variable depends on the level of another predictor variable. In other words, the effect of one variable is not constant across different levels of the other variable. Interactions can occur, for example, between different drug regimes in medical trials or, more generally, whenever multiple experimental conditions are changed together.

Linear models can model interactions by including interaction terms in the model formula. An interaction term is the product of two or more predictor variables. Centering each predictor to have a mean of zero is optional, but it can make the main-effect coefficients easier to interpret.

Suppose we have a linear regression model with two predictor variables, x1 and x2, and we want to examine their interaction. The interaction term can be included in the model as follows:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3(x_1$...
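
As a sketch of how an interaction between two continuous predictors can be fitted and tested in R, consider the following. The data are simulated, and the true coefficients, including the interaction coefficient of 1.5, are assumptions. The interaction term is taken to be the product of x1 and x2, as described in the text:

```r
# Simulated data with a known interaction (all coefficients are assumptions)
set.seed(99)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 1 * x2 + 1.5 * x1 * x2 + rnorm(n, sd = 0.5)
int_df <- data.frame(y, x1, x2)

# In a formula, x1 * x2 expands to x1 + x2 + x1:x2
fit_int <- lm(y ~ x1 * x2, data = int_df)
coef(fit_int)

# Compare against the additive model to test whether the interaction matters
fit_add <- lm(y ~ x1 + x2, data = int_df)
cmp <- anova(fit_add, fit_int)
cmp
```

The coefficient named x1:x2 estimates the interaction, and the model comparison in cmp tests whether adding it improves the fit.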

Doing tests for differences in data in two categorical variables

Categorical output variables are response (dependent) variables that take on discrete values from a finite set of possible outcomes. Categorical variables come in two types: nominal and ordinal.

Ordinal variables have a natural ordering among the categories. Examples of ordinal variables include education level, income bracket, and satisfaction ratings. In linear models, ordinal variables can be represented using their numerical values or by assigning each category a numerical rank. For example, in a linear model predicting job satisfaction based on salary, the ordinal variable income bracket could be assigned a numerical rank from one to five based on the size of the income range. Ranking helps us to use the linear model framework fairly easily.

Nominal variables are variables that have no inherent order or ranking among the categories. Examples of...
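
The two kinds of categorical variable can be represented in R as sketched below. The satisfaction and tissue values are hypothetical examples. Ranking is applied to the ordinal variable, while the nominal variable is expanded into indicator (dummy) columns, which is how R's model functions encode unordered factors by default:

```r
# Ordinal variable: categories have a natural order, so ranks are meaningful
satisfaction <- c("low", "high", "medium", "low", "high")
sat_ordered <- factor(satisfaction,
                      levels = c("low", "medium", "high"),
                      ordered = TRUE)
sat_rank <- as.integer(sat_ordered)  # low = 1, medium = 2, high = 3
sat_rank

# Nominal variable: no inherent order, so it is encoded as indicator columns
tissue <- factor(c("leaf", "root", "stem", "root", "leaf"))
mm <- model.matrix(~ tissue)
mm
```

In mm, the first level (leaf) is absorbed into the intercept, and the remaining levels each get their own 0/1 column.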

Making predictions using linear models

Linear models are commonly used in bioinformatics for prediction tasks due to their simplicity, interpretability, and ability to handle high-dimensional datasets. In bioinformatics, researchers often work with large datasets that have a large number of features (such as gene expression data or sequence data), making it challenging to analyze them with more complex models. Linear models offer a straightforward and computationally efficient way to analyze these datasets. Linear models can help researchers identify genes or genetic variants that are associated with a particular trait or disease. They can also be used in feature selection, which is an important step in bioinformatics data analysis. Feature selection aims to identify the most relevant features (genes, proteins, etc.) that are associated with the outcome of interest (disease, drug response, etc.). Linear models can be used to rank features based on their importance and select the most...
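
A minimal sketch of prediction with a fitted linear model is shown below, using base R's predict(). The training data are simulated, and the feature names (gene_a, gene_b), the trait, and the true coefficients are hypothetical:

```r
# Simulated training data: a trait driven by two expression features (assumptions)
set.seed(2023)
train <- data.frame(gene_a = rnorm(80), gene_b = rnorm(80))
train$trait <- 4 + 1.2 * train$gene_a - 0.8 * train$gene_b + rnorm(80, sd = 0.3)

# Fit the model on the training data
fit <- lm(trait ~ gene_a + gene_b, data = train)

# New samples must have the same predictor column names as the training data
new_samples <- data.frame(gene_a = c(-1, 0, 1), gene_b = c(0.5, 0, -0.5))

# Point predictions with 95% prediction intervals (columns: fit, lwr, upr)
pred <- predict(fit, newdata = new_samples, interval = "prediction")
pred
```

The lwr and upr columns give a range for each new observation, which helps convey how uncertain an individual prediction is.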
