Bayesian Regression

We will cover the following recipes in this chapter:

  • Getting the posterior density in STAN
  • Formulating a linear regression model
  • Assigning the priors
  • Doing MCMC the manual way
  • Evaluating convergence with CODA
  • Bayesian variable selection
  • Using a model for prediction
  • GLMs in JAGS

Introduction

In this chapter, we present several Bayesian techniques in R, using either STAN or JAGS (the two most important Bayesian engines that can be used from R). Bayesian statistics is fundamentally different from classical statistics. In the latter, parameters are fixed quantities that need to be found. In the Bayesian framework, parameters are themselves random variables whose distributions we learn from the data. Furthermore, Bayesian statistics allows us to incorporate prior knowledge about the quantities we want to learn, and to update that knowledge as data arrives.

Getting the posterior density in STAN

STAN is the leading Bayesian engine for R, in both academia and industry. Its performance is very good, largely because its core is written in C++.

In Bayesian statistics, we take a very different approach from classical statistics. Here, each coefficient behaves as a random variable, and we use appropriate algorithms to recover the distribution of each one of them. But there is an extra element: we can incorporate prior distributions into our approach. Consequently, the idea is the following:

Bayesian statistics can be interpreted as an approach in which we start with a prior/initial idea about a coefficient, update that expectation using the data, and end up with a posterior distribution. This is not that different from the process humans follow when learning new things; for example, we...
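As a minimal, self-contained illustration of this prior-to-posterior updating (a hypothetical coin-flip example with a conjugate Beta prior, not one of the recipes in this chapter), the whole update takes only a few lines of R:

# Hypothetical example: Beta prior on a coin's probability of heads,
# updated with binomial data.
prior_a <- 5; prior_b <- 5        # prior belief: the coin is roughly fair
heads <- 7; flips <- 10           # observed data: 7 heads out of 10 flips

# With a conjugate Beta prior, the posterior is again a Beta distribution.
post_a <- prior_a + heads
post_b <- prior_b + (flips - heads)

c(prior_mean = prior_a / (prior_a + prior_b),
  posterior_mean = post_a / (post_a + post_b))

# Plot both densities to see how the data shifts the initial belief.
curve(dbeta(x, prior_a, prior_b), from = 0, to = 1, ylab = "density")
curve(dbeta(x, post_a, post_b), add = TRUE, lty = 2)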

Formulating a linear regression model

The mechanics of Bayesian linear regression follow the same logic as described in the previous chapter. The only real difference is that we specify a distribution for the residuals: they are assumed to follow a Gaussian distribution with zero mean and a certain variance. These residuals are the difference between the actual values and the expected ones, and the expected values are a sum of coefficients multiplied by the corresponding variables.

In a linear regression context, we want to draw inferences about the coefficients. But here (as we have already mentioned), we estimate a posterior density for each of them.
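To make this concrete, the following is a rough sketch of what such a model can look like with rstan; the variable names, priors, and simulated data are illustrative, not the ones used later in the recipe:

library(rstan)

# Illustrative Stan program: y = alpha + beta * x + Gaussian noise.
model_code <- "
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;           // intercept
  real beta;            // slope
  real<lower=0> sigma;  // residual standard deviation
}
model {
  alpha ~ normal(0, 10);               // weakly informative priors
  beta  ~ normal(0, 10);
  sigma ~ cauchy(0, 5);
  y ~ normal(alpha + beta * x, sigma); // Gaussian residuals, zero mean
}
"

# Simulated data, just to make the sketch runnable.
set.seed(1)
N <- 100; x <- rnorm(N); y <- 2 + 3 * x + rnorm(N)

fit <- stan(model_code = model_code,
            data = list(N = N, x = x, y = y),
            iter = 2000, chains = 4)
print(fit)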

Getting ready

In order to run...

Assigning the priors

As we know, the priors are ingested by the MCMC algorithm, and are used to calculate the posterior densities. But how should the priors be assigned? Do we actually need a prior for each parameter?

Defining the support

Priors are just statistical distributions that reflect the initial expectation the modeler has about each parameter. The very first thing we need to decide is what the support of the corresponding distributions should be. For example, for most coefficients in a linear regression model, the modeler very likely knows the correct sign. When modeling sales of a product in terms of its price and a promotional effect, the price effect should be negative (a higher price means fewer sales), and...
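As an illustration (hypothetical names from a sales model, not code taken from the recipe), a known sign can be encoded either by restricting the parameter's declared support or by choosing a prior that concentrates its mass on the correct side; in Stan, these two options look roughly as follows:

# Option 1: restrict the support of the parameter directly.
constrained_support <- "
parameters {
  real<upper=0> beta_price;   // only negative values are allowed
  real beta_promo;
}
"

# Option 2: keep the support unrestricted, but centre the prior
# on a negative effect.
informative_prior <- "
model {
  beta_price ~ normal(-1, 0.5);  // most prior mass on negative values
  beta_promo ~ normal(0, 5);     // vaguer prior where we know less
}
"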

Doing MCMC the manual way

In this recipe, we will go through a full example of coding an MCMC algorithm ourselves. This will give us a much better grasp of MCMC mechanics.

In the Bayesian world, we specify prior densities and use data to update them, obtaining posterior densities. The problem is that there are only a few cases in which we can calculate those posterior densities analytically; these are known as conjugate families.

The Bayesian problem can be formulated as recovering the conditional density of the parameters given the data. This is equal to the joint density of the parameters and the data divided by the marginal density of the data. It follows from Bayes' theorem, which states that we can invert a conditional probability by dividing the joint probability by the appropriate marginal density. This is the density that we want to compute, but even if...
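In symbols, with parameters θ and data y, this is p(θ | y) = p(y | θ) p(θ) / p(y), which is proportional to p(y | θ) p(θ); the denominator p(y) is exactly the marginal density that is usually intractable.

As a minimal sketch of the kind of sampler this recipe builds by hand, here is a random-walk Metropolis algorithm for a single parameter; the target (the posterior of a normal mean under a normal prior) and all of the names are illustrative assumptions, not the recipe's own code:

set.seed(42)

# Hypothetical data: posterior of the mean of a normal with known sd = 1,
# under a normal(0, 10) prior on that mean.
y <- rnorm(50, mean = 3, sd = 1)

# Log posterior = log likelihood + log prior (up to an additive constant).
log_post <- function(mu) {
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE)) +
    dnorm(mu, mean = 0, sd = 10, log = TRUE)
}

n_iter <- 10000
chain <- numeric(n_iter)
chain[1] <- 0                 # arbitrary starting value
step_sd <- 0.5                # random-walk proposal scale

for (i in 2:n_iter) {
  proposal <- rnorm(1, mean = chain[i - 1], sd = step_sd)
  # Metropolis acceptance ratio, computed on the log scale.
  log_alpha <- log_post(proposal) - log_post(chain[i - 1])
  if (log(runif(1)) < log_alpha) {
    chain[i] <- proposal      # accept the proposed value
  } else {
    chain[i] <- chain[i - 1]  # reject: stay at the current value
  }
}

# Discard burn-in and summarise the posterior draws.
posterior <- chain[-(1:2000)]
mean(posterior); quantile(posterior, c(0.025, 0.975))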

Evaluating convergence with CODA

The Convergence Diagnosis and Output Analysis (CODA) package is frequently used to evaluate the convergence of MCMC output. It provides several statistical tests for assessing whether MCMC chains have converged. Many prominent statisticians argue that convergence diagnostics should only be used to flag obvious problems, and cannot be used to authoritatively tell whether MCMC chains have converged.

Remember that MCMC is an algorithm that generates correlated random numbers according to a particular distribution (in this case, our posterior distribution) only when the stationary distribution has been achieved. Consequently, we need to check the following two things:

  • That the stationary distribution has been achieved. This is almost never simple, since we can never authoritatively tell whether that distribution has been reached...
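Assuming we already have a vector of draws such as the chain from the manual Metropolis sketch above (any set of MCMC draws works the same way), a minimal round of coda checks looks like this:

library(coda)

# Wrap the raw draws (after burn-in) as an mcmc object.
draws <- mcmc(chain[-(1:2000)])

summary(draws)         # posterior means, standard deviations, and quantiles
effectiveSize(draws)   # effective sample size, accounting for autocorrelation
traceplot(draws)       # visual check that the chain mixes well
autocorr.plot(draws)   # how quickly the autocorrelation dies out

# Geweke's diagnostic compares the start and the end of the chain.
geweke.diag(draws)

# The Gelman-Rubin diagnostic needs several chains started from different points:
# gelman.diag(mcmc.list(mcmc(chain_1), mcmc(chain_2)))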

Bayesian variable selection

Variable selection within a classical context is usually simple. It really boils down to selecting an appropriate metric (such as the AIC or p-values) and evaluating the model in a greedy way: starting with either a simple (or complex) model, and seeing what happens when we add (or remove) terms.

In a Bayesian context, things are not that easy, since we are not treating parameters as fixed values. We are estimating a posterior density for each one, and a density has no associated significance level, so we can no longer remove variables based on p-values. The AIC route can't be used either, since we don't have a single AIC value but a distribution of possible AICs.

Clearly, we need a different way of doing variable selection, one that takes into account that we are dealing with densities. Kuo and Mallick (https://www.jstor.org/stable/25053023?seq=1#page_scan_tab_contents...
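As an illustrative (and deliberately simplified) sketch of the indicator-variable idea behind this kind of approach, each coefficient is multiplied by a Bernoulli inclusion indicator, and the posterior frequencies of those indicators tell us which variables matter; a JAGS model block in this spirit (hypothetical names, not the recipe's code) could look like this:

# Simplified indicator-variable model string for JAGS.
model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- alpha + inprod(b_incl[], X[i, ])
  }
  alpha ~ dnorm(0, 0.001)
  for (j in 1:P) {
    beta[j]   ~ dnorm(0, 0.001)    # regression coefficient
    gamma[j]  ~ dbern(0.5)         # inclusion indicator (1 = keep the variable)
    b_incl[j] <- gamma[j] * beta[j]
  }
  tau ~ dgamma(0.001, 0.001)       # precision of the residuals
}
"
# Monitoring gamma[] yields a posterior inclusion probability for each variable.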

Using a model for prediction

Once we have trained a model and recovered the marginal posterior densities, we will probably want to use our model for predicting/scoring new samples. This is not as easy as in the classical approach, because our parameters are no longer fixed values but distributions. This means that the predictions won't be point estimates, but rather a range of possible values, each with an associated probability.
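As a rough sketch of what this looks like in practice, assuming an rstan fit with parameters alpha, beta, and sigma (as in the illustrative linear regression sketch earlier in this chapter), prediction for a new observation amounts to pushing every posterior draw through the model:

library(rstan)

# 'fit' is assumed to be an rstan fit with parameters alpha, beta, sigma.
draws <- rstan::extract(fit)

x_new <- 1.5   # a new observation we want to score

# One simulated outcome per posterior draw: a full predictive distribution,
# not a single point estimate.
y_pred <- rnorm(length(draws$alpha),
                mean = draws$alpha + draws$beta * x_new,
                sd   = draws$sigma)

mean(y_pred)                        # a point summary, if one is needed
quantile(y_pred, c(0.025, 0.975))   # a 95% predictive interval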

Getting ready

We will use STAN, which can be installed via install.packages("rstan").

How to do it...

We will use...

GLMs in JAGS

GLM stands for Generalized Linear Model: a generalization of the linear model (which assumes normality) to other distributions in the so-called exponential family (the Gaussian distribution is also part of this family). This formulation allows us to fit models for several kinds of dependent variable, such as binary, categorical, and count responses. For example, logistic and Poisson regression are two models from this family.

In this example, we will do Bayesian logistic regression (one type of GLM). This model is appropriate for a categorical response that takes two possible values. Possible examples include modeling whether a customer is going to buy a product, or whether a student is going to pass an exam.

Both STAN and JAGS can handle not only linear regression models, but also a wide array of other regression models. In this exercise, we will...
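As an illustrative sketch (simulated data and hypothetical names, not the exact model from this recipe), a Bayesian logistic regression in JAGS via the rjags package can look like this:

library(rjags)

# Simulated binary outcome driven by one predictor.
set.seed(1)
N <- 200
x <- rnorm(N)
y <- rbinom(N, size = 1, prob = plogis(-0.5 + 1.2 * x))

model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- alpha + beta * x[i]
  }
  alpha ~ dnorm(0, 0.001)   # vague priors on the coefficients
  beta  ~ dnorm(0, 0.001)
}
"

jags_model <- jags.model(textConnection(model_string),
                         data = list(N = N, x = x, y = y),
                         n.chains = 3)
update(jags_model, 1000)    # burn-in
samples <- coda.samples(jags_model,
                        variable.names = c("alpha", "beta"),
                        n.iter = 5000)
summary(samples)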
