Bayesian Analysis with Python - Second Edition

Product type: Book
Published in: Dec 2018
Publisher: Packt
ISBN-13: 9781789341652
Pages: 356
Edition: 2nd Edition
Author: Osvaldo Martin

Table of Contents (11 chapters)

Preface
Thinking Probabilistically
Programming Probabilistically
Modeling with Linear Regression
Generalizing Linear Models
Model Comparison
Mixture Models
Gaussian Processes
Inference Engines
Where To Go Next?
Other Books You May Enjoy

Gaussian Processes

"Lonely? You have yourself. Your infinite selves."
- Rick Sanchez (at least the one from dimension C-137)

In the last chapter, we learned about the Dirichlet process, an infinite-dimensional generalization of the Dirichlet distribution that can be used to set a prior on unknown continuous distributions. In this chapter, we will learn about the Gaussian process, an infinite-dimensional generalization of the Gaussian distribution that can be used to set a prior on unknown functions. Both the Dirichlet process and the Gaussian process are used in Bayesian statistics to build flexible models where the number of parameters is allowed to increase with the size of the data.
In this chapter, we will cover the following topics:

  • Functions as probabilistic objects
  • Kernels
  • Gaussian processes with Gaussian likelihoods
  • Gaussian processes with non-Gaussian likelihoods...

Linear models and non-linear data

In Chapter 3, Modeling with Linear Regression, and Chapter 4, Generalizing Linear Models, we learned to build models of the general form:

θ = ψ(φ(X)β)

Here, θ is a parameter for some probability distribution, for example, the mean of a Gaussian, the parameter p of a binomial, the rate of a Poisson distribution, and so on. We call ψ the inverse link function, and φ is a function such as the square root or a polynomial function. For the simple linear regression case, ψ is the identity function.

Fitting (or learning) a Bayesian model can be seen as finding the posterior distribution of the weights β; thus, this is known as the weight-view of approximating functions. As we have already seen with the polynomial regression example, by letting φ be a non-linear function, we can map the inputs X onto a feature space. We then fit a linear relation in the feature space...
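To make the weight-view concrete, here is a minimal NumPy sketch (mine, not the book's code) that uses a cubic polynomial as φ and fits the weights by ordinary least squares; a fully Bayesian treatment would instead place a prior on the weights:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, len(x))

# phi maps the scalar input onto a polynomial feature space
phi = np.vander(x, 4, increasing=True)  # columns: 1, x, x**2, x**3

# fit a *linear* relation in the feature space (least-squares point
# estimate; a Bayesian treatment would put a prior on the weights)
w, *_ = np.linalg.lstsq(phi, y, rcond=None)
y_fit = phi @ w
```

The non-linearity lives entirely in φ; the model remains linear in the weights, which is exactly what makes this view tractable.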

Modeling functions

We will begin our discussion of Gaussian processes by first describing a way to represent functions as probabilistic objects. We may think of a function, f, as a mapping from a set of inputs, X, to a set of outputs, Y. Thus, we can write:

y = f(x)

One way to represent functions is by listing for each x value its corresponding y value. In fact, you may remember this way of representing functions from elementary school:

x y
0.00 0.46
0.33 2.60
0.67 5.90
1.00 7.91

As a general case, the values of x and y will live on the real line; thus, we can see a function as a (potentially) infinite and ordered list of paired (x, y) values. The order is important because, if we shuffle the values, we will get different functions.

A function can also be represented as a (potentially) infinite array indexed by the values of x, with the important distinction that the values of...

Gaussian process regression

Let's assume we can model a value y as a function f of x plus some noise:

y = f(x) + ε

Here:

ε ∼ N(0, σ_ε)

This is similar to the assumption that we made in Chapter 3, Modeling with Linear Regression, for linear regression models. The main difference is that now we will put a prior distribution over f. Gaussian processes can work as such a prior; thus, we can write:

f(x) ∼ GP(μ(x), κ(x, x'))

Here, GP represents a Gaussian process distribution, with μ(x) being the mean function and κ(x, x') the kernel, or covariance, function. Here, we have used the word function to indicate that, mathematically, the mean and covariance are infinite objects, even when, in practice, we always work with finite objects.
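To see what a GP prior over functions looks like in practice, here is a small NumPy sketch (an illustration, not the book's code): we evaluate an exponentiated quadratic kernel on a grid of points and draw a few "functions" as samples from the resulting multivariate Gaussian:

```python
import numpy as np

def exp_quad_kernel(x1, x2, lengthscale=1.0):
    """Exponentiated quadratic (RBF) covariance between two sets of points."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(123)
x = np.linspace(0, 10, 100)

# covariance matrix over the grid, with a small jitter term on the
# diagonal for numerical stability
K = exp_quad_kernel(x, x) + 1e-8 * np.eye(len(x))

# each draw from this multivariate Gaussian is one "function"
# evaluated on the grid
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
```

Because the kernel makes nearby inputs highly correlated, each sample traces out a smooth curve; the lengthscale controls how quickly those curves wiggle.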

If the prior distribution is a GP and the likelihood is a normal distribution, then the posterior is also a GP and we can compute it analytically:

p(f(X*) | X*, X, y) ∼ N(μ*, Σ*)

μ* = K*ᵀ (K + σ²_ε I)⁻¹ y
Σ* = K** − K*ᵀ (K + σ²_ε I)⁻¹ K*

Here, K = κ(X, X), K* = κ(X, X*), and K** = κ(X*, X*); X is the observed data points and X* represents the test points...
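These closed-form expressions are easy to implement directly. The following is a minimal NumPy sketch (function and variable names are mine) using an exponentiated quadratic kernel:

```python
import numpy as np

def exp_quad_kernel(x1, x2, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, sigma_n=0.1, lengthscale=1.0):
    """Closed-form GP posterior mean and covariance at the test points."""
    K = exp_quad_kernel(x_train, x_train, lengthscale)
    K_s = exp_quad_kernel(x_train, x_test, lengthscale)
    K_ss = exp_quad_kernel(x_test, x_test, lengthscale)
    A = K + sigma_n**2 * np.eye(len(x_train))
    mu = K_s.T @ np.linalg.solve(A, y_train)       # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(A, K_s)   # posterior covariance
    return mu, cov

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(x)
x_new = np.linspace(0, 3, 31)
mu, cov = gp_posterior(x, y, x_new)
```

Near the observed points, the posterior mean tracks the data and the posterior variance shrinks; far from them, it relaxes back toward the prior.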

Regression with spatial autocorrelation

The following example is taken from the book, Statistical Rethinking, by Richard McElreath. The author kindly allowed me to reuse his example here. I strongly recommend reading his book, as you will find many good examples like this and very good explanations. The only caveat is that the book examples are in R/Stan, but don't worry and keep sampling; you will find the Python/PyMC3 version of those examples in https://github.com/pymc-devs/resources.

Well, going back to the example, we have 10 different island-societies; for each one of them, we have the number of tools they use. Some theories predict that larger populations develop and sustain more tools than smaller populations. Another important factor is the contact rates among populations.

As we have the number of tools as the dependent variable, we can use a Poisson regression with the...
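The spatial-autocorrelation idea can be sketched generatively in NumPy. The distance matrix and parameter values below are made up for illustration (the real example uses the 10 societies of the Kline dataset); the point is how a distance-based covariance feeds a Poisson rate through the exponential inverse link:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up distance matrix (thousands of km) between 4 island-societies;
# the real example uses the 10 societies of the Kline dataset
D = np.array([[0.0, 0.5, 1.2, 2.0],
              [0.5, 0.0, 0.9, 1.6],
              [1.2, 0.9, 0.0, 0.8],
              [2.0, 1.6, 0.8, 0.0]])

# distance-based covariance: nearby islands get correlated effects
eta, rho = 1.0, 1.0
K = eta**2 * np.exp(-(rho * D)**2) + 1e-6 * np.eye(4)
f = rng.multivariate_normal(np.zeros(4), K)

# the Poisson rate combines log-population and the spatial effect
# through the exponential inverse link (parameter values are made up)
log_pop = np.array([7.0, 7.5, 8.0, 9.0])
alpha, beta = 1.0, 0.25
lam = np.exp(alpha + beta * log_pop + f)
tools = rng.poisson(lam)
```

Societies that are geographically close share correlated deviations f, which is exactly how contact rates enter the model.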

Gaussian process classification

Gaussian processes are not restricted to regression. We can also use them for classification. As we saw in Chapter 4, Generalizing Linear Models, we turn a linear model into a suitable model to classify data by using a Bernoulli likelihood with a logistic inverse link function (and then applying a boundary decision rule to separate classes). We will try to recapitulate model_0 from Chapter 4, Generalizing Linear Models, for the iris dataset, this time using a GP instead of a linear model.

Let's invite the iris dataset to the stage one more time:

import pandas as pd

iris = pd.read_csv('../data/iris.csv')
iris.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

We are going to begin with the simplest...
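Before building the model, the construction itself can be sketched in NumPy (this is an illustration of the idea, not the book's PyMC3 model): draw a latent function from a GP prior and pass it through the logistic inverse link to obtain valid probabilities for a Bernoulli likelihood:

```python
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 50)

d = x[:, None] - x[None, :]
K = np.exp(-0.5 * d**2) + 1e-8 * np.eye(len(x))

# latent function drawn from the GP prior
f = rng.multivariate_normal(np.zeros(len(x)), K)

# the logistic inverse link maps f onto probabilities in (0, 1)
p = logistic(f)

# Bernoulli observations given those probabilities
y = rng.binomial(1, p)
```

In the actual model we will infer f from the observed classes rather than sample it from the prior, but the likelihood and inverse link are assembled in exactly this way.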

Cox processes

Let's now return to the example of modeling count data. We will see two examples: one with a time-varying rate and one with a 2D spatially varying rate. In order to do this, we will use a Poisson likelihood, and the rate will be modeled using a Gaussian process. Because the rate of the Poisson distribution is limited to positive values, we will use an exponential as the inverse link function, as we did for the zero-inflated Poisson regression from Chapter 4, Generalizing Linear Models.

In the literature, the variable rate also appears with the name intensity; thus, this type of problem is known as intensity estimation. Also, this type of model is often referred to as a Cox model. A Cox model is a type of Poisson process, where the rate is itself a stochastic process. Just as a Gaussian process is a collection of random variables, where every finite collection...
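Generatively, a (discretized) Cox process can be sketched in a few lines of NumPy (an illustration with made-up parameters, not the book's model): the log-rate is a draw from a GP prior, the exponential keeps the rate positive, and the counts in each bin are Poisson:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0, 10, 100)  # bin centers on the time axis

d = t[:, None] - t[None, :]
K = np.exp(-0.5 * (d / 2.0)**2) + 1e-8 * np.eye(len(t))

# the latent log-rate is a draw from a GP prior; exp() keeps the
# rate positive, as with the inverse link for Poisson regression
f = rng.multivariate_normal(np.zeros(len(t)), K)
rate = np.exp(f)

# counts per bin follow a Poisson with the local (stochastic) rate
counts = rng.poisson(rate * (t[1] - t[0]))
```

Because the rate itself is random, the counts are "doubly stochastic", which is the defining feature of a Cox process.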

Summary

A Gaussian process is a generalization of the multivariate Gaussian distribution to infinitely many dimensions and is fully specified by a mean function and a covariance function. Since we can conceptually think of functions as infinitely long vectors, we can use Gaussian processes as priors for functions. In practice, we do not work with infinite objects but with multivariate Gaussian distributions with as many dimensions as data points. To define their corresponding covariance function, we used properly parameterized kernels; and by learning about those hyperparameters, we ended up learning about arbitrarily complex functions.

In this chapter, we have given a short introduction to GPs. We have covered regression, semi-parametric models (the islands example), combining two or more kernels to better describe the unknown function, and how a GP can be used for classification...

Exercises

  1. For the example in the Covariance functions and kernels section, make sure you understand the relationship between the input data and the generated covariance matrix. Try using other inputs, such as data = np.random.normal(size=4).
  2. Rerun the code that generates Figure 7.3 and increase the number of samples obtained from the GP prior to around 200. In the original figure, the number of samples is 2. What is the range of the generated values?
  3. For the plot generated in the previous exercise, compute the standard deviation of the values of f(x) at each point. Do this in the following three ways:
    • Visually, just by observing the plots
    • Directly from the values generated from stats.multivariate_normal.rvs
    • By inspecting the covariance matrix (if you have doubts, go back to exercise 1)

Did the values you get from these three methods agree?

  4. Re-run the model model_reg and get new plots but...