Enhancing Deep Learning with Bayesian Inference: Create more powerful, robust deep learning systems with Bayesian deep learning in Python

By Matt Benatan, Jochem Gietema, Marian Schneider

Book · Jun 2023 · 386 pages · 1st Edition


Chapter 2
Fundamentals of Bayesian Inference

Before we get into Bayesian inference with Deep Neural Networks (DNNs), we should take some time to understand the fundamentals. In this chapter, we’ll do just that: exploring the core concepts of Bayesian modeling, and taking a look at some of the popular methods used for Bayesian inference. By the end of this chapter, you should have a good understanding of why we use probabilistic modeling, and what kinds of properties we look for in well-principled – or well-conditioned – methods.

This content will be covered in the following sections:

  • Refreshing our knowledge of Bayesian modeling

  • Bayesian inference via sampling

  • Exploring the Gaussian process

2.1 Refreshing our knowledge of Bayesian modeling

Bayesian modeling is concerned with understanding the probability of an event occurring given some prior assumptions and some observations. The prior assumptions describe our initial beliefs, or hypothesis, about the event. For example, let’s say we have two six-sided dice, and we want to predict the probability that the sum of the two dice is 5. First, we need to understand how many possible outcomes there are. Because each die has 6 sides, the number of possible outcomes is 6 × 6 = 36. To work out the probability of rolling a sum of 5, we need to work out how many combinations of values will sum to 5:


Figure 2.1: Illustration of all values summing to five when rolling two six-sided dice

As we can see here, there are 4 combinations that add up to 5, thus the probability of the two dice producing a sum of 5 is 4/36, or 1/9. We call this initial belief the prior. Now, what happens if we incorporate information from an observation? Let’s say we know what the value of one of the dice will be – say, 3. This shrinks our number of possible values down to 6, as we only have the remaining die to roll, and for the result to be 5, we’d need this value to be 2.


Figure 2.2: Illustration of remaining value, which sums to five after rolling the first die

Because we assume our die is fair, the probability of the sum of the dice being 5 is now 1/6. This probability, called the posterior, is obtained using information from our observation. At the core of Bayesian statistics is Bayes’ rule (hence "Bayesian"), which we use to determine the posterior probability given some prior knowledge. Bayes’ rule is defined as:

P(A|B) = P(B|A) × P(A) / P(B)

Where we can define P(A|B) as P(d1 + d2 = 5|d1 = 3), where d1 and d2 represent dice 1 and 2 respectively. We can see this in action using our previous example. Starting with the likelihood, that is, the term on the left of our numerator, we see that:

P(B|A) = P(d1 = 3|d1 + d2 = 5) = 1/4

We can verify this by looking at our grid. Moving to the second part of the numerator – the prior – we see that:

P(A) = P(d1 + d2 = 5) = 4/36 = 1/9

On the denominator, we have our normalization constant (also referred to as the marginal likelihood), which is simply:

P(B) = P(d1 = 3) = 1/6

Putting this all together using Bayes’ theorem, we have:

P(d1 + d2 = 5|d1 = 3) = (1/4 × 1/9) / (1/6) = 1/6
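We can sanity-check this arithmetic by brute-force enumeration of the 36 outcomes; a minimal sketch:

```python
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two fair six-sided dice
outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

# Prior: P(d1 + d2 = 5)
prior = Fraction(sum(1 for d1, d2 in outcomes if d1 + d2 == 5), len(outcomes))

# Likelihood: P(d1 = 3 | d1 + d2 = 5)
sums_to_5 = [(d1, d2) for d1, d2 in outcomes if d1 + d2 == 5]
likelihood = Fraction(sum(1 for d1, _ in sums_to_5 if d1 == 3), len(sums_to_5))

# Evidence (marginal likelihood): P(d1 = 3)
evidence = Fraction(sum(1 for d1, _ in outcomes if d1 == 3), len(outcomes))

# Bayes' rule
posterior = likelihood * prior / evidence
print(prior, likelihood, evidence, posterior)  # 1/9 1/4 1/6 1/6
```

Using exact fractions avoids any floating-point rounding in the check.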

What we have here is the probability of the outcome being 5 if we know one die’s value. However, in this book, we’ll often be referring to uncertainties rather than probabilities – and learning methods to obtain uncertainty estimates with DNNs. These methods belong to the broader field of uncertainty quantification, and aim to quantify the uncertainty in the predictions from an ML model. That is, we want to predict P(ŷ|𝜃), where ŷ is a prediction from a model, and 𝜃 represents the parameters of the model.

As we know from fundamental probability theory, probabilities are bound between 0 and 1. The closer we are to 1, the more likely – or probable – the event is. We can view our uncertainty as our probability subtracted from 1. In the context of the example here, the probability of the sum being 5 is P(d1 + d2 = 5|d1 = 3) = 1/6 ≈ 0.167. So, our uncertainty is simply 1 − 1/6 = 5/6 ≈ 0.833, meaning that there’s a > 80% chance that the outcome will not be 5. As we proceed through the book, we’ll learn about different sources of uncertainty, and how uncertainties can help us to develop more robust deep learning systems.

Let’s continue using our dice example to build a better understanding of model uncertainty estimates. Many common machine learning models work on the basis of maximum likelihood estimation (MLE). That is, they look to predict the value that is most likely: tuning their parameters during training to produce the most likely outcome ŷ given some input x. As a simple illustration, let’s say we want to predict the value of d1 + d2 given a value of d1. We can simply define this as the expectation of d1 + d2 conditioned on d1:

ŷ = 𝔼[d1 + d2|d1]

That is, the mean of the possible values of d1 + d2.

Setting d1 = 3, our possible values for d1 + d2 are {4,5,6,7,8,9} (as illustrated in Figure 2.2), making our mean:

μ = (1/6) ∑ᵢ₌₁⁶ aᵢ = (4 + 5 + 6 + 7 + 8 + 9)/6 = 6.5

This is the value we’d get from a simple linear model, such as a linear regression defined by:

ŷ = βx + ξ

In this case, the values of our coefficient and bias are β = 1, ξ = 3.5. If we change our value of d1 to 1, we see that this mean changes to 4.5 – the mean of the set of possible values of d1 + d2|d1 = 1, in other words {2,3,4,5,6,7}. This perspective on our model predictions is important: while this example is very straightforward, the same principle applies to far more sophisticated models and data. The value we typically see from ML models is the expectation, otherwise known as the mean. As you are likely aware, the mean is often referred to as the first statistical moment – with the second statistical moment being the variance, and the variance allows us to quantify uncertainty.
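To make this concrete, here’s a small sketch (assuming a fair d2) showing that the conditional expectation matches the linear model’s prediction:

```python
import numpy as np

# Expectation of d1 + d2 given d1, for a fair six-sided d2:
# the mean over the six equally likely outcomes of d2.
def expected_sum(d1):
    d2_values = np.arange(1, 7)
    return float(np.mean(d1 + d2_values))

# The equivalent linear model y = beta * x + xi, with beta = 1, xi = 3.5
beta, xi = 1.0, 3.5
print(expected_sum(3), beta * 3 + xi)  # 6.5 6.5
print(expected_sum(1), beta * 1 + xi)  # 4.5 4.5
```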

The variance for our simple example is defined as follows:

σ² = (1/n) ∑ᵢ₌₁ⁿ (aᵢ − μ)², with n = 6

These statistical moments should be familiar to you, as should the fact that the variance here is represented as the square of the standard deviation, σ. For our example here, for which we assume d2 is a fair die, the variance will always be constant: σ2 = 2.917. That is to say, given any value of d1, we know that values of d2 are all equally likely, so the uncertainty does not change. But what if we have an unfair die d2, which has a 50% chance of landing on a 6, and a 10% chance of landing on each other number? This changes both our mean and our variance. We can see this by looking at how we would represent this as a set of possible values (in other words, a perfect sample of the die) – the set of possible values for d1 + d2|d1 = 1 now becomes {2,3,4,5,6,7,7,7,7,7}. Our new model will now have a bias of ξ = 4.5, making our prediction:

ˆy = 1 × 1 + 4.5 = 5.5

We see that the expectation has increased due to the change in the underlying probability of the values of die d2. However, the important difference here is in the change in the variance value:

σ² = (1/n) ∑ᵢ₌₁¹⁰ (aᵢ − μ)² = 32.5/10 = 3.25

Our variance has increased. As variance essentially gives us the average of the distance of each possible value from the mean, this shouldn’t be surprising: given the weighted die, it’s more likely that the outcome will be distant from the mean than with an unweighted die, and thus our variance increases. To summarize, in terms of uncertainty: the greater the likelihood that the outcome will be further from the mean, the greater the uncertainty.
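A quick sketch confirms both variance values (using the population variance, since each set enumerates the full outcome space):

```python
import numpy as np

# d1 + d2 | d1 = 3 with a fair d2: six equally likely outcomes
fair = np.array([4, 5, 6, 7, 8, 9])
# d1 + d2 | d1 = 1 with the weighted d2 (6 has probability 0.5),
# written as a "perfect sample" of ten equally likely outcomes
weighted = np.array([2, 3, 4, 5, 6, 7, 7, 7, 7, 7])

# np.var computes the population variance (divides by n)
print(round(float(np.var(fair)), 3))  # 2.917
print(float(np.mean(weighted)), float(np.var(weighted)))  # 5.5 3.25
```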

This has important implications for how we interpret predictions from machine learning models (and statistical models more generally). If our predictions are an approximation of the mean, and our uncertainty quantifies how likely it is for an outcome to be distant from the mean, then our uncertainty tells us how likely it is that our model prediction is incorrect. Thus, model uncertainties allow us to decide when to trust the predictions, and when we should be more cautious.

The examples given here are very basic, but should help to give you an idea of what we’re looking to achieve with model uncertainty quantification. We will continue to explore these concepts as we learn about some of the benchmark methods for Bayesian inference, learning how these concepts apply to more complex, real-world problems. We’ll start with perhaps the most fundamental method of Bayesian inference: sampling.

2.2 Bayesian inference via sampling

In practical applications, it’s not possible to know exactly what a given outcome would be, and, similarly, it’s not possible to observe all possible outcomes. In these cases, we need to make a best estimate based on the evidence we have. The evidence is formed of samples – observations of possible outcomes. The aim of ML, broadly speaking, is to learn models that generalize well from a subset of data. The aim of Bayesian ML is to do so while also providing an estimate of the uncertainty associated with the model’s predictions. In this section, we’ll learn about how we can use sampling to do this, and will also learn why sampling may not be the most sensible approach.

2.2.1 Approximating distributions

At the most fundamental level, sampling is about approximating distributions. Say we want to know the distribution of the height of people in New York. We could go out and measure everyone, but that would involve measuring the height of 8.4 million people! While this would give us our most accurate answer, it’s also a deeply impractical approach.

Instead, we can sample from the population. This gives us a basic example of Monte Carlo sampling, where we use random sampling to provide data from which we can approximate a distribution. For example, given a database of New York residents, we could select – at random – a sub-population of residents, and use this to approximate the height distribution of all residents. With random sampling – and any sampling, for that matter – the accuracy of the approximation is dependent on the size of the sub-population. What we’re looking to achieve is a statistically significant sub-sample, such that we can be confident in our approximation.

To get a better impression of this, we’ll simulate the problem by generating 100,000 data points from a truncated normal distribution, to approximate the kind of height distribution we may see for a population of 100,000 people. Say we draw 10 samples, at random, from our population. Here’s what our distribution would look like (on the right) compared with the true distribution (on the left):


Figure 2.3: Plot of true distribution (left) versus sample distribution (right)

As we can see, this isn’t a great representation of the true distribution: what we see here is closer to a triangular distribution than a truncated normal. If we were to infer something about the population’s height based on this distribution alone, we’d arrive at a number of inaccurate conclusions, such as missing the truncation above 200 cm, and the tail on the left of the distribution.

We can get a better impression by increasing our sample size – let’s try drawing 100 samples:


Figure 2.4: Plot of true distribution (left) versus sample distribution (right)

Things are starting to look better: we’re starting to see some of the tail on the left, as well as the truncation toward 200 cm. However, this sample over-represents some regions at the expense of others, leading to misrepresentation: our mean has been pulled down, and we’re seeing two distinct peaks, rather than the single peak we see in the true distribution. Let’s increase our sample size by a further order of magnitude, scaling up to 1,000 samples:


Figure 2.5: Plot of true distribution (left) versus sample distribution (right)

This is looking much better – with a sample set of only one hundredth the size of our true population, we now see a distribution that closely matches our true distribution. This example demonstrates how, through random sampling, we can approximate the true distribution using a significantly smaller pool of observations. But that pool still has to have enough information to allow us to arrive at a good approximation of the true distribution: too few samples and our subset will be statistically insufficient, leading to poor approximation of the underlying distribution.
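The experiment above can be sketched as follows; the population parameters (mean 170 cm, standard deviation 10 cm, truncation at 205 cm) are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal(n, mu=170.0, sigma=10.0, upper=205.0):
    # Rejection sampling: draw extra values and discard any above the cutoff
    draws = rng.normal(mu, sigma, size=2 * n)
    return draws[draws <= upper][:n]

# Simulated "true" population of 100,000 heights
population = truncated_normal(100_000)

# Approximations improve as the random sub-sample grows
for n in (10, 100, 1000):
    subset = rng.choice(population, size=n, replace=False)
    print(n, round(float(subset.mean()), 1), round(float(subset.std()), 1))
```

Plotting a histogram of each subset against the full population reproduces the progression shown in Figures 2.3 to 2.5.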

But simple random sampling isn’t the most practical method for approximating distributions. To approximate distributions more efficiently, we turn to probabilistic inference. Given a model, probabilistic inference provides a way to find the model parameters that best describe our data. To do so, we need to first define the type of model – this is our prior. For our example, we’ll use a truncated Gaussian: the idea being that, using our intuition, it’s reasonable to assume people’s height follows a normal distribution, but that very few people are taller than about 205 cm (roughly 6’9”). So, we’ll specify a truncated Gaussian distribution with an upper limit of 205 cm. As it’s a Gaussian distribution, in other words, 𝒩(μ,σ), our model parameters are 𝜃 = {μ,σ} – with the additional constraint that our distribution has an upper limit of b = 205.

This brings us to a fundamental class of algorithms: Markov Chain Monte Carlo, or MCMC methods. Like simple random sampling, these allow us to build a picture of the true underlying distribution, but they do so sequentially, whereby each sample is dependent on the sample before it. This sequential dependence is known as the Markov property, thus the Markov chain component of the name. This sequential approach accounts for the probabilistic dependence between samples and allows us to better approximate the probability density.

MCMC achieves this through sequential random sampling. Just as with the random sampling we’re familiar with, MCMC randomly samples from our distribution. But, unlike simple random sampling, MCMC considers pairs of samples: the previous sample xt−1 and a new candidate sample x′. For each pair, we have some criterion that specifies whether or not we keep the candidate (this varies depending on the particular flavor of MCMC). If the candidate meets this criterion – say, if x′ is ”preferential to” our previous value xt−1 – then it is added to the chain and becomes xt for the next round. If it doesn’t, we carry the previous value forward instead. We repeat this over a (usually large) number of iterations, and in the end we should arrive at a good approximation of our distribution.

The result is an efficient sampling method that is able to closely approximate the true parameters of our distribution. Let’s see how this applies to our height distribution example. Using MCMC with just 10 samples, we arrive at the following approximation:


Figure 2.6: Plot of true distribution (left) versus approximate distribution via MCMC (right)

Not bad for ten samples – certainly far better than the triangular distribution we arrived at with simple random sampling. Let’s see how we do with 100:


Figure 2.7: Plot of true distribution (left) versus approximate distribution via MCMC (right)

This is looking pretty excellent – in fact, we’re able to obtain a better approximation of our distribution with 100 MCMC samples than we are with 1,000 simple random samples. If we continue to larger numbers of samples, we’ll arrive at closer and closer approximations of our true distribution. But our simple example doesn’t fully capture the power of MCMC: MCMC’s true advantage comes from being able to approximate high-dimensional distributions, and has made it an invaluable technique for approximating intractable high-dimensional integrals in a variety of domains.

In this book, we’re interested in how we can estimate the probability distribution of the parameters of machine learning models – this allows us to estimate the uncertainty associated with our predictions. In the next section, we’ll take a look at how we do this practically by applying sampling to Bayesian linear regression.

2.2.2 Implementing probabilistic inference with Bayesian linear regression

In typical linear regression, we want to predict some output ŷ from some input x using a linear function f(x), such that ŷ = βx + ξ. With Bayesian linear regression, we do this probabilistically, introducing another parameter, σ2, such that our regression equation becomes:

ŷ ∼ 𝒩(βx + ξ, σ²)

That is, ŷ follows a Gaussian distribution.

Here, we see our familiar bias term ξ and coefficient β, and introduce a variance parameter σ². To fit our model, we need to define a prior over these parameters – just as we did for our MCMC example in the last section. We’ll define these priors as:

ξ ∼ 𝒩(0,1)
β ∼ 𝒩(0,1)
σ² ∼ |𝒩(0,1)|

Note that equation 2.15 denotes a half-normal distribution (the positive half of a zero-mean Gaussian, as standard deviation cannot be negative). We’ll refer to our model parameters as 𝜃 = {β, ξ, σ²}, and we’ll use sampling to find the parameters that maximize the conditional probability of our parameters given our data D: P(𝜃|D).

There are a variety of MCMC sampling approaches we could use to find our model parameters. A common approach is to use the Metropolis-Hastings algorithm. Metropolis-Hastings is particularly useful when we can only evaluate a function that is proportional to, but not exactly equal to, our true distribution – as is the case when the normalizing constant is intractable. It draws candidate parameters from a proposal distribution, Q(𝜃′|𝜃), and accepts or rejects them based on a ratio of densities. Because the ratio is all that matters, any unknown normalization constant cancels: if some value x1 is twice as likely as some other value x2 under our unnormalized function, the same holds under the true distribution – and that, proportionally, is all we need to know.

Here are the key steps of Metropolis-Hastings for our Bayesian linear regression.

First, we initialize with an arbitrary point 𝜃 sampled from our parameter space, according to the priors for each of our parameters. Using a Gaussian distribution centered on our current parameters 𝜃, we select a candidate point 𝜃′. Then, for each iteration t ∈ {1, …, T}, we do the following:

  1. Calculate the acceptance criterion, defined as:

     α = P(𝜃′|D) / P(𝜃|D)
  2. Generate a random number 𝜖 from a uniform distribution over [0,1]. If 𝜖 ≤ α, accept the new candidate parameters – adding these to the chain and assigning 𝜃 = 𝜃′. If 𝜖 > α, keep the current 𝜃 and draw a new candidate.

This acceptance criterion means that, if our new set of parameters has a higher likelihood than our last set, we’ll have α > 1, in which case 𝜖 ≤ α is guaranteed. This means that, when we sample parameters that are more likely given our data, we’ll always accept them. If, on the other hand, α < 1, there’s a chance we’ll reject the parameters, but we may also accept them – allowing us to explore regions of lower likelihood.

These mechanics of Metropolis-Hastings result in samples that can be used to compute high-quality approximations of our posterior distribution. Practically, Metropolis-Hastings (and MCMC methods more generally) requires a burn-in phase – an initial phase of sampling used to escape regions of low density, which are typically encountered given the arbitrary initialization.
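The steps above can be sketched as a minimal Metropolis-Hastings sampler for Bayesian linear regression. The synthetic data, proposal scale, and the log-σ parameterization of the variance prior are all illustrative assumptions, not the book’s exact setup:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative synthetic data: y = 2x + 5 + noise
x = rng.uniform(-3, 3, size=50)
y = 2.0 * x + 5.0 + rng.normal(0.0, 1.0, size=50)

def log_posterior(theta):
    beta, xi, log_sigma = theta
    sigma = np.exp(log_sigma)  # sampling log(sigma) keeps sigma positive
    resid = y - (beta * x + xi)
    # Gaussian log-likelihood of the data under y ~ N(beta*x + xi, sigma^2)
    log_lik = -0.5 * np.sum(resid**2 / sigma**2 + np.log(2 * np.pi * sigma**2))
    # N(0, 1) priors on beta and xi; a rough half-normal-style prior on sigma
    log_prior = -0.5 * (beta**2 + xi**2 + sigma**2)
    return log_lik + log_prior

theta = np.zeros(3)  # arbitrary initialization
log_p = log_posterior(theta)
chain = []
for _ in range(10_000):
    proposal = theta + rng.normal(0.0, 0.15, size=3)  # symmetric Gaussian proposal
    log_p_new = log_posterior(proposal)
    # Accept with probability min(1, P(proposal | D) / P(theta | D))
    if np.log(rng.uniform()) < log_p_new - log_p:
        theta, log_p = proposal, log_p_new
    chain.append(theta)

samples = np.array(chain[2000:])  # discard the burn-in phase
beta_hat, xi_hat, log_sigma_hat = samples.mean(axis=0)
print(beta_hat, xi_hat, np.exp(log_sigma_hat))
```

Working in log space avoids numerical underflow, and the log-ratio comparison is equivalent to the acceptance test in step 2.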

Let’s apply this to a simple problem: we’ll generate some data for the function y = x² + 5 + η, where η is a noise parameter distributed according to η ∼ 𝒩(0,5). Using Metropolis-Hastings to fit our Bayesian linear regressor, we get the following fit using the points sampled from our function (represented by the crosses):


Figure 2.8: Bayesian linear regression on generated data with low variance

We see that our model fits the data in the same way we would expect for standard linear regression. However, unlike standard linear regression, our model produces predictive uncertainty: this is represented by the shaded region. This predictive uncertainty gives an impression of how much our underlying data varies; this makes this model much more useful than a standard linear regression, as now we can get an impression of the spread of our data, as well as the general trend. We can see how this varies if we generate new data and fit again, this time increasing the spread of the data by modifying our noise distribution to η ∼ 𝒩(0,20):


Figure 2.9: Bayesian linear regression on generated data with high variance

We see that our predictive uncertainty has increased proportionally to the spread of the data. This is an important property in uncertainty-aware methods: when we have small uncertainty, we know our prediction fits the data well, whereas when we have large uncertainty, we know to treat our prediction with caution, as it indicates the model isn’t fitting this region particularly well. We’ll see a better example of this in the next section, which will go on to demonstrate how regions of more or less data contribute to our model uncertainty estimates.

Here, we see that our predictions fit our data pretty well. In addition, we see that σ² varies according to the availability of data in different regions. What we’re seeing here is a great example of a very important concept: well-calibrated uncertainty, also termed high-quality uncertainty. This refers to the fact that, in regions where our predictions are inaccurate, our uncertainty is also high. Our uncertainty estimates are poorly calibrated if we’re very confident in regions with inaccurate predictions, or very uncertain in regions with accurate predictions. As it’s well calibrated, sampling is often used as a benchmark for uncertainty quantification.

Unfortunately, while sampling is effective for many applications, the need to obtain many samples for each parameter means that it quickly becomes computationally prohibitive in high-dimensional parameter spaces. For example, if we wanted to start sampling parameters for complex, non-linear relationships (such as sampling the weights of a neural network), sampling would no longer be practical. Despite this, it’s still useful in some cases, and later we’ll see how various BDL methods make use of sampling.

In the next section, we’ll explore the Gaussian process – another fundamental method for Bayesian inference, and a method that does not suffer from the same computational overheads as sampling.

2.3 Exploring the Gaussian process

As we’ve seen in the previous section, sampling quickly becomes prohibitively expensive. To address this, we can use ML models specifically designed to produce uncertainty estimates – the gold standard of which is the Gaussian process.

The Gaussian process, or GP, has become a staple probabilistic ML model, seeing use in a broad variety of applications from pharmacology through to robotics. Its success is largely down to its ability to produce high-quality uncertainty estimates over its predictions in a well-principled fashion. So, what do we mean by a Gaussian process?

In essence, a GP is a distribution over functions. To understand what we mean by this, let’s take a typical ML use case. We want to learn some function f(x), which maps a series of inputs x onto a series of outputs y, such that we can approximate our output via y = f(x). Before we see any data, we know nothing about our underlying function; there is an infinite number of possible functions this could be:


Figure 2.10: Illustration of space of possible functions before seeing data

Here, the black line is the true function we wish to learn, while the dotted lines are the possible functions given the data (in this case, no data). Once we observe some data, we see that the number of possible functions becomes more constrained, as we see here:


Figure 2.11: Illustration of space of possible functions after seeing some data

Here, we see that our possible functions all pass through our observed data points, but outside of those data points, our functions take on a range of very different values. In a simple linear model, we don’t care about these deviations in possible values: we’re happy to interpolate from one data point to another, as we see in Figure 2.12:


Figure 2.12: Illustration of linearly interpolating through our observations

But this interpolation can lead to wildly inaccurate predictions, and has no way of accounting for the degree of uncertainty associated with our model predictions. The deviations that we see here in the regions without data points are exactly what we want to capture with our GP. When there are a variety of possible values our function can take, then there is uncertainty – and through capturing the degree of uncertainty, we are able to estimate what the possible variation in these regions may be.

Formally, a GP can be defined as a function:

f(x) ∼ GP(m(x), k(x,x′))

Here, m(x) is simply the mean of our possible function values for a given point x:

m(x) = 𝔼[f(x)]

The next term, k(x,x′), is a covariance function, or kernel. This is a fundamental component of the GP, as it defines the way we model the relationship between different points in our data. GPs use the mean and covariance functions to model the space of possible functions, and thus to produce predictions as well as their associated uncertainties. Now that we’ve introduced some of the high-level concepts, let’s dig a little deeper and understand exactly how they model the space of possible functions, and thus estimate uncertainty. To do this, we need to understand GP priors.

2.3.1 Defining our prior beliefs with kernels

GP kernels describe the prior beliefs we have about our data, and so you’ll often see them referred to as GP priors. In the same way that the prior in equation 2.3 tells us something about the probability of the outcome of our two dice rolls, the GP prior tells us something important about the relationship we expect from our data.

While there are advanced methods for inferring a prior from our data, they are beyond the scope of this book. We will instead focus on more traditional uses of GPs, for which we select a prior using our knowledge of the data we’re working with.

In the literature and any implementations you encounter, you’ll see that the GP prior is often referred to as the kernel or covariance function (just as we have here). These three terms are all interchangeable, but for consistency with other work, we will henceforth refer to this as the kernel. Kernels simply provide a means of calculating a distance between two data points, and are expressed as k(x,x′), where x and x′ are data points, and k(⋅) represents the kernel function. While the kernel can take on many forms, there are a small number of fundamental kernels that are used in a large proportion of GP applications.

Perhaps the most commonly encountered kernel is the squared exponential or radial basis function (RBF) kernel. This kernel takes the form:

k(x,x′) = σ² exp(−(x − x′)² / 2l²)

This introduces us to a couple of common kernel parameters: l and σ². The output variance parameter σ² is simply a scaling factor, used to control the distance of the function from its mean. The length scale parameter l controls the smoothness of the function – in other words, how much your function is expected to vary across particular dimensions. This parameter can either be a scalar that is applied to all input dimensions, or a vector with a different scalar value for each input dimension. The latter is often achieved using Automatic Relevance Determination, or ARD, which identifies the relevant values in the input space.
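As a sketch of how this kernel behaves, here’s a minimal implementation (with illustrative parameter values); drawing functions from the corresponding GP prior shows the effect of the length scale:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, length_scale=1.0):
    # Squared exponential: k(x, x') = sigma^2 * exp(-(x - x')^2 / (2 l^2))
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return sigma**2 * np.exp(-sq_dist / (2 * length_scale**2))

x = np.linspace(-5, 5, 100)
rng = np.random.default_rng(0)
for l in (0.5, 2.0):
    K = rbf_kernel(x, x, length_scale=l)
    # Each draw from N(0, K) is one function over x; smaller l gives wigglier draws
    sample = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))
```

Plotting `sample` for each length scale makes the smoothness effect visible; the small diagonal jitter is a standard trick to keep the covariance matrix numerically positive definite.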

GPs make predictions via a covariance matrix based on the kernel – essentially comparing a new data point to previously observed data points. However, just as with all ML models, GPs need to be trained, and this is where the length scale comes in. The length scale forms the parameters of our GP, and through the training process it learns the optimal value(s) for the length scale(s). This is typically done using a nonlinear optimizer, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimizer. Many optimizers can be used, including optimizers you may be familiar with for deep learning, such as stochastic gradient descent and its variants.

Let’s take a look at how different kernels affect GP predictions. We’ll start with a straightforward example – a simple sine wave:


Figure 2.13: Plot of sine wave with four sampled points

We can see the function illustrated here, as well as some points sampled from this function. Now, let’s fit a GP with a periodic kernel to the data. The periodic kernel is defined as:

k_per(x, x′) = σ² exp(−2 sin²(π|x − x′| / p) / l²)

Here, we see a new parameter: p. This is simply the period of the periodic function. Setting p = 1 and applying a GP with a periodic kernel to the preceding example, we get the following:


Figure 2.14: Plot of posterior predictions from a periodic kernel with p = 1

This looks pretty noisy, but you should be able to see that there is clear periodicity in the functions produced by the posterior. It’s noisy for a couple of reasons: a lack of data, and a poor prior. If we’re limited on data, we can try to fix the problem by improving our prior. In this case, we can use our knowledge of the periodicity of the function to improve our prior by setting p = 6:


Figure 2.15: Plot of posterior predictions from a periodic kernel with p = 6

We see that this fits the data pretty well: we’re still uncertain in regions for which we have little data, but the periodicity of our posterior now looks sensible. This is possible because we’re using an informative prior; that is, a prior that incorporates information that describes the data well. This prior is composed of two key components:

  • Our periodic kernel

  • Our knowledge about the periodicity of the function
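The periodic kernel can be sketched in the same way as before (parameter values are illustrative, with p = 6 as in the example above):

```python
import numpy as np

def periodic_kernel(x1, x2, sigma=1.0, length_scale=1.0, p=6.0):
    # k_per(x, x') = sigma^2 * exp(-2 sin^2(pi |x - x'| / p) / l^2)
    dist = np.abs(x1[:, None] - x2[None, :])
    return sigma**2 * np.exp(-2 * np.sin(np.pi * dist / p) ** 2 / length_scale**2)

x = np.linspace(0, 12, 121)  # step of 0.1
K = periodic_kernel(x, x, p=6.0)
# Points exactly one period apart are maximally correlated:
print(K[60, 0])  # x = 6 vs x = 0 -> approximately sigma^2 = 1.0
```

This maximal correlation at multiples of p is exactly what encodes our belief that the function repeats.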

We can see how important this is if we modify our GP to use an RBF kernel:


Figure 2.16: Plot of posterior predictions from an RBF kernel

With an RBF kernel, we see that things are looking pretty chaotic again: because we have limited data and a poor prior, we’re unable to appropriately constrain the space of possible functions to fit our true function. In the ideal case, we’d fix this by using a more appropriate prior, as we saw in Figure 2.15 – but this isn’t always possible. Another solution is to sample more data. Sticking with our RBF kernel, we sample 10 data points from our function and re-train our GP:


Figure 2.17: Plot of posterior predictions from an RBF kernel, trained on 10 observations

This is looking much better – but what if we have more data and an informative prior?


Figure 2.18: Plot of posterior predictions from a periodic kernel with p = 6, trained on 10 observations

The posterior now fits our true function very closely. Because we don’t have infinite data, there are still some areas of uncertainty, but the uncertainty is relatively small.
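Before moving on, it's worth contrasting the two priors at the kernel level. Here is a similarly minimal sketch of the RBF (squared-exponential) kernel; again, the function name and length scale are illustrative choices of ours. Unlike the periodic kernel, its correlations decay monotonically with distance, so it has no way to encode that points one period apart should behave alike:

```python
import numpy as np


def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential covariance: nearby inputs are strongly
    correlated, distant inputs are nearly independent."""
    return np.exp(-0.5 * (x1 - x2) ** 2 / length_scale ** 2)


# Correlation simply falls off with distance -- the kernel cannot
# express that f(0) and f(6) should look similar, so only nearby
# observations constrain the posterior
for distance in [0.0, 1.0, 3.0, 6.0]:
    print(distance, round(rbf_kernel(0.0, distance), 4))
```

This is why the RBF prior leaves the posterior poorly constrained when data is scarce: each observation only informs its immediate neighborhood, and more data is the only remedy.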

Now that we’ve seen some of the core principles in action, let’s return to our example from Figures 2.10-2.12. Here’s a quick reminder of our target function, our posterior samples, and the linear interpolation we saw earlier:


Figure 2.19: Plot illustrating the difference between linear interpolation and the true function

Now that we’ve got some idea of how a GP will affect our predictive posterior, it’s easy to see that linear interpolation falls very short of what we achieve with a GP. To illustrate this more clearly, let’s take a look at what the GP prediction would be for this function given the three samples:


Figure 2.20: Plot illustrating the difference between GP predictions and the true function

Here, the dotted lines are our mean (μ) predictions from the GP, and the shaded area is the uncertainty associated with those predictions – the standard deviation (σ) around the mean. Let's contrast what we see in Figure 2.20 with Figure 2.19. The differences may seem subtle at first, but we can clearly see that this is no longer a straightforward linear interpolation: the predicted values from the GP are being "pulled" toward our actual function values. As with our earlier sine wave examples, the behavior of the GP predictions is affected by two key factors: the prior (or kernel) and the data.

But there’s another crucial detail illustrated in Figure 2.20: the predictive uncertainties from our GP. We see that, unlike many typical ML models, a GP gives us uncertainties associated with its predictions. This means we can make better decisions about what we do with the model’s predictions – having this information will help us to ensure that our systems are more robust. For example, if the uncertainty is too great, we can fall back to a manual system. We can even keep track of data points with high predictive uncertainty so that we can continuously refine our models.
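To make this concrete, here is a sketch of obtaining both μ and σ from a GP using scikit-learn's GaussianProcessRegressor. The target function, observation locations, kernel settings, and uncertainty threshold below are all illustrative choices of ours, not the book's exact example:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Illustrative target function and a handful of observations
f = lambda x: np.sin(x) + 0.5 * np.cos(2 * x)
X_train = np.array([[0.5], [2.5], [4.0]])
y_train = f(X_train).ravel()

# optimizer=None keeps the kernel hyperparameters fixed, so the
# example stays deterministic
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                              optimizer=None)
gp.fit(X_train, y_train)

# Unlike most ML models, predict() can return a standard deviation
# alongside the mean
X_test = np.linspace(0.0, 5.0, 50).reshape(-1, 1)
mu, sigma = gp.predict(X_test, return_std=True)

# A simple robustness policy: flag predictions whose uncertainty
# exceeds a (hypothetical) threshold for manual review
threshold = 0.3
needs_review = sigma > threshold
print(f"{needs_review.sum()} of {len(X_test)} predictions flagged")
```

If too many predictions are flagged, that in itself is a useful signal: it tells us which regions of the input space need more observations.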

We can see how this refinement affects our predictions by adding a few more observations – just as we did in the earlier examples:


Figure 2.21: Plot illustrating the difference between GP predictions and the true function, trained on 5 observations

Figure 2.21 illustrates how our uncertainty changes over regions with different numbers of observations. We see here that between x = 3 and x = 4 our uncertainty is quite high. This makes a lot of sense, as we can also see that our GP’s mean predictions deviate significantly from the true function values. Conversely, if we look at the region between x = 0.5 and x = 2, we can see that our GP’s predictions follow the true function fairly closely, and our model is also more confident about these predictions, as we can see from the smaller interval of uncertainty in this region.

What we’re seeing here is a great example of a very important concept: well calibrated uncertainty – also termed high-quality uncertainty. This refers to the fact that, in regions where our predictions are inaccurate, our uncertainty is also high. Our uncertainty estimates are poorly calibrated if we’re very confident in regions with inaccurate predictions, or very uncertain in regions with accurate predictions.
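One rough, hypothetical way to sanity-check calibration is to measure coverage: for well-calibrated Gaussian uncertainties, roughly 95% of true values should fall within μ ± 2σ. The helper below is our own illustration, not a method from this chapter:

```python
import numpy as np


def coverage(y_true, mu, sigma, k=2.0):
    """Fraction of true values inside mu +/- k * sigma.

    For well-calibrated Gaussian uncertainty, k=2 should give
    roughly 95% coverage; much less suggests overconfidence,
    much more suggests underconfidence.
    """
    return np.mean(np.abs(y_true - mu) <= k * sigma)


# Synthetic check: errors genuinely drawn from N(0, sigma^2)
rng = np.random.default_rng(0)
mu = np.zeros(10_000)
sigma = np.full(10_000, 0.5)
y_true = rng.normal(mu, sigma)
print(round(coverage(y_true, mu, sigma), 2))  # close to 0.95
```

Coverage far below 95% at k = 2 would indicate overconfident uncertainty estimates – exactly the failure mode that makes poorly calibrated models dangerous in practice.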

GPs are what we can term a well principled method – this means that they have solid mathematical foundations, and thus come with strong theoretical guarantees. One of these guarantees is that they are well calibrated, and this is what makes GPs so popular: if we use GPs, we know we can rely on their uncertainty estimates.

Unfortunately, however, GPs are not without their shortcomings – we’ll learn more about these in the following section.

2.3.2 Limitations of Gaussian processes

Given the fact that GPs are well-principled and capable of producing high-quality uncertainty estimates, you'd be forgiven for thinking they're the perfect uncertainty-aware ML model. Unfortunately, GPs struggle in a few key situations:

  • High-dimensional data

  • Large amounts of data

  • Highly complex data

The first two points here are largely down to the inability of GPs to scale well. To understand this, we just need to look at the training and inference procedures for GPs. While it’s beyond the scope of this book to cover this in detail, the key point here is in the matrix operations required for GP training.

During training, it is necessary to invert an n × n matrix, where n is the number of observations in our training data. Because of this, GP training quickly becomes computationally prohibitive. This can be somewhat alleviated through the use of Cholesky decomposition, rather than direct matrix inversion. As well as being more computationally efficient, Cholesky decomposition is also more numerically stable. Unfortunately, Cholesky decomposition also has its weaknesses: computationally, its complexity is O(n³). This means that, as the size of our dataset increases, GP training becomes more and more expensive.
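As a sketch of why Cholesky decomposition is preferred, we can compare solving the GP system Kα = y via an explicit inverse against SciPy's cho_factor and cho_solve. The kernel, data, and jitter value here are illustrative:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=20)
y = np.sin(X)

# RBF covariance matrix, with a small "jitter" term added to the
# diagonal for numerical stability
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
K += 1e-6 * np.eye(len(X))

# Direct inversion: forms K^-1 explicitly -- O(n^3) and fragile
alpha_inv = np.linalg.inv(K) @ y

# Cholesky: factor K = L L^T once, then solve by substitution --
# still O(n^3) overall, but cheaper in practice and more stable
factor = cho_factor(K)
alpha_chol = cho_solve(factor, y)

print(np.allclose(alpha_inv, alpha_chol))  # True
```

At prediction time, the same factorization can be reused for every test point, which is where much of the practical saving comes from.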

But it's not only training that's affected: because we need to compute the covariance between a new data point and all observed data points at inference, GPs have an O(n²) computational complexity at inference.

As well as the computational cost, GPs aren't light in memory: because we need to store our covariance matrix K, GPs have an O(n²) memory complexity. Thus, in the case of large datasets, even if we have the compute resources necessary to train them, it may not be practical to use them in real-world applications due to their memory requirements.

The last point in our list concerns the complexity of data. As you are probably aware – and as we’ll touch on in Chapter 3, Fundamentals of Deep Learning – one of the major advantages of DNNs is their ability to process complex, high-dimensional data through layers of non-linear transformations. While GPs are powerful, they’re also relatively simple models, and they’re not able to learn the kinds of powerful feature representations that are possible with DNNs.

All of these factors mean that, while GPs are an excellent choice for relatively low-dimensional data and reasonably small datasets, they aren’t practical for many of the complex problems we face in ML. And so, we turn to BDL methods: methods that have the flexibility and scalability of deep learning, while also producing model uncertainty estimates.

2.4 Summary

In this chapter, we've covered some of the fundamental concepts and methods related to Bayesian inference. First, we reviewed Bayes' theorem and the fundamentals of probability theory – allowing us to understand the concept of uncertainty, as well as how we apply it to the predictions of ML models. Next, we introduced sampling, and an important class of algorithms: Markov Chain Monte Carlo, or MCMC, methods. Lastly, we covered Gaussian processes, and illustrated the crucial concept of well calibrated uncertainty. These key topics will provide you with the necessary foundation for the content that will follow; however, we encourage you to explore the recommended reading materials for a more comprehensive treatment of the topics introduced in this chapter.

In the next chapter, we will see how DNNs have changed the landscape of machine learning over the last decade, exploring the tremendous advantages offered by deep learning, and the motivation behind the development of BDL methods.

2.5 Further reading

There are a variety of techniques being explored to improve the flexibility and scalability of GPs – such as Deep GPs or Sparse GPs. The following resources explore some of these topics, and also provide a more thorough treatment of the content covered in this chapter:

  • Bayesian Analysis with Python, Martin: this book comprehensively covers core topics in statistical modeling and probabilistic programming, and includes practical walk-throughs of various sampling methods, as well as a good overview of Gaussian processes and a variety of other techniques core to Bayesian analysis.

  • Gaussian Processes for Machine Learning, Rasmussen and Williams: this is often considered the definitive text on Gaussian processes, and provides highly detailed explanations of the theory underlying Gaussian processes. A key text for anyone serious about Bayesian inference.


Key benefits

  • Gain insights into the limitations of typical neural networks
  • Acquire the skill to cultivate neural networks capable of estimating uncertainty
  • Discover how to leverage uncertainty to develop more robust machine learning systems


Deep learning has an increasingly significant impact on our lives, from suggesting content to playing a key role in mission- and safety-critical applications. As the influence of these algorithms grows, so does the concern for the safety and robustness of the systems that rely on them. Simply put, typical deep learning methods do not know when they don't know. The field of Bayesian Deep Learning contains a range of methods for approximate Bayesian inference with deep networks. These methods help to improve the robustness of deep learning systems as they tell us how confident they are in their predictions, allowing us to take more care in how we incorporate model predictions within our applications.

Through this book, you will be introduced to the rapidly growing field of uncertainty-aware deep learning, developing an understanding of the importance of uncertainty estimation in robust machine learning systems. You will learn about a variety of popular Bayesian Deep Learning methods, and how to implement these through practical Python examples covering a range of application scenarios. By the end of the book, you will have a good understanding of Bayesian Deep Learning and its advantages, and you will be able to develop Bayesian Deep Learning models for safer, more robust deep learning systems.

What you will learn

  • Understand the advantages and disadvantages of Bayesian inference and deep learning
  • Understand the fundamentals of Bayesian Neural Networks
  • Understand the differences between key BNN implementations/approximations
  • Understand the advantages of probabilistic DNNs in production contexts
  • Implement a variety of BDL methods in Python code
  • Apply BDL methods to real-world problems
  • Evaluate BDL methods and choose the best method for a given task
  • Deal with unexpected data in real-world deep learning applications

Product Details


Publication date : Jun 30, 2023
Length 386 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781803246888


Table of Contents

  • Preface
  • Chapter 1: Bayesian Inference in the Age of Deep Learning
  • Chapter 2: Fundamentals of Bayesian Inference
  • Chapter 3: Fundamentals of Deep Learning
  • Chapter 4: Introducing Bayesian Deep Learning
  • Chapter 5: Principled Approaches for Bayesian Deep Learning
  • Chapter 6: Using the Standard Toolbox for Bayesian Deep Learning
  • Chapter 7: Practical Considerations for Bayesian Deep Learning
  • Chapter 8: Applying Bayesian Deep Learning
  • Chapter 9: Next Steps in Bayesian Deep Learning
  • Why subscribe?


