You're reading from Bayesian Analysis with Python - Third Edition

Product type: Book
Published in: Jan 2024
Reading level: Expert
Publisher: Packt
ISBN-13: 9781805127161
Edition: 3rd Edition

Author: Osvaldo Martin

Osvaldo Martin is a researcher at CONICET, in Argentina. He has experience using Markov Chain Monte Carlo methods to simulate molecules and perform Bayesian inference. He loves to use Python to solve data analysis problems. He is especially motivated by the development and implementation of software tools for Bayesian statistics and probabilistic modeling. He is an open-source developer, and he contributes to Python libraries such as PyMC, ArviZ, and Bambi, among others. He is interested in all aspects of the Bayesian workflow, including numerical methods for inference, diagnosis of sampling, evaluation and criticism of models, comparison of models, and presentation of results.

Chapter 10
Inference Engines

The first principle is that you must not fool yourself—and you are the easiest person to fool. – Richard Feynman

So far, we have focused on model building, interpretation of results, and criticism of models. We have relied on the magic of the pm.sample function to compute posterior distributions for us. Now we will focus on learning some of the details of the inference engines behind this function.

The whole purpose of probabilistic programming tools such as PyMC is that the user should not have to care about how sampling is carried out. Nevertheless, understanding how we get samples from the posterior is important for a full understanding of the inference process, and it can also give us an idea of when and how these methods fail and what to do about it. If you are not interested in how these methods work, you can skip most of this chapter, but I strongly recommend you at least read the Diagnosing the samples section, as this section...

10.1 Inference engines

While conceptually simple, Bayesian methods can be mathematically and numerically challenging. The main reason is that the marginal likelihood, the denominator in Bayes' theorem, usually takes the form of an integral that is intractable or computationally expensive to solve. For this reason, the posterior is usually estimated numerically using algorithms from the Markov Chain Monte Carlo (MCMC) family. These methods are sometimes called inference engines because, at least in principle, they are capable of approximating the posterior distribution for any probabilistic model. Even though inference does not always work that well in practice, the existence of such methods has motivated the development of probabilistic programming languages such as PyMC.

The goal of probabilistic programming languages is to separate the model-building process from the inference process to facilitate the iterative steps of model-building, evaluation, and model modification/expansion...

10.2 The grid method

The grid method is a simple brute-force approach. Even if you are not able to compute the whole posterior, you may be able to compute the prior and the likelihood point-wise; this is a pretty common scenario, if not the most common one.

Let’s assume we want to compute the posterior for a model with a single parameter. The grid approximation is as follows:

  1. Define a reasonable interval for the parameter (the prior should give you a hint).

  2. Place a grid of points (generally equidistant) on that interval.

  3. For each point in the grid, multiply the likelihood and the prior.

Optionally, we may normalize the computed values, that is, we divide each value in the posterior array by the total area under the curve, ensuring that the total area equals 1.

The following code block implements the grid method for the coin-flipping model:

Code 10.1

def posterior_grid(grid_points=50, heads=6, tails=9): 
    """ 
 ...
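Since the listing is truncated in this excerpt, here is a stand-alone sketch of the same three steps using NumPy and SciPy (an assumption about the details, not the book's exact code):

```python
import numpy as np
from scipy import stats

def posterior_grid(grid_points=50, heads=6, tails=9):
    """Grid approximation of the posterior for a coin-flipping model."""
    # 1. Define a reasonable interval: theta (the probability of heads)
    #    lives in [0, 1]
    grid = np.linspace(0, 1, grid_points)
    # 2. Place an equidistant grid; here we use a flat prior over it
    prior = np.repeat(1 / grid_points, grid_points)
    # 3. Multiply the likelihood and the prior point-wise
    likelihood = stats.binom.pmf(heads, heads + tails, grid)
    posterior = likelihood * prior
    # Optional normalization so the values sum to 1
    posterior /= posterior.sum()
    return grid, posterior

grid, posterior = posterior_grid()
```

With 6 heads in 15 tosses and a flat prior, the grid posterior peaks near θ = 0.4, as expected.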

10.3 Quadratic method

The quadratic approximation, also known as the Laplace method or the normal approximation, consists of approximating the posterior with a Gaussian distribution. To do this, we first find the mode of the posterior distribution; numerically, we can do this with an optimization method. Then we compute the Hessian matrix at the mode. If you are wondering, the Hessian matrix is a square matrix of second-order partial derivatives; for our purposes, we can use it to obtain the standard deviation or, in general, the covariance matrix of the approximating Gaussian.
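The two steps can be sketched for the coin-flipping model with SciPy (my own sketch assuming a flat prior, not Bambi's implementation; the finite-difference Hessian is a simplification):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_approx(heads=6, tails=9):
    """Quadratic (Laplace) approximation for the coin-flipping posterior."""
    # Negative log-posterior; with a flat prior this is just the
    # negative log-likelihood of a binomial (up to a constant)
    def neg_log_post(theta):
        return -(heads * np.log(theta) + tails * np.log(1 - theta))

    # Step 1: find the mode of the posterior by optimization
    res = minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6), method="bounded")
    mode = res.x
    # Step 2: the Hessian (here 1x1) of the negative log-posterior at the
    # mode; its inverse gives the variance of the Gaussian approximation
    h = 1e-4
    hess = (neg_log_post(mode + h) - 2 * neg_log_post(mode)
            + neg_log_post(mode - h)) / h**2
    return mode, 1 / np.sqrt(hess)

mode, sd = laplace_approx()
```

For 6 heads and 9 tails with a flat prior, the mode is 0.4 and the approximate standard deviation is about 0.13.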

Bambi can solve Bayesian models using the quadratic method for us. In the following code block, we first define a model for the coin-flipping problem, the same one we already defined for the grid method, and then we fit it using the quadratic method, called laplace in Bambi:

Code 10.2

data = pd.DataFrame(data, columns=["w"]) 
priors = {"Intercept": bmb.Prior("Uniform...

10.4 Markovian methods

There is a family of related methods collectively known as Markov chain Monte Carlo (MCMC) methods. These are stochastic methods that allow us to get samples from the true posterior distribution as long as we can compute the likelihood and the prior point-wise. You may remember that this is the same condition we needed for the grid method, but unlike the grid method, MCMC methods can efficiently sample from the higher-probability regions even in very high dimensions.

MCMC methods visit each region of the parameter space following their relative probabilities. If the probability of region A is twice that of region B, we will obtain twice as many samples from A as we will from B. Hence, even if we are not capable of computing the whole posterior analytically, we could use MCMC methods to take samples from it. In theory, MCMC will give us samples from the correct distribution – the catch is that this theoretical guarantee only holds asymptotically, that is, for an infinite...
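The core idea can be illustrated with a bare-bones Metropolis sampler for the coin-flipping posterior (a didactic sketch, not PyMC's implementation; the proposal scale of 0.1 is an arbitrary choice):

```python
import numpy as np

def metropolis(draws=5000, heads=6, tails=9, seed=42):
    """A minimal Metropolis sampler for the coin-flipping posterior."""
    rng = np.random.default_rng(seed)

    def log_post(theta):
        # log prior (flat on [0, 1]) plus log likelihood, point-wise
        if theta <= 0 or theta >= 1:
            return -np.inf
        return heads * np.log(theta) + tails * np.log(1 - theta)

    samples = np.empty(draws)
    theta = 0.5  # starting position
    for i in range(draws):
        # propose a new position around the current one
        proposal = theta + rng.normal(0, 0.1)
        # accept with probability min(1, p(proposal) / p(current)),
        # computed on the log scale for numerical stability
        if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
            theta = proposal
        samples[i] = theta
    return samples

samples = metropolis()
```

Notice that regions where the posterior is twice as high are visited twice as often, so the histogram of `samples` approximates the posterior without ever computing the marginal likelihood.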

10.5 Sequential Monte Carlo

One of the caveats of Metropolis-Hastings and NUTS (and other Hamiltonian Monte Carlo variants) is that if the posterior has multiple peaks and these peaks are separated by regions of very low probability, these methods can get stuck in a single mode and miss the others!

Many of the methods developed to overcome this multiple minima problem are based on the idea of tempering. This idea, once again, is borrowed from statistical mechanics. The number of states a physical system can populate depends on the temperature of the system; at 0 Kelvin (the lowest possible temperature), every system is stuck in a single state. On the other extreme, for an infinite temperature, all possible states are equally likely. Generally, we are interested in systems at some intermediate temperature. For Bayesian models, there is a very intuitive way to adapt this tempering idea by writing Bayes’ theorem with a twist.

p(θ | y)_β ∝ p(y | θ)^β p(θ)

The parameter β is known as the inverse temperature...
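To see what β does, the following sketch (my own illustration, assuming a flat prior) evaluates the tempered posterior on a grid for the coin-flipping model: at β = 0 we recover the prior, and at β = 1 the full posterior.

```python
import numpy as np
from scipy import stats

def tempered_posterior(beta, heads=6, tails=9, grid_points=100):
    """Tempered posterior p(theta | y)_beta ∝ p(y | theta)**beta * p(theta),
    evaluated on a grid for the coin-flipping model with a flat prior."""
    grid = np.linspace(0.001, 0.999, grid_points)
    prior = np.ones(grid_points)
    likelihood = stats.binom.pmf(heads, heads + tails, grid)
    post = likelihood**beta * prior
    return grid, post / post.sum()

# beta = 0 recovers the (flat) prior; beta = 1 recovers the full
# posterior; intermediate values interpolate between the two, which is
# what lets tempered samplers travel through low-probability regions
grid, cold = tempered_posterior(0.0)
grid, warm = tempered_posterior(1.0)
```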

10.6 Diagnosing the samples

In this book, we have used numerical methods to compute the posterior for virtually all models. That will most likely be the case for you, too, when using Bayesian methods for your own problems. Since we are approximating the posterior with a finite number of samples, it is important to check whether we have a valid sample; otherwise, any analysis from it will be totally flawed. There are several tests we can perform, some of which are visual and others quantitative. These tests are designed to spot problems with our samples, but they are unable to prove we have the correct distribution; they can only provide evidence that the sample seems reasonable. If we find problems with the sample, there are many solutions to try. We will discuss them along with the diagnostics.

To make the explanations concrete, we are going to use minimalist models, with two parameters: a global parameter a and a group parameter b. And that’s it, we do not even have likelihood...

10.7 Convergence

Theoretically, MCMC methods are guaranteed to converge once we take infinite samples. In practice, we need to check that we have reasonable finite samples. Usually, we say the sampler has converged once we have collected evidence showing that the samples are stable in some sense. A simple test is to run the same MCMC simulation multiple times and check whether we get the same result every time. This is the reason why PyMC runs more than one chain by default. For modern computers, this is virtually free, as we have multiple cores. Nor are the extra chains wasted: we can combine samples from different chains to compute summaries, plots, etc.

There are many ways to check that different chains are practically equivalent, both visually and with formal tests. We are not going to get too technical here; we are just going to show a few examples and hope they are enough for you to develop an intuition for interpreting diagnostics.
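As one concrete example of such a check, here is a simplified version of the R̂ (R-hat) statistic, which compares the variance between chains to the variance within each chain; note that ArviZ's az.rhat implements a more robust rank-normalized split-R̂, so this sketch is for intuition only:

```python
import numpy as np

def r_hat(chains):
    """Simplified R-hat for an array of shape (n_chains, n_draws):
    roughly 1 when chains agree, noticeably above 1 when they do not."""
    n_chains, n_draws = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)
    between = n_draws * chain_means.var(ddof=1)  # between-chain variance
    within = chain_vars.mean()                   # within-chain variance
    # pooled variance estimate, as in Gelman et al.'s formulation
    var_est = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_est / within)

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 1000))       # four chains, same distribution
bad = good + np.arange(4)[:, None] * 2  # chains stuck in different places
```

For `good`, R̂ is essentially 1; for `bad`, the between-chain variance dominates and R̂ is far above the usual 1.01 threshold.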

10.7.1 Trace plot

One way to check...

10.8 Effective Sample Size (ESS)

MCMC samples can be correlated. The reason is that we use the current position to generate a new position, and we accept or reject the new position taking the current one into account. This dependency is usually lower for well-tuned modern methods, such as Hamiltonian Monte Carlo, but it can still be high. We can compute and plot the autocorrelation with az.plot_autocorr. But usually, a more useful metric is the Effective Sample Size (ESS). We can think of this number as the number of useful draws we have in our sample. Due to autocorrelation, this number is usually going to be lower than the actual number of samples. We can compute it using the az.ess function (see Table 10.2). The ESS diagnostic is also computed by default by the az.summary function and optionally by az.plot_forest (using the ess=True argument).

Table 10.2: ESS per parameter for model_cm and model_ncm

               a     b0     b1     b2     b3     b4     b5     b6     b7     b8     b9
model_cm      14    339   3893   5187   4025   5588   4448   4576   4025   4249   4973
model_ncm   2918   4100   4089...
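The relationship between autocorrelation and ESS can be illustrated with a deliberately rough estimator (az.ess should be preferred in practice; this sketch uses a crude cutoff on the autocorrelation function):

```python
import numpy as np

def ess_sketch(samples, max_lag=200):
    """Rough effective sample size from the autocorrelation. ArviZ's
    az.ess uses a more careful estimator over split chains; this is
    only an illustration of ESS = n / (1 + 2 * sum of autocorrelations)."""
    n = len(samples)
    x = samples - samples.mean()
    # sample autocorrelation at lags 0..n-1
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    # sum autocorrelations until they become negligible (crude cutoff)
    rho_sum = 0.0
    for rho in acf[1:max_lag]:
        if rho < 0.05:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

rng = np.random.default_rng(1)
iid = rng.normal(size=2000)  # independent draws: ESS close to 2000
ar = np.empty(2000)          # strongly autocorrelated AR(1) draws
ar[0] = 0.0
for t in range(1, 2000):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
```

The independent draws keep nearly all of their nominal sample size, while the autocorrelated chain retains only a small fraction of it.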

10.9 Monte Carlo standard error

Even if we have a low R̂ and a very high value of ESS, the samples from MCMC are still finite, and thus we are introducing an error in the estimation of the posterior parameters. Fortunately, we can estimate that error; it is called the Monte Carlo Standard Error (MCSE). The estimation of the MCSE takes into account that the samples are not truly independent of each other. The precision we can claim in our results is limited by this value. If the MCSE for a parameter is 0.2, it does not make sense to report the parameter as 2.54. Instead, if we repeat the simulation (with a different random seed), we should expect that for 68% of the runs, we obtain values in the range 2.54 ± 0.2. Similarly, for 95% of them, we should get values in the range 2.54 ± 0.4. Here, I am assuming the MCSE distributes normally and then using the fact that 68% of the values of a Gaussian are within one standard deviation and 95% are within two standard...
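The arithmetic above can be sketched as follows: for the posterior mean, the MCSE is the posterior standard deviation divided by the square root of the ESS, the effective rather than nominal number of draws (ArviZ's az.mcse implements this and other variants):

```python
import numpy as np

def mcse_mean(samples, ess):
    """Monte Carlo standard error of the posterior mean: the posterior
    standard deviation shrunk by the effective sample size."""
    return samples.std(ddof=1) / np.sqrt(ess)

rng = np.random.default_rng(2)
samples = rng.normal(2.5, 1.0, size=4000)
# for independent draws, the ESS equals the number of draws
err = mcse_mean(samples, ess=4000)
```

With 4,000 effectively independent draws and a posterior standard deviation of 1, the MCSE is about 0.016, so reporting more than two decimal places of the mean would be meaningless.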

10.10 Divergences

We will now explore divergences, a diagnostic that is exclusive to NUTS, as it is based on the inner workings of the method rather than on a property of the generated samples. Divergences are a powerful and sensitive diagnostic, indicating that the sampler has most likely found a region of high curvature in the posterior that it cannot explore properly. A nice feature of divergences is that they usually appear close to the problematic region of the parameter space, and thus we can use them to identify where the problem may be.

Let’s discuss divergences with a visual aid:


Figure 10.14: Pair plot for selected parameters from models model_c and model_nc

As you can see, Figure 10.14 shows the following three subplots:

  • The left subplot: We have a scatter plot for two parameters of model model_c; namely, one dimension of the parameter b (we just picked one at random – feel free to pick a different one), and the logarithm of the parameter a. We take the logarithm because...

10.11 Keep calm and keep trying

What should we do when diagnostics show problems? We should try to fix them. Sometimes, PyMC will provide suggestions on what to change. Pay attention to those suggestions, and you will save a lot of debugging time. Here, I have listed a few common actions you could take:

  • Check for typos or other silly mistakes. It is super common even for experts to make "silly" mistakes. If you misspell the name of a variable, it is highly likely that the model will not even run. But sometimes the mistake is more subtle, and you still get a syntactically valid model that runs, but with the wrong semantics.

  • Increase the number of samples. This might help for very mild problems, like when you're close to the target ESS (or MCSE), or when R̂ is slightly higher than 1.01, but not by much.

  • Remove some samples from the beginning of the trace. When checking a trace plot, you may observe that a few samples from the first few steps have overall higher or...

10.12 Summary

In this chapter, we have taken a conceptual walk through some of the most common methods used to compute the posterior distribution. We have put special emphasis on MCMC methods, which are designed to work on any given model (or at least a broad range of models), and thus are sometimes called universal inference engines. These methods are the core of any probabilistic programming language as they allow for automatic inference, letting users concentrate on iterative model design and interpretations of the results.

We also discussed numerical and visual tests for diagnosing samples. Without good approximations of the posterior distribution, all the advantages and flexibility of the Bayesian framework vanish. Thus, evaluating the quality of the samples is a crucial step before doing any other type of analysis.

10.13 Exercises

  1. Use the grid method with other priors; for example, try with prior = (grid <= 0.5).astype(int) or prior = abs(grid - 0.5), or try defining your own crazy priors. Experiment with other data, such as increasing the total amount of data or making it more or less even in terms of the number of heads you observe.

  2. In the code we use to estimate π, keep N fixed and re-run the code a couple of times. Notice that the results are different because we are using random numbers, but also check that the errors are more or less of the same order. Try changing the number of points N and re-run the code. Can you guesstimate how the number of points N and the error are related? For a better estimation, you may want to modify the code to compute the error as a function of N. You can also run the code a few times with the same N and compute the mean error and standard deviation of the error. You can plot these results using the plt.errorbar() function from Matplotlib. Try using...
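Exercise 2 refers to the book's π-estimation code, which is not reproduced in this excerpt. A minimal stand-in (my own sketch, not the book's exact listing):

```python
import numpy as np

def estimate_pi(N, seed=0):
    """Monte Carlo estimate of pi: the fraction of uniform points in the
    square [-1, 1] x [-1, 1] that fall inside the inscribed unit circle
    approximates pi / 4."""
    rng = np.random.default_rng(seed)
    x, y = rng.uniform(-1, 1, size=(2, N))
    inside = (x**2 + y**2) <= 1
    pi_hat = inside.mean() * 4
    error = abs(pi_hat - np.pi) / np.pi * 100  # percent error
    return pi_hat, error

pi_hat, error = estimate_pi(100_000)
```

Since this is a Monte Carlo estimate of a proportion, the error shrinks roughly as 1/√N, which is the relationship the exercise asks you to discover.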

Join our community Discord space

Join our Discord community to meet like-minded people and learn alongside more than 5000 members at: https://packt.link/bayesian
