| ||

--Pierre-Simon Laplace |

In this chapter, we will learn the core concepts of Bayesian statistics and some of the instruments in the Bayesian toolbox. We will use some Python code in this chapter, but this chapter will be mostly theoretical; most of the concepts in this chapter will be revisited many times through the rest of the book. This chapter, being intense on the theoretical side, may be a little anxiogenic for the coder in you, but I think it will ease the path to effectively applying Bayesian statistics to your problems.

In this chapter, we will cover the following topics:

Statistical modeling

Probabilities and uncertainty

Bayes' theorem and statistical inference

Single parameter inference and the classic coin-flip problem

Choosing priors and why people often don't like them, but should

Communicating a Bayesian analysis

Installing all Python packages

Statistics is about collecting, organizing, analyzing, and interpreting data, and hence statistical knowledge is essential for data analysis. Another useful skill when analyzing data is knowing how to write code in a programming language such as Python. Manipulating data is usually necessary given that we live in a messy world with even messier data, and coding helps to get things done. Even if your data is clean and tidy, programming will still be very useful since modern Bayesian statistics is mostly computational statistics.

Most introductory statistical courses, at least for non-statisticians, are taught as a collection of recipes that more or less go like this; go to the the statistical pantry, pick one can and open it, add data to taste and stir until obtaining a consistent p-value, preferably under *0.05* (if you don't know what a p-value is, don't worry; we will not use them in this book). The main goal in this type of course is to teach you how to pick the proper can. We will take a different approach: we will also learn some recipes, but this will be home-made rather than canned food; we will learn how to mix fresh ingredients that will suit different gastronomic occasions. But before we can cook, we must learn some statistical vocabulary and also some concepts.

Data is an essential ingredient of statistics. Data comes from several sources, such as experiments, computer simulations, surveys, field observations, and so on. If we are the ones that will be generating or gathering the data, it is always a good idea to first think carefully about the questions we want to answer and which methods we will use, and only then proceed to get the data. In fact, there is a whole branch of statistics dealing with data collection known as **experimental design**. In the era of data deluge, we can sometimes forget that gathering data is not always cheap. For example, while it is true that the Large Hadron Collider (LHC) produces hundreds of terabytes a day, its construction took years of manual and intellectual effort. In this book we will assume that we already have collected the data and also that the data is clean and tidy, something rarely true in the real world. We will make these assumptions in order to focus on the subject of this book. If you want to learn how to use Python for cleaning and manipulating data and also a primer on machine learning, you should probably read the book *Python Data Science Handbook* by *Jake VanderPlas*.

OK, so let's assume we have our dataset; usually, a good idea is to explore and visualize it in order to get some intuition about what we have in our hands. This can be achieved through what is known as
**Exploratory Data Analysis** (**EDA**), which basically consists of the following:

Descriptive statistics

Data visualization

The first one, descriptive statistics, is about how to use some measures (or statistics) to summarize or characterize the data in a quantitative manner. You probably already know that you can describe data using the mean, mode, standard deviation, interquartile ranges, and so forth. The second one, data visualization, is about visually inspecting the data; you probably are familiar with representations such as histograms, scatter plots, and others. While EDA was originally thought of as something you apply to data before doing any complex analysis or even as an alternative to complex model-based analysis, through the book we will learn that EDA is also applicable to understanding, interpreting, checking, summarizing, and communicating the results of Bayesian analysis.

Sometimes, plotting our data and computing simple numbers, such as the average of our data, is all we need. Other times, we want to make a generalization based on our data. We may want to understand the underlying mechanism that could have generated the data, or maybe we want to make predictions for future (yet unobserved) data points, or we need to choose among several competing explanations for the same observations. That's the job of **inferential statistics**. To do inferential statistics we will rely on probabilistic models. There are many types of models and most of science, and I will add all of our understanding of the *real world*, is through models. The brain is just a machine that models reality (whatever reality might be) see this TED talk about the machine that builds the reality http://www.tedxriodelaplata.org/videos/m%C3%A1quina-construye-realidad.

What are models? Models are simplified descriptions of a given system (or process). Those descriptions are purposely designed to capture only the most relevant aspects of the system, and hence, most models do not pretend they are able to explain everything; on the contrary, if we have a simple and a complex model and both models explain the data more or less equally well, we will generally prefer the simpler one. This heuristic for simple models is known as Occam's razor, and we will discuss how it is related to Bayesian analysis in Chapter 6, *Model Comparison*.

Model building, no matter which type of model you are building, is an iterative process following more or less the same basic rules. We can summarize the Bayesian modeling process using three steps:

Given some data and some assumptions on how this data could have been generated, we will build models. Most of the time, models will be crude approximations, but most of the time this is all we need.

Then we will use Bayes' theorem to add data to our models and derive the logical consequences of mixing the data and our assumptions. We say we are conditioning the model on our data.

Lastly, we will check that the model makes sense according to different criteria, including our data and our expertise on the subject we are studying.

In general, we will find ourselves performing these three steps in a non-linear iterative fashion. Sometimes we will retrace our steps at any given point: maybe we made a silly programming mistake, maybe we found a way to change the model and improve it, maybe we need to add more data.

Bayesian models are also known as **probabilistic models** because they are built using probabilities. Why probabilities? Because probabilities are the correct mathematical tool to model the uncertainty in our data, so let's take a walk through the garden of forking paths.

While Probability Theory is a mature and well-established branch of mathematics, there is more than one interpretation of what probabilities are. To a Bayesian, a probability is a measure that quantifies the uncertainty level of a statement. If we know nothing about coins and we do not have any data about coin tosses, it is reasonable to think that the probability of a coin landing heads could take any value between *0* and *1*; that is, in the absence of information, all values are equally likely, our uncertainty is maximum. If we know instead that coins tend to be balanced, then we may say that the probability of a coin landing is exactly *0.5* or may be around *0.5* if we admit that the balance is not perfect. If now, we collect data, we can update these prior assumptions and hopefully reduce the uncertainty about the bias of the coin. Under this definition of probability, it is totally valid and natural to ask about the probability of life on Mars, the probability of the mass of the electron being *9.1 x 10-31* kg, or the probability of the 9th of July of 1816 being a sunny day. Notice, for example, that the question of whether or not life exists on Mars has a binary outcome but what we are really asking is how likely is it to find life on Mars given our data and what we know about biology and the physical conditions on that planet? The statement is about our state of knowledge and not, directly, about a property of nature. We are using probabilities because we cannot be sure about the events, not because the events are necessarily random. Since this definition of probability is about our epistemic state of mind, sometimes it is referred to as the subjective definition of probability, explaining the slogan of subjective statistics often attached to the Bayesian paradigm. Nevertheless, this definition does not mean all statements should be treated as equally valid and so anything goes; this definition is about acknowledging that our understanding about the world is imperfect and conditioned on the data and models we have made. There is not such a thing as a model-free or theory-free understanding of the world; even if it were be possible to free ourselves from our social preconditioning, we will end up with a biological limitation: our brain, subject to the evolutionary process, has been wired with models of the world. We are doomed to think like humans and we will never think like bats or anything else! Moreover, the universe is an uncertain place and, in general the best we can do is to make probabilistic statements about it. Notice that it does not matter if the underlying reality of the world is deterministic or stochastic; we are using probability as a tool to quantify uncertainty.

Logic is about thinking without making mistakes. Under the Aristotelian or classical logic, we can only have statements taking the values true or false. Under the Bayesian definition of probability, certainty is just a special case: a true statement has a probability of *1*, a false one has probability *0*. We would assign a probability of *1* about life on Mars only after having conclusive data indicating something is growing and reproducing and doing other activities we associate with living organisms. Notice, however, that assigning a probability of *0* is harder because we can always think that there is some Martian spot that is unexplored, or that we have made mistakes with some experiment, or several other reasons that could lead us to falsely believe life is absent on Mars when it is not. Related to this point is Cromwell's rule, stating that we should reserve the use of the prior probabilities of *0* or *1* to logically true or false statements. Interesting enough, Cox mathematically proved that if we want to extend logic to include uncertainty we must use probabilities and probability theory. Bayes' theorem is just a logical consequence of the rules of probability as we will see soon. Hence, another way of thinking about Bayesian statistics is as an extension of logic when dealing with uncertainty, something that clearly has nothing to do with subjective reasoning in the pejorative sense. Now that we know the Bayesian interpretation of probability, let's see some of the mathematical properties of probabilities. For a more detailed study of probability theory, you can read *Introduction to probability* by *Joseph K Blitzstein & Jessica Hwang*.

Probabilities are numbers in the interval *[0, 1]*, that is, numbers between *0* and *1*, including both extremes. Probabilities follow some rules; one of these rules is the product rule:

We read this as follows: the probability of *A* and *B* is equal to the probability of *A* given *B*, times the probability of *B*. The expression *p(A, B)* represents the joint probability of A and *B*. The expression *p(A|B)* is used to indicate a conditional probability; the name refers to the fact that the probability of *A* is conditioned on knowing *B*. For example, the probability that a pavement is wet is different from the probability that the pavement is wet if we know (or given that) is raining. A conditional probability can be larger than, smaller than or equal to the unconditioned probability. If knowing *B* does not provides us with information about *A*, then *p(A|B)=p(A)*. That is *A* and *B* are independent of each other. On the contrary, if knowing *B* gives us useful information about* A*, then the conditional probability could be larger or smaller than the unconditional probability depending on whether knowing *B* makes *A* less or more likely.

Conditional probabilities are a key concept in statistics, and understanding them is crucial to understanding Bayes' theorem, as we will see soon. Let's try to understand them from a different perspective. If we reorder the equation for the product rule, we get the following:

Notice that a conditional probability is always larger or equal than the joint probability. The reasons are that: we do not condition on zero-probability events, this is implied in the expression, and probabilities are restricted to be in the interval *[0, 1]*. Why do we divide by *p(B)?* Knowing *B* is equivalent to saying that we have restricted the space of possible events to B and thus, to find the conditional probability, we take the favorable cases and divide them by the total number of events. It is important to realize that all probabilities are indeed conditionals, there is not such a thing as an absolute probability floating in vacuum space. There is always some model, assumption, or condition, even if we don't notice or know them. The probability of rain is not the same if we are talking about Earth, Mars, or some other place in the Universe. In the same way, the probability of a coin landing heads or tails depends on our assumptions of the coin being biased in one way or another. Now that we are more familiar with the concept of probability, let's jump to the next topic, probability distributions.

A probability distribution is a mathematical object that describes how likely different events are. In general, these events are restricted somehow to a set of possible events. A common and useful conceptualization in statistics is to think that data was generated from some probability distribution with unobserved parameters. Since the parameters are unobserved and we only have data, we will use Bayes' theorem to invert the relationship, that is, to go from the data to the parameters. Probability distributions are the building blocks of Bayesian models; by combining them in proper ways we can get useful complex models.

We will meet several probability distributions throughout the book; every time we discover one we will take a moment to try to understand it. Probably the most famous of all of them is the Gaussian or normal distribution. A variable *x* follows a Gaussian distribution if its values are dictated by the following formula:

In the formula, and are the parameters of the distributions. The first one can take any real value, that is, , and dictates the mean of the distribution (and also the median and mode, which are all equal). The second is the standard deviation, which can only be positive and dictates the spread of the distribution. Since there are an infinite number of possible combinations of and values, there is an infinite number of instances of the Gaussian distribution and all of them belong to the same Gaussian family. Mathematical formulas are concise and unambiguous and some people say even beautiful, but we must admit that meeting them can be intimidating; a good way to break the ice is to use Python to explore them. Let's see what the Gaussian distribution family looks like:

import matplotlib.pyplot as plt import numpy as np from scipy import stats import seaborn as sns mu_params = [-1, 0, 1] sd_params = [0.5, 1, 1.5] x = np.linspace(-7, 7, 100) f, ax = plt.subplots(len(mu_params), len(sd_params), sharex=True, sharey=True) for i in range(3): for j in range(3): mu = mu_params[i] sd = sd_params[j] y = stats.norm(mu, sd).pdf(x) ax[i,j].plot(x, y) ax[i,j].plot(0, 0, label="$\\mu$ = {:3.2f}\n$\\sigma$ = {:3.2f}".format (mu, sd), alpha=0) ax[i,j].legend(fontsize=12) ax[2,1].set_xlabel('$x$', fontsize=16) ax[1,0].set_ylabel('$pdf(x)$', fontsize=16) plt.tight_layout()

The output of the preceding code is as follows:

A variable, such as* x*, that comes from a probability distribution is called a **random variable**. It is not that the variable can take any possible value. On the contrary, the values are strictly controlled by the probability distribution; the randomness arises from the fact that we could not predict which value the variable will take, but only the probability of observing those values. A common notation used to say that a variable is distributed as a Gaussian or normal distribution with parameters and is as follows:

The symbol *~* (tilde) is read as *is distributed as*.

There are two types of random variable, continuous and discrete. **Continuous random variables** can take any value from some interval (we can use Python floats to represent them), and the
**discrete random variables** can take only certain values (we can use Python integers to represent them).

Many models assume that successive values of a random variables are all sampled from the same distribution and those values are independent of each other. In such a case, we will say that the variables are independently and identically distributed, or *iid* variables for short. Using mathematical notation, we can see that two variables are independent if for every value of *x* and *y*:

A common example of non *iid* variables are temporal series, where a temporal dependency in the random variable is a key feature that should be taken into account. Take for example the following data coming from http://cdiac.esd.ornl.gov. This data is a record of atmospheric CO2 measurements from 1959 to 1997. We are going to load the data (included with the accompanying code) and plot it.

data = np.genfromtxt('mauna_loa_CO2.csv', delimiter=',') plt.plot(data[:,0], data[:,1]) plt.xlabel('$year$', fontsize=16) plt.ylabel('$CO_2 (ppmv)$', fontsize=16)

Each point corresponds to the measured levels of atmospheric CO2 per month. It is easy to see in this plot the temporal dependency of data points. In fact, we have two trends here, a seasonal one (this is related to cycles of vegetation growth and decay) and a global one indicating an increasing concentration of atmospheric CO2.

Now that we have learned some of the basic concepts and jargon from statistics, we can move to the moment everyone was waiting for. Without further ado let's contemplate, in all its majesty, Bayes' theorem:

Well, it is not that impressive, is it? It looks like an elementary school formula and yet, paraphrasing Richard Feynman, this is all you need to know about Bayesian statistics.

Learning where Bayes' theorem comes from will help us to understand its meaning. In fact, we have already seen all the probability theory necessary to derive it:

According to the product rule, we have the following:

This can also be written as follows:

Given than the terms on the left are equal, we can write the following:

And if we reorder it, we get Bayes' theorem:

Now, let's see what this formula implies and why it is important. First, it says that *p(D|H)* is not necessarily the same as *p(D|H)*. This is a very important fact, one that's easy to miss in daily situations even for people trained in statistics and probability. Let's use a simple example to clarify why these quantities are not necessary the same. The probability of having two legs given these someone is a human is not the same as the probability of being a human given that someone has two legs. Almost all humans have two legs, except for people that have suffered from accidents or birth problems, but a lot of non-human animals have two legs, such as birds.

If we replace *H* with hypothesis and *D* with data, Bayes' theorem tells us how to compute the probability of a hypothesis *H* given the data *D*, and that's the way you will find Bayes' theorem explained in a lot of places. But, how do we turn a hypothesis into something that we can put inside Bayes' theorem? Well, we do that using probability distributions so, in general, our *H* will be a hypothesis in a very narrow sense. What we will be really doing is trying to find parameters of our models, that is, parameters of probability distributions. So maybe, instead of hypothesis, it is better to talk about models and avoid confusion. And by the way, don't try to set *H* to statements such as "unicorns are real", unless you are willing to build a realistic probabilistic model of unicorn existence!

Since Bayes' theorem is central and we will use it over and over again, let's learn the names of its parts:

*p(H)*: Prior*p(D|H)*: Likelihood*p(H|D)*: Posterior*p(D)*: Evidence

The **prior distribution** should reflect what we know about the value of some parameter before seeing the data *D*. If we know nothing, like Jon Snow, we will use flat priors that do not convey too much information. In general, we can do better, as we will learn through this book. The use of priors is why some people still think Bayesian statistics is subjective, even when priors are just another assumption that we made when modeling and hence are just as subjective (or objective) as any other assumption, such as likelihoods.

The **likelihood** is how we will introduce data in our analysis. It is an expression of the plausibility of the data given the parameters.

The **posterior distribution** is the result of the Bayesian analysis and reflects all that we know about a problem (given our data and model). The posterior is a probability distribution for the parameters in our model and not a single value. This distribution is a balance of the prior and the likelihood. There is a joke that says: A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule. One way to kill the mood after hearing this joke is to explain that if the likelihood and priors are both vague you will get a posterior reflecting vague beliefs about seeing a mule rather than strong ones. Anyway the joke captures the idea of a posterior being somehow a compromise between prior and likelihood. Conceptually, we can think of the posterior as the updated prior in the light of the data. In fact, the posterior of one analysis can be used as the prior of a new analysis after collecting new data. This makes Bayesian analysis particularly suitable for analyzing data that becomes available in sequential order. Some examples could be early warning systems for disasters that process *online* data coming from meteorological stations and satellites. For more details read about **online machine learning methods**.

The last term is the **evidence**, also known as marginal likelihood. Formally, the evidence is the probability of observing the data averaged over all the possible values the parameters can take. Anyway, for most of the parts of the book, we will not care about the evidence, and we will think of it as a simple normalization factor. This will not be problematic since we will only care about the relative values of the parameters and not their absolute ones. If we ignore the evidence, we can write Bayes' theorem as a proportionality:

Understanding the exact role of each term will take some time and will also require some examples, and that's what the rest of the book is for.

In the last two sections, we have learned several important concepts, but two of them are essentially the core of Bayesian statistics, so let's restate them in a single sentence. Probabilities are used to measure the uncertainty we have about parameters, and Bayes' theorem is the mechanism to correctly update those probabilities in the light of new data, hopefully reducing our uncertainty.

Now that we know what Bayesian statistics is, let's learn how to do Bayesian statistics with a simple example. We are going to begin inferring a single unknown parameter.

The coin-flip problem is a classical problem in statistics and goes like this. We toss a coin a number of times and record how many heads and tails we get. Based on this data we try to answer questions such as is the coin fair? Or, more generally, how biased is the coin? While this problem may sound dull, we should not underestimate it. The coin-flipping problem is a great example to learn the basic of Bayesian statistics; on the one hand, it is about tossing coins, something familiar to almost anyone; on the other, it is a simple model that we can solve and compute with ease. Besides, many real problems consist of binary mutually exclusive outcomes such as *0* or *1*, positive or negative, odds or evens, spam or ham, safe or unsafe, healthy or unhealthy, and so on. Thus, even when we are talking about coins, this model applies to any of those problems.

In order to estimate the bias of a coin, and in general to answer any questions in a Bayesian setting, we will need data and a probabilistic model. For this example, we will assume that we have already tossed a coin a number of times and we have recorded the number of observed heads, so the data part is done. Getting the model will take a little bit more effort. Since this is our first model, we will do all the necessary math (don't be afraid, I promise it will be painless) and we will proceed step by step very slowly. In the next chapter we will revisit this problem by using PyMC3 to solve it numerically, that is, without us doing the math. Instead we will let PyMC3 and our computer do the math.

The first thing we will do is generalize the concept of bias. We will say that a coin with a bias of *1* will always land heads, one with a bias of *0* will always land tails, and one with a bias of *0.5* will land half of the time heads and half of the time tails. To represent the bias, we will use the parameter , and to represent the total number of heads for an *N* number of tosses, we will use the variable *y*. According to Bayes' theorem we have the following formula:

Notice that we need to specify which prior and likelihood we will use. Let's start with the likelihood.

Let's assume that a coin toss does not affect other tosses, that is, we are assuming coin tosses are independent of each other. Let's also assume that only two outcomes are possible, heads or tails. I hope you agree these are very reasonable assumptions to make for our problem. Given these assumptions, a good candidate for the likelihood is the binomial distribution:

This is a discrete distribution returning the probability of getting *y* heads (or in general, success) out of *N* coin tosses (or in general, trials or experiments) given a fixed value of . The following code generates 9 binomial distributions; each subplot has its own legend indicating the corresponding parameters:

n_params = [1, 2, 4] p_params = [0.25, 0.5, 0.75] x = np.arange(0, max(n_params)+1) f, ax = plt.subplots(len(n_params), len(p_params), sharex=True, sharey=True) for i in range(3): for j in range(3): n = n_params[i] p = p_params[j] y = stats.binom(n=n, p=p).pmf(x) ax[i,j].vlines(x, 0, y, colors='b', lw=5) ax[i,j].set_ylim(0, 1) ax[i,j].plot(0, 0, label="n = {:3.2f}\np = {:3.2f}".format(n, p), alpha=0) ax[i,j].legend(fontsize=12) ax[2,1].set_xlabel('$\\theta$', fontsize=14) ax[1,0].set_ylabel('$p(y|\\theta)$', fontsize=14) ax[0,0].set_xticks(x)

The binomial distribution is also a reasonable choice for the likelihood. Intuitively, we can see that indicates how likely it is that we will obtain a head when tossing a coin, and we have observed that event *y* times. Following the same line of reasoning we get that is the chance of getting a tail, and that event has occurred *N-y* times.

OK, so if we know , the binomial distribution will tell us the expected distribution of heads. The only problem is that we do not know ! But do not despair; in Bayesian statistics, every time we do not know the value of a parameter, we put a prior on it, so let's move on and choose a prior.

As a prior we will use a **beta distribution**, which is a very common distribution in Bayesian statistics and looks like this:

If we look carefully we will see that the beta distribution looks similar to the binomial except for the term with the . This is the Greek uppercase gamma letter and represents what is known as **gamma function**. All that we care about at this point is that the first term is a normalization constant that ensures the distribution integrates to *1* and that the beta distribution has two parameters, and , that control the distribution. Using the following code, we will explore our third distribution so far:

params = [0.5, 1, 2, 3] x = np.linspace(0, 1, 100) f, ax = plt.subplots(len(params), len(params), sharex=True, sharey=True) for i in range(4): for j in range(4): a = params[i] b = params[j] y = stats.beta(a, b).pdf(x) ax[i,j].plot(x, y) ax[i,j].plot(0, 0, label="$\\alpha$ = {:3.2f}\n$\\beta$ = {:3.2f}".format(a, b), alpha=0) ax[i,j].legend(fontsize=12) ax[3,0].set_xlabel('$\\theta$', fontsize=14) ax[0,0].set_ylabel('$p(\\theta)$', fontsize=14)

OK, the beta distribution is nice, but why are we using it for our model? There are many reasons to use a beta distribution for this and other problems. One of them is that the beta distribution is restricted to be between *0* and *1*, in the same way our parameter is. Another reason is its versatility. As we can see in the preceding figure, the distribution adopts several *shapes*, including a uniform distribution, Gaussian-like distributions, U-like distributions, and so on. A third reason is that the beta distribution is the **conjugate prior** of the binomial distribution (which we are using as the likelihood). A conjugate prior of a likelihood is a prior that, when used in combination with the given likelihood, returns a posterior with the same functional form as the prior. Untwisting the tongue, every time we use a beta distribution as prior and a binomial distribution as likelihood, we will get a beta as a posterior. There are other pairs of conjugate priors, for example, the Gaussian distribution is the conjugate prior of itself. For a more detailed discussion read https://en.wikipedia.org/wiki/Conjugate_prior. For many years, Bayesian analysis was restricted to the use of conjugate priors. Conjugacy ensures mathematical tractability of the posterior, which is important given that a common problem in Bayesian statistics is to end up with a posterior we cannot solve analytically. This was a deal breaker before the development of suitable computational methods to solve any possible posterior. From the next chapter on, we will learn how to use modern computational methods to solve Bayesian problems whether we choose conjugate priors or not.

Let's remember Bayes' theorem says that the posterior is proportional to the likelihood times the prior:

So for our problem, we have to multiply the binomial and the beta distributions:

Now let's simplify this expression. To our practical concerns we can drop all the terms that do not depend on and our results will still be valid. So we can write the following:

Reordering it, we get the following:

If we pay attention, we will see that this expression has the same functional form of a beta distribution (except for the normalization) with and , which means that the posterior for our problem is the beta distribution:

Now that we have the analytical expression for the posterior, let's use Python to compute it and plot the results. In the following code you will see there is actually one line that computes the results while the others are there just to plot them:

theta_real = 0.35 trials = [0, 1, 2, 3, 4, 8, 16, 32, 50, 150] data = [0, 1, 1, 1, 1, 4, 6, 9, 13, 48] beta_params = [(1, 1), (0.5, 0.5), (20, 20)] dist = stats.beta x = np.linspace(0, 1, 100) for idx, N in enumerate(trials): if idx == 0: plt.subplot(4,3, 2) else: plt.subplot(4,3, idx+3) y = data[idx] for (a_prior, b_prior), c in zip(beta_params, ('b', 'r', 'g')): p_theta_given_y = dist.pdf(x, a_prior + y, b_prior + N - y) plt.plot(x, p_theta_given_y, c) plt.fill_between(x, 0, p_theta_given_y, color=c, alpha=0.6) plt.axvline(theta_real, ymax=0.3, color='k') plt.plot(0, 0, label="{:d} experiments\n{:d} heads".format(N, y), alpha=0) plt.xlim(0,1) plt.ylim(0,12) plt.xlabel(r'$\theta$') plt.legend() plt.gca().axes.get_yaxis().set_visible(False) plt.tight_layout()

On the first line we have **0 experiments** done, hence these curves are just our priors. We have three curves, one per prior:

The blue one is a uniform prior. This is equivalent to saying that all the possible values for the bias are equally probable a priori.

The red one is similar to the uniform. For the sake of this example we will just say that we are a little bit more confident that the bias is either

*0*or*1*than the rest of the values.The green and last one is centered and concentrated around

*0.5*, so this prior is compatible with information indicating that the coin has more or less about the same chance of landing heads or tails. We could also say this prior is compatible with the*belief*that most coins are fair. While such a word is commonly used in Bayesian discussions we think is better to talk about models that are informed by data.

The rest of the subplots show posteriors for successive experiments. Remember that we can think of posteriors as updated priors given the data. The number of experiments (or coin tosses) and the number of heads are indicated in each subplot's legend. There is also a black vertical line at *0.35* representing the true value for . Of course, in real problems we do not know this value, and it is here just for pedagogical reasons. This figure can teach us a lot about Bayesian analysis, so let's take a moment to understand it:

The result of a Bayesian analysis is the posterior distribution, not a single value but a distribution of plausible values given the data and our model.

The most probable value is given by the mode of the posterior (the peak of the distribution).

The spread of the posterior is proportional to the uncertainty about the value of a parameter; the more spread the distribution, the less certain we are.

- Even when it is easy to see that the uncertainty we have in the first example is larger than in the second one because we have more data that supports our inference. This intuition is reflected in the posterior.
Given a sufficiently large amount of data, two or more Bayesian models with different priors will tend to converge to the same result. In the limit of infinite data, no matter which prior we use, we will always get the same posterior. Remember that infinite is a limit and not a number, so from a practical point of view in some cases the infinite amount of data could be approximated with a really small number of data points.

How fast posteriors converge to the same distribution depends on the data and the model. In the previous figure we can see that the blue and red posteriors look almost indistinguishable after only

*8*experiments, while the red curve continues to be separated from the other two even after*150*experiments.Something not obvious from the figure is that we will get the same result if we update the posterior sequentially than if we do it all at once. We can compute the posterior

*150*times, each time adding one more observation and using the obtained posterior as the new prior, or we can just compute one posterior for the*150*tosses at once. The result will be exactly the same. This feature not only makes perfect sense, also leads to a natural way of updating our estimations when we get new data, a situation common in many data analysis problems.

From the preceding example, it is clear that priors influence the result of the analysis. This is totally fine, priors are supposed to do this. Newcomers to Bayesian analysis (as well as detractors of this paradigm) are in general a little nervous about how to choose priors, because they do not want the prior to act as a censor that does not let the data speak for itself! That's okay, but we have to remember that data does not really speak; at best, data murmurs. Data only makes sense in the light of our models, including mathematical and mental models. There are plenty of examples in the history of science where the same data leads people to think differently about the same topics. Check for example a recent *experiment* that appeared in the New York Times http://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html?_r=0.

Some people fancy the idea of using non-informative priors (also known as flat, vague, or diffuse priors); these priors have the least possible amount of impact on the analysis. While it is possible to use them, in general, we can do better. Throughout this book we will follow the recommendations of Gelman, McElreath, Kruschke and many others, and we will prefer weakly informative priors. For many problems we often know something about the values a parameter can take, we may know that a parameter is restricted to being positive, or we may know the approximate range it can take, or if we expect the value to be close to zero or above/below some value. In such cases, we can use priors to put some weak information in our models without being afraid of being too pushy with our data. Because these priors work to keep the posterior distribution approximately within certain reasonable bounds, they are also know as regularizing priors.

Of course, it can also be possible to use informative priors. These are very strong priors that convey a lot of information. Depending on your problem, it could be easy or not to find this type of prior; for example, in my field of work (structural bioinformatics), people have been using all the prior information they can get, in Bayesian and non-Bayesian ways, to study and especially predict the structure of proteins. This is reasonable because we have been collecting data from thousands of carefully designed experiments for decades and hence we have a great amount of trustworthy prior information at our disposal. Not using it would be absurd! So the take-home message is if you have reliable prior information, there is no reason to discard that information, including the non-nonsensical argument that not using information we trust is objective. Imagine if every time an automotive engineer has to design a new car, she has to start from scratch and re-invent the combustion engine, the wheel, and for that matter, the whole concept of a car. That is not the way things work.

Now we know that there are different kind of priors, but this probably doesn't make us less nervous about choosing among them. Maybe it would be better to not have priors at all. That would make things easier. Well, every model, Bayesian or not has some kind of priors in some way or another, even if the prior does not appear explicitly. In fact many results from frequentist statistics can be seen as special cases of a Bayesian model under certain circumstances, such as flat priors. Let's pay attention to the previous figure one more time. We can see that the mode (the peak of the posterior) of the blue posterior agrees with the expected value for from a frequentist analysis:

Notice that is a point estimate (a number) and not a posterior distribution (or any other type of distribution for that matter). So notice that you can not really avoid priors, but if you include them in your analysis you will get a distribution of plausible values and not only the most probable one. Another advantage of being explicit about priors is that we get more transparent models, meaning more easy to criticize, debug (in a broad sense of the word), and hopefully improve. Building models is an iterative process; sometimes the iteration takes a few minutes, sometimes it could take years. Sometimes it will only involve you and sometimes people you do not even know. Reproducibility matters and transparent assumptions in a model contributes to it.

We are free to use more than one prior (or likelihood) for a given analysis if we are not sure about any special one. Part of the modeling process is about questioning assumptions, and priors are just that. Different assumptions will lead to different models, using data and our domain knowledge of the problem we will be able to compare models. Chapter 06, *Model Comparison* will be devoted to this issue.

Since priors have a central role in Bayesian statistics, we will keep discussing them as we face new problems. So if you have doubts and feel a little bit confused about this discussion just keep calm and don't worry, people have been confused for decades and the discussion is still going on.

Now that we have the posterior, the analysis is finished and we can go home. Well, not yet! We probably need to communicate or summarize the results to others, or even record for later use by ourselves.

If you want to communicate the result, you may need, depending on your audience, to also communicate the model. A common notation to succinctly represent probabilistic models is as follows:

*θ ∼ Beta(α, β)**y ∼ Bin(n = 1, p = θ)*

This is the model we use for the coin-flip example. As we may remember, the symbol *~* indicates that the variable is a random variable distributed according to the distribution on the right, that is, is distributed as a beta distribution with parameters and , and *y* is distributed as a binomial with parameter *n=1* and . The very same model can be represented graphically using Kruschke's diagrams:

On the first level, we have the prior that generates the values for , then the likelihood and, on the last line, the data. Arrows indicate the relationship between variables, and the *~* symbol indicates the stochastic nature of the variables.

All Kruschke's diagrams in the book were made using the templates provided by Rasmus Bååth (http://www.sumsar.net/blog/2013/10/diy-kruschke-style-diagrams/). I would like to specially thanks him for making these templates available.

The result of a Bayesian analysis is the posterior distribution. This contains all the information about our parameters according to the data and the model. If possible, we can just show the posterior distribution to our audience. In general, it is also a good idea to report the mean (or mode or median) of the distribution to have an idea of the location of the distribution and some measure, such as the standard deviation, to have an idea of the dispersion and hence the uncertainty in our estimate. The standard deviation works well for normal-like distributions but can be misleading for other types of distributions, such as skewed ones. So intead, we could use the following approach.

A commonly used device to summarize the spread of a posterior distribution is to use a **Highest Posterior Density** (**HPD**) interval. An HPD is the shortest interval containing a given portion of the probability density. One of the most commonly used is the *95%* HPD or *98%* HPD, often accompanied by the *50%* HPD. If we say that the *95%* HPD for some analysis is *[2-5],* we mean that according to our data and model we think the parameter in question is between *2* and *5* with a *0.95* probability. This is a very intuitive interpretation, to the point that often people misinterpret frequentist confidence intervals as if they were Bayesian credible intervals. If you are familiar with the frequentist paradigm please note that both type of intervals have different interpretations. Performing a fully Bayesian analysis enables us to talk about the probability of a parameter having some value. While this is not possible in the frequentist framework since parameters are fixed by design, a frequentist confidence interval contains or does not contain the true value of a parameter. Another word of caution before we continue: there is nothing special about choosing *95%* or *50%* or any other value. They are just arbitrary commonly used values; we are free to choose the *91.37%* HPD interval if we like. If you want to use the *95%* value, it's OK; just remember that this is just a default value and any justification of which value we should use will be always context-dependent and not automatic.

Computing the *95%* HPD for a unimodal distribution is easy, since it is defined by the percentiles `2.5`

and `97.5`

:

def naive_hpd(post): sns.kdeplot(post) HPD = np.percentile(post, [2.5, 97.5]) plt.plot(HPD, [0, 0], label='HPD {:.2f} {:.2f}'.format(*HPD), linewidth=8, color='k') plt.legend(fontsize=16); plt.xlabel(r'$\theta$', fontsize=14) plt.gca().axes.get_yaxis().set_ticks([]) np.random.seed(1) post = stats.beta.rvs(5, 11, size=1000) naive_hpd(post) plt.xlim(0, 1)

For a multi-modal distribution, the computation of the HPD is a little bit more complicated. If we apply our naive definition of the HPD to a mixture of Gaussians we will get the following:

np.random.seed(1) gauss_a = stats.norm.rvs(loc=4, scale=0.9, size=3000) gauss_b = stats.norm.rvs(loc=-2, scale=1, size=2000) mix_norm = np.concatenate((gauss_a, gauss_b)) naive_hpd(mix_norm)

As we can see in the preceding figure, the HPD computed in the naive way includes values with a low probability, approximately between *[0, 2]*. To compute the HPD in the correct way we will use the function `plot_post`

, which you can download from the accompanying code that comes with the book:

from plot_post import plot_post plot_post(mix_norm, roundto=2, alpha=0.05) plt.legend(loc=0, fontsize=16) plt.xlabel(r"$\theta$", fontsize=14)

As you can see from the preceding figure, the *95% HPD* is composed of two intervals. `plot_post`

also returns the values for the two modes.

One of the nice elements of the Bayesian toolkit is that once we have a posterior, it is possible to use the posterior to generate future data *y*, that is, predictions. Posterior predictive checks consist of comparing the observed data and the predicted data to spot differences between these two sets. The main goal is to check for auto-consistency. The generated data and the observed data should look more or less similar, otherwise there was some problem during the modeling or some problem feeding the data to the model. But even if we did not make any mistake, differences could arise. Trying to understand the mismatch could lead us to improve models or at least to understand their limitations. Knowing which part of our problem/data the model is capturing well and which it is not is valuable information even if we do not know how to improve the model. Maybe the model captures well the mean behavior of our data but fails to predict rare values. This could be problematic for us, or maybe we only care about the mean, so this model will be okay for us. The general aim will be not to declare that a model is false; instead we follow George Box's advice, all models are wrong, but some are useful. We just want to know which part of the model we can trust and try to test whether the model is a good fit for our specific purpose. How confident one can be about a model is certainly not the same across disciplines. Physics can study systems under highly controlled conditions using high-level theories, so models are often seen as good descriptions of reality. Other disciplines such as sociology and biology study complex, difficult to isolate systems, and models have a weaker epistemological status. Nevertheless, independently of which discipline you are working in, models should always be checked and posterior predictive checks together with ideas from exploratory data analysis are a good way to check our models.

The code in the book was written using Python version 3.5, and it is recommended you use the most recent version of Python 3 that is currently available, although most of the code examples may also run for older versions of Python, including Python 2.7, but code could need minor adjustments.

The recommended way to install Python and Python libraries is using Anaconda, a scientific computing distribution. You can read more about Anaconda and download it from https://www.continuum.io/downloads. Once Anaconda is in our system, we can install new Python packages with the following command:

**conda install NamePackage**

We will use the following Python packages:

`Ipython 5.0`

`NumPy 1.11.1`

`SciPy 0.18.1`

`Pandas 0.18.1`

`Matplotlib 1.5.3`

`Seaborn 0.7.1`

`PyMC3 3.0`

To install the latest stable version of PyMC3, run the following command on a command-line terminal:

**pip install pymc3**

We began our Bayesian journey with a very brief discussion about statistical modeling, probability theory and an introduction of the Bayes' theorem. We then use the coin-tossing problem as an excuse to introduce basic aspects of Bayesian modeling and data analysis. We used this classic example to convey some of the most important ideas of Bayesian statistics such as using probability distributions to build models and represent uncertainties. We tried to demystify the use of priors and put them on an equal footing with other elements that we must decide when doing data analysis, such as other parts of the model like the likelihood, or even more meta questions like why are we trying to solve a particular problem in the first place. We ended the chapter discussing the interpretation and communication of the results of a Bayesian analysis. In this chapter we have briefly summarized the main aspects of doing Bayesian data analysis. Throughout the rest of the book we will revisit these ideas to really absorb them and use them as the scaffold of more advanced concepts. In the next chapter we will focus on computational techniques to build and analyze more complex models and we will introduce PyMC3 a Python library that we will use to implement and analyze all our Bayesian models.

We don't know if the brain really works in a Bayesian way, in an approximate Bayesian fashion, or maybe some evolutionary (more or less) optimized heuristics. Nevertheless, we know that we learn by exposing ourselves to data, examples, and exercises. Although you may disagree with this statement given our record as a species on wars, economic-systems that prioritize profit and not people's wellbeing, and other atrocities. Anyway, I strongly recommend you to do the proposed exercises at the end of each chapter:

Modify the code that generated

*figure 3*in order to add a dotted vertical line showing the observed rate head/(number of tosses), compare the location of this line to the mode of the posteriors in each subplot.Try reploting

*figure 3*using other priors (`beta_params`

) and other data (*trials*and*data*).Read about Cromwell's rule at Wikipedia https://en.wikipedia.org/wiki/Cromwell%27s_rule.

Explore different parameters for the Gaussian, binomial and beta plots. Alternatively, you may want to plot a single distribution instead of a grid of distributions.

Read about probabilities and the Dutch book at Wikipedia https://en.wikipedia.org/wiki/Dutch_book.