Modeling with Linear Regression

"In more than three centuries of science everything has changed except perhaps one thing: the love for the simple."
Jorge Wagensberg

Music—from classical compositions to Sheena is a Punk Rocker by The Ramones, passing through the unrecognized hit from a garage band and Piazzolla's Libertango—is made from recurring patterns. The same scales, combinations of chords, riffs, motifs, and so on appear over and over again, giving rise to a wonderful sonic landscape capable of eliciting and modulating the entire range of emotions humans can experience. In a similar fashion, the universe of statistics and machine learning (ML) is built upon recurring patterns, small motifs that appear now and again. In this chapter, we are going to look at one of the most popular and useful of them, the linear model (or motif, if you...

Simple linear regression

Many problems we find in science, engineering, and business are of the following form: we have a variable $x$ and we want to model/predict a variable $y$. Importantly, these variables are paired, like $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. In the simplest scenario, known as simple linear regression, both $x$ and $y$ are uni-dimensional continuous random variables. By continuous, we mean a variable represented using real numbers (or floats, if you wish), and using NumPy, you will represent the variables $x$ and $y$ as one-dimensional arrays. Because this is a very common model, the variables get proper names. We call the $y$ variable the dependent, predicted, or outcome variable, and the $x$ variable the independent, predictor, or input variable. When $X$ is a matrix (we have several different independent variables), we have what is known as multiple linear regression. In this and the following chapter, we will explore these and other...
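As a concrete starting point, the following is a minimal sketch of how a simple linear regression can be written in PyMC3, the probabilistic-programming library used throughout this book. The synthetic data and the specific priors are illustrative assumptions, not the chapter's exact code:

import numpy as np
import pymc3 as pm

# Illustrative synthetic paired data (x_i, y_i)
np.random.seed(42)
x = np.random.normal(10, 1, 100)
y = 2.5 + 0.9 * x + np.random.normal(0, 0.5, 100)

with pm.Model() as model_slr:
    alpha = pm.Normal('alpha', mu=0, sd=10)   # intercept
    beta = pm.Normal('beta', mu=0, sd=1)      # slope
    epsilon = pm.HalfCauchy('epsilon', 5)     # scale of the noise
    mu = alpha + beta * x                     # the linear motif
    y_pred = pm.Normal('y_pred', mu=mu, sd=epsilon, observed=y)
    trace = pm.sample(1000)

The posterior over alpha and beta then describes the whole set of lines compatible with the data, rather than a single best-fit line.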

Robust linear regression

Assuming that the data follows a Gaussian distribution is perfectly reasonable in many situations. By assuming Gaussianity, we are not necessarily saying that the data is really Gaussian; instead, we are saying that it is a reasonable approximation for a given problem. The same applies to other distributions. As we saw in the previous chapter, this Gaussian assumption sometimes fails, for example, in the presence of outliers. We learned that using a Student's t-distribution is a way to effectively deal with outliers and obtain a more robust inference. The very same idea can be applied to linear regression.
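In code, the change is small: we swap the Gaussian likelihood for a Student's t-distribution and add a prior over its normality parameter $\nu$, which controls how heavy the tails are. A hedged sketch, assuming x and y are one-dimensional NumPy arrays as before and using illustrative priors:

import pymc3 as pm

with pm.Model() as model_t:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=1)
    epsilon = pm.HalfCauchy('epsilon', 5)
    # Small values of nu give heavy tails, making outliers less influential
    nu = pm.Exponential('nu', 1/30)
    y_pred = pm.StudentT('y_pred', mu=alpha + beta * x, sd=epsilon, nu=nu,
                         observed=y)
    trace_t = pm.sample(1000)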

To exemplify the robustness that a Student's t-distribution brings to a linear regression, we are going to use a very simple and nice dataset: the third data group from the Anscombe quartet. If you do not know what the Anscombe quartet is, remember...

Hierarchical linear regression

In the previous chapter, we learned the rudiments of hierarchical models. We can apply the same concept to linear regression: this allows models to make inferences at the group level as well as estimations above the group level. As we already saw, this is done by including hyperpriors.

We are going to create eight related data groups, including one group with a single data point:

import numpy as np
import matplotlib.pyplot as plt

N = 20
M = 8
idx = np.repeat(range(M-1), N)  # groups 0..6 get N points each
idx = np.append(idx, 7)         # group 7 gets a single data point
np.random.seed(314)

alpha_real = np.random.normal(2.5, 0.5, size=M)
beta_real = np.random.beta(6, 1, size=M)
eps_real = np.random.normal(0, 0.5, size=len(idx))

x_m = np.random.normal(10, 1, len(idx))
y_m = alpha_real[idx] + beta_real[idx] * x_m + eps_real

_, ax = plt.subplots(2, 4, figsize=(10, 5), sharex=True, sharey=True)
ax = np.ravel(ax)
j, k = 0, N
for i in range(M):
    ax[i].scatter(x_m[j:k], y_m[j:k])  # scatter plot of each group
    j += N
    k += N
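With the data in hand, the hierarchical model places hyperpriors over the means and standard deviations of the per-group intercepts and slopes, so that the group with a single data point can borrow strength from the others. The following sketch assumes the variables defined above (M, idx, x_m, y_m); the specific priors are illustrative choices, not necessarily the chapter's exact code:

import pymc3 as pm

with pm.Model() as hierarchical_model:
    # Hyperpriors shared by all M groups
    alpha_mu = pm.Normal('alpha_mu', mu=0, sd=10)
    alpha_sd = pm.HalfNormal('alpha_sd', 10)
    beta_mu = pm.Normal('beta_mu', mu=0, sd=10)
    beta_sd = pm.HalfNormal('beta_sd', 10)

    # One intercept and one slope per group, tied together by the hyperpriors
    alpha = pm.Normal('alpha', mu=alpha_mu, sd=alpha_sd, shape=M)
    beta = pm.Normal('beta', mu=beta_mu, sd=beta_sd, shape=M)
    epsilon = pm.HalfCauchy('epsilon', 5)

    y_pred = pm.Normal('y_pred', mu=alpha[idx] + beta[idx] * x_m,
                       sd=epsilon, observed=y_m)
    trace_h = pm.sample(1000)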

Polynomial regression

I hope you are excited about the skills you have learned so far in this chapter. Now, we are going to learn how to fit curves using linear regression. One way to fit curves with a linear regression model is to build a polynomial, like this:

$\mu = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_m x^m$

If we pay attention, we can see that the simple linear model is hidden inside this polynomial. To uncover it, all we need to do is set all the coefficients of order higher than one exactly to zero. Then, we will get:

$\mu = \beta_0 + \beta_1 x$

Polynomial regression is still linear regression; the linearity in the model refers to how the parameters enter the model, not to the variables. Let's try building a polynomial regression of degree 2:

$\mu = \beta_0 + \beta_1 x + \beta_2 x^2$

The third term, $\beta_2 x^2$, controls the curvature of the relationship.

As a dataset, we are going to use the second group of the Anscombe quartet:

# 'ans' is the Anscombe quartet DataFrame loaded earlier in the book
x_2 = ans[ans.group == 'II']['x'].values
y_2 = ans[ans.group == 'II']['y'].values
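A degree-2 polynomial regression then adds one coefficient per power of x_2. This is a sketch under illustrative priors, not necessarily the chapter's exact code:

import pymc3 as pm

with pm.Model() as model_poly:
    beta0 = pm.Normal('beta0', mu=y_2.mean(), sd=1)  # intercept
    beta1 = pm.Normal('beta1', mu=0, sd=1)           # linear term
    beta2 = pm.Normal('beta2', mu=0, sd=1)           # curvature term
    epsilon = pm.HalfCauchy('epsilon', 5)
    mu = beta0 + beta1 * x_2 + beta2 * x_2**2
    y_pred = pm.Normal('y_pred', mu=mu, sd=epsilon, observed=y_2)
    trace_poly = pm.sample(1000)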

Multiple linear regression

So far, we have been working with one dependent variable and one independent variable. Nevertheless, it is not unusual to have several independent variables that we want to include in our model. Some examples could be:

  • Perceived quality of wine (dependent) and acidity, density, alcohol level, residual sugar, and sulphates content (independent variables)
  • A student's average grades (dependent) and family income, distance from home to school, and mother's education level (independent variables, the last one categorical)

We can easily extend the simple linear regression model to deal with more than one independent variable. We call this model multiple linear regression or, less often, multivariable linear regression (not to be confused with multivariate linear regression, the case where we have multiple dependent variables).

In a multiple linear regression model, we model the...
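Structurally, the only change from simple linear regression is that the slope becomes a vector of coefficients, one per predictor, combined with the data through a dot product. A minimal sketch, assuming an illustrative synthetic design matrix X rather than the chapter's actual dataset:

import numpy as np
import pymc3 as pm

# Illustrative synthetic data: 100 observations, 2 predictors
np.random.seed(0)
X = np.random.normal(size=(100, 2))
y = 1.0 + X @ np.array([0.8, -1.2]) + np.random.normal(0, 0.5, 100)

with pm.Model() as model_mlr:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=1, shape=X.shape[1])  # one slope per predictor
    epsilon = pm.HalfCauchy('epsilon', 5)
    mu = alpha + pm.math.dot(X, beta)
    y_pred = pm.Normal('y_pred', mu=mu, sd=epsilon, observed=y)
    trace_mlr = pm.sample(1000)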

Variable variance

We have been using the linear motif to model the mean of a distribution and, in the previous section, we used it to model interactions. We can also use it to model the variance (or standard deviation) when the assumption of constant variance does not make sense. In those cases, we may want to consider the variance as a (linear) function of the independent variable.
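In other words, we use the linear motif twice: once for the mean and once for the standard deviation, with the latter constrained to stay positive. A hedged sketch, assuming x and y are one-dimensional NumPy arrays with non-negative x (such as ages) and illustrative priors:

import pymc3 as pm

with pm.Model() as model_vv:
    alpha = pm.Normal('alpha', sd=10)
    beta = pm.Normal('beta', sd=10)
    # Half-normal priors keep gamma and delta positive, so the standard
    # deviation gamma + delta * x stays positive for non-negative x
    gamma = pm.HalfNormal('gamma', sd=10)
    delta = pm.HalfNormal('delta', sd=10)

    mu = pm.Deterministic('mu', alpha + beta * x)
    sd = pm.Deterministic('sd', gamma + delta * x)
    y_pred = pm.Normal('y_pred', mu=mu, sd=sd, observed=y)
    trace_vv = pm.sample(1000)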

The World Health Organization (WHO) and other health institutions around the world collect data on newborns and toddlers and design growth chart standards. These charts are an essential component of the paediatric toolkit and also a measure of the general well-being of populations, used to formulate health-related policies, plan interventions, and monitor their effectiveness (http://www.who.int/childgrowth/en/).

An example of such data is the lengths (heights) of newborn/toddlers girls as a function...

Summary

A simple linear regression is a model that can be used to predict and/or explain one variable from another one. In machine learning language, this is a case of supervised learning. From a probabilistic perspective, a linear regression model is an extension of the Gaussian model where the mean is not directly estimated but rather computed as a linear function of a predictor variable and some additional parameters. While the Gaussian distribution is the most common choice for the dependent variable, we are free to choose other distributions. One alternative, which is especially useful when dealing with potential outliers, is the Student's t-distribution. In the next chapter, we will explore other alternatives.

In this chapter, we also discussed the Pearson correlation coefficient, the most common measure of linear correlation between two variables, and we learned...

Exercises

  1. Check the following definition of a probabilistic model. Identify the likelihood, the prior, and the posterior:
  2. For the model in exercise 1, how many parameters does the posterior have? In other words, how many dimensions does it have?
  3. Write down Bayes' theorem for the model in exercise 1.
  4. Check the following model. Identify the linear model and identify the likelihood. How many parameters does the posterior have?
  5. For the model in exercise 1, assume that you have a dataset with 57 data points coming from a Gaussian with a mean of 4 and a standard deviation of 0.5. Using PyMC3, compute:
    • The posterior distribution
    • The prior distribution
    • The posterior predictive distribution
    • The prior predictive distribution

Tip: Besides pm.sample(), PyMC3 has other functions to compute samples.

  6. Execute model_g using NUTS (the default sampler) and then...