You're reading from Mastering pandas - Second Edition

Product type: Book
Published in: Oct 2019
Reading level: Intermediate
ISBN-13: 9781789343236
Edition: 2nd Edition
Author (1)
Ashish Kumar

Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. His core areas of expertise include Natural Language Processing, IoT Analytics, R Shiny product development, and Ensemble ML methods. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

A Brief Tour of Bayesian Statistics and Maximum Likelihood Estimates

In this chapter, we take a brief tour of an alternative approach to statistical inference called Bayesian statistics. It is not intended to be a full primer, but will simply serve as an introduction to the Bayesian approach. We will also explore the associated Python-related libraries and learn how to use pandas and matplotlib to help with the data analysis. The various topics that will be discussed are as follows:

  • Introduction to Bayesian statistics
  • The mathematical framework for Bayesian statistics
  • Probability distributions
  • Bayesian versus frequentist statistics
  • Introduction to PyMC and Monte Carlo simulations
  • Bayesian analysis example – switchpoint detection

Introduction to Bayesian statistics

The field of Bayesian statistics is built on the work of Reverend Thomas Bayes, an 18th-century statistician, philosopher, and Presbyterian minister. His famous Bayes' theorem, which forms the theoretical underpinnings of Bayesian statistics, was published posthumously in 1763 as a solution to the problem of inverse probability. For more details on this topic, refer to http://en.wikipedia.org/wiki/Thomas_Bayes.

Inverse probability problems were all the rage in the early 18th century, and were often formulated as follows.

Suppose you play a game with a friend. There are 10 green balls and 7 red balls in bag 1 and 4 green balls and 7 red balls in bag 2. Your friend tosses a coin (without telling you the result), picks a ball from one of the bags at random, and shows it to you. The ball is red. What is the probability that the ball was drawn...
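As a hedged illustration of how a problem of this kind can be worked out, the following sketch makes two assumptions that are not stated in the text above: the coin is fair (so each bag is chosen with probability 1/2), and the truncated question asks for the probability that the red ball came from bag 1. The function name posterior_bag1_given_red is hypothetical and used only for this example.

```python
from fractions import Fraction


def posterior_bag1_given_red():
    """Probability that a red ball came from bag 1 (assumes a fair coin)."""
    # Assumption: a fair coin selects the bag, so each bag has prior 1/2.
    prior_bag1 = Fraction(1, 2)
    prior_bag2 = Fraction(1, 2)

    # Probability of drawing red from each bag, using the counts in the text:
    # bag 1 holds 10 green and 7 red, bag 2 holds 4 green and 7 red.
    red_given_bag1 = Fraction(7, 10 + 7)
    red_given_bag2 = Fraction(7, 4 + 7)

    # Total probability of seeing a red ball, summed over both bags.
    p_red = prior_bag1 * red_given_bag1 + prior_bag2 * red_given_bag2

    # Invert the conditional: P(bag 1 | red).
    return prior_bag1 * red_given_bag1 / p_red


result = posterior_bag1_given_red()
print(result)  # 11/28
```

Under these assumptions, a red ball is more likely to have come from bag 2, because red balls make up a larger share of that bag.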

The mathematical framework for Bayesian statistics

Bayesian methods offer an alternative way of making statistical inferences. We will first look at Bayes' theorem, the fundamental equation from which all Bayesian inference is derived.

A few definitions regarding probability are in order before we begin:

  • A,B: These are events that can occur with a certain probability.
  • P(A) and P(B): These are the probabilities of events A and B occurring, respectively.
  • P(A|B): This is the probability of A happening, given that B has occurred. This is known as a conditional probability.
  • P(AB) = P(A and B): This is the probability of A and B occurring together.

We begin with the following basic relation, which follows from the definition of conditional probability:

P(AB) = P(B) * P(A|B)

The preceding equation shows the relation of the joint probability of P(AB) to the conditional probability P(A|B) and what is known as the marginal probability, P(B). If...
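The relation P(AB) = P(B) * P(A|B) can be checked by exact enumeration on a small sample space. The sketch below uses two fair dice, with the events A (the dice sum to 8) and B (the first die is even) chosen purely for illustration; neither the events nor the helper names come from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def is_a(outcome):
    """Illustrative event A: the two dice sum to 8."""
    return outcome[0] + outcome[1] == 8


def is_b(outcome):
    """Illustrative event B: the first die shows an even number."""
    return outcome[0] % 2 == 0


def prob(event):
    """Exact probability of an event given as a predicate over outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))


p_b = prob(is_b)                              # marginal probability P(B)
p_ab = prob(lambda o: is_a(o) and is_b(o))    # joint probability P(AB)

# Conditional probability P(A|B): restrict the sample space to outcomes in B.
b_outcomes = [o for o in outcomes if is_b(o)]
p_a_given_b = Fraction(sum(1 for o in b_outcomes if is_a(o)), len(b_outcomes))

print(p_ab, p_b, p_a_given_b)  # 1/12 1/2 1/6
print(p_ab == p_b * p_a_given_b)  # True
```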

Probability distributions

In this section, we will briefly examine the properties of various probability distributions. Many of these distributions are used in Bayesian analysis, so a brief synopsis is needed before we can proceed. We will also illustrate how to generate and display these distributions using matplotlib. To avoid repeating the import statements in every code snippet, we present the following standard set of Python imports, which must be run before any of the code snippets in this section. You only need to run these imports once per session. The imports are as follows:

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors
# IPython/Jupyter magic: render plots inline (only valid inside a notebook)
%matplotlib inline
  
...
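As a minimal sketch of generating and displaying a distribution, the following example computes the probability mass function of a binomial distribution and draws it as a bar chart. The choice of the binomial distribution, the parameter values, and the helper names binomial_pmf and plot_binomial_pmf are assumptions made for illustration; they are not taken from the chapter's own listings.

```python
from math import comb


def binomial_pmf(n, p):
    """Return [P(X = k) for k = 0..n] where X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def plot_binomial_pmf(n, p):
    """Draw the PMF as a bar chart and return the matplotlib Axes."""
    # Imported here so that binomial_pmf stays usable without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(range(n + 1), binomial_pmf(n, p))
    ax.set_xlabel("k (number of successes)")
    ax.set_ylabel("P(X = k)")
    ax.set_title(f"Binomial(n={n}, p={p})")
    return ax


# Ten trials with success probability 0.5 (illustrative values).
pmf = binomial_pmf(10, 0.5)
print(round(sum(pmf), 12))  # 1.0
```

Calling plot_binomial_pmf(10, 0.5) in a notebook session should display a symmetric bar chart peaking at k = 5.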

Bayesian statistics versus frequentist statistics

In statistics today, there are two schools of thought as to how we interpret data and make statistical inferences. The classical and more dominant approach to date has been what is termed the frequentist approach (refer to Chapter 7, A Tour of Statistics – The Classical Approach). We are looking at the Bayesian approach in this chapter.

What is probability?

At the heart of the debate between the Bayesian and frequentist worldview is the question of how we define probability.

In the frequentist worldview, probability is a notion that is derived from the frequencies of repeated events—for example, when we define the probability of getting heads when a fair coin...
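The frequentist notion of probability as a long-run relative frequency can be illustrated with a small simulation. This is only a sketch: the function name relative_frequency_of_heads, the seed, and the sample sizes are arbitrary choices for the example.

```python
import random


def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the fraction of heads."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The relative frequency tends to settle near 0.5 as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```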

Conducting Bayesian statistical analysis

Conducting a Bayesian statistical analysis involves the following steps:

  1. Specifying a probability model: In this step, we fully describe the model using a probability distribution. Based on the distribution of a sample that we have taken, we try to fit a model to it and attempt to assign probabilities to unknown parameters.
  2. Calculating a posterior distribution: The posterior distribution is a distribution that we calculate in light of observed data. In this case, we will directly apply Bayes' formula. It will be specified as a function of the probability model that we specified in the previous step.
  3. Checking our model: This is a necessary step where we review our model and its outputs before we make inferences. Bayesian inference methods use probability distributions to assign probabilities to possible outcomes.
...
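The steps above can be sketched on a toy problem: estimating the bias of a coin from a handful of tosses. The data (7 heads in 10 tosses), the uniform prior, and the grid approximation are all assumptions chosen for illustration; a real analysis would typically use a more thorough model check, such as comparing simulated data against the observations.

```python
from math import comb

# Hypothetical data: 7 heads observed in 10 tosses of a coin with unknown bias.
n_tosses, n_heads = 10, 7

# Step 1: specify the probability model.
# Binomial likelihood for the data, uniform prior over a grid of bias values.
grid = [i / 1000 for i in range(1001)]
prior = [1 / len(grid)] * len(grid)


def likelihood(theta):
    """Probability of the observed data if the coin's bias is theta."""
    return comb(n_tosses, n_heads) * theta**n_heads * (1 - theta) ** (n_tosses - n_heads)


# Step 2: calculate the posterior with Bayes' formula (prior * likelihood, normalized).
unnormalized = [p * likelihood(t) for p, t in zip(prior, grid)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

# Step 3: check the model output before drawing inferences.
# With a uniform prior the exact posterior is Beta(8, 4), whose mean is 8/12.
posterior_mean = sum(t * w for t, w in zip(grid, posterior))
print(round(sum(posterior), 9), round(posterior_mean, 3))
```

The grid approach is used here only because it keeps every step visible; it does not scale to models with many parameters, which is one reason sampling methods are used in practice.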

Monte Carlo estimation of the likelihood function and PyMC

Bayesian statistics isn't just another method. It is an entirely different paradigm for practicing statistics. It uses probability models for making inferences, given the data that has been collected. This can be expressed in a fundamental expression as P(H|D).

Here, H is our hypothesis, that is, the thing we're trying to prove, and D is our data or observations.

As a reminder of our previous discussion, the diachronic form of Bayes' theorem is as follows:

P(H|D) = P(H) * P(D|H) / P(D)

Here, P(H) is an unconditional prior probability that represents what we know before we conduct our trial. P(D|H) is our likelihood function, or probability of obtaining the data we observe, given that our hypothesis is true.

P(D) is the probability of the data, also known as the normalizing constant. This can be obtained by integrating the numerator...
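To make the role of the normalizing constant concrete, the sketch below estimates P(D) by plain Monte Carlo integration: it draws hypotheses from the prior and averages the likelihood over those draws. It deliberately uses only the Python standard library rather than PyMC, and the coin-toss data (7 heads in 10 tosses), the uniform prior, and the function names are assumptions made for this example.

```python
import random
from math import comb


def likelihood(theta, n=10, k=7):
    """P(D|H): probability of k heads in n tosses if the coin's bias is theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def monte_carlo_evidence(n_samples=100_000, seed=0):
    """Estimate P(D) by averaging the likelihood over draws from a uniform prior."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    total = sum(likelihood(rng.random()) for _ in range(n_samples))
    return total / n_samples


estimate = monte_carlo_evidence()

# For a uniform prior and a binomial likelihood, the integral has the
# closed form 1 / (n + 1), which gives a reference value to compare against.
exact = 1 / 11
print(estimate, exact)
```

The estimate should land close to the closed-form value, with the gap shrinking as n_samples grows. Closed forms like this are rarely available for realistic models, which is what motivates Monte Carlo methods.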

Summary

In this chapter, we undertook a whirlwind tour of one of the hottest trends in statistics and data analysis in the past few years—the Bayesian approach to statistical inference. We covered a lot of ground here.

We examined what the Bayesian approach to statistics entails and discussed the various reasons why the Bayesian view is a compelling one, such as the fact that it values facts over belief. We explained the key statistical distributions and showed how we can use the various statistical packages to generate and plot them in matplotlib.

We tackled a rather difficult topic without too much oversimplification and demonstrated how we can use the PyMC package and Monte Carlo simulation methods to showcase the power of Bayesian statistics to formulate models, perform trend analysis, and make inferences on a real-world dataset (Facebook user posts). The concept of...
