You're reading from  The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st Edition
Author (1)
Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.

Probability Basics

A probability distribution is an essential concept in statistics and machine learning. It describes how probability is assigned to the potential outcomes or events of an experiment or random process. There are different types of probability distributions, depending on the domain and the characteristics of the data. A well-chosen probability distribution helps us understand and model the behavior of random processes and events, and it supports decision-making and prediction when developing data-driven predictive and optimization models.

By the end of this chapter, you will understand the common probability distributions and their parameters. You will also be able to use these distributions to perform common tasks in R, such as sampling and probability calculations, and to work with common sampling distributions and order statistics.

In this chapter, we will cover the following topics:

  • Introducing...

Technical requirements

To run the code in this chapter, you will need to have the latest versions of the following packages:

  • ggplot2, 3.4.0
  • dplyr, 1.0.10

Please note that the versions of the packages mentioned in the preceding list are the latest ones at the time of writing this chapter.

The code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/blob/main/Chapter_10/working.R.

Introducing probability distribution

A probability distribution provides a framework for understanding and predicting the behavior of random variables. Once we know the underlying data-generating probability distribution, we can make more informed decisions about which outcomes are likely to occur, in either a predictive or an optimization context. In other words, if the selected probability distribution models the observed data well, we have a powerful tool for predicting potential future values, as well as the uncertainty of those predictions.

Here, a random variable is a variable whose value is not fixed and may assume multiple or infinitely many possible values, representing the outcomes (or realizations) of a random event. Probability distributions allow us to represent and analyze the probability of these outcomes, offering a comprehensive view of the underlying uncertainties in various scenarios. A probability distribution takes the random variable, denoted as x, and converts it...
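To make the idea of a random variable and its realizations concrete, here is a minimal sketch in R. It assumes a fair six-sided die as the random process; the die, the number of rolls, and the variable names are illustrative choices rather than anything taken from the chapter's own code.

```r
set.seed(123)  # for reproducibility

# Assumed random process: a fair six-sided die
outcomes <- 1:6
probs <- rep(1 / 6, 6)  # probability assigned to each possible outcome

# Draw 10,000 realizations of the random variable
rolls <- sample(outcomes, size = 10000, replace = TRUE, prob = probs)

# Relative frequency of each outcome among the realizations
rel_freq <- table(factor(rolls, levels = outcomes)) / length(rolls)

# Largest gap between an observed frequency and its assumed probability
max_gap <- max(abs(as.numeric(rel_freq) - probs))
print(rel_freq)
print(max_gap)
```

With this many rolls, the relative frequencies should sit close to the assumed probabilities, which is the sense in which the distribution describes the long-run behavior of the random variable.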

Exploring common discrete probability distributions

Discrete probability distributions are characterized by their corresponding probability mass functions (PMFs), which assign a probability to each possible outcome of the input random variable. The probabilities of all possible outcomes in a discrete distribution sum to 1, that is, $\sum_{i=1}^{C} f(x_i) = 1$, which means that one of the outcomes must occur. Each possible outcome also has a positive probability: $f(x_i) > 0$ for $i = 1, \ldots, C$.
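These two properties can be checked numerically. The sketch below uses a binomial distribution with 10 trials and a success probability of 0.3; both values are arbitrary choices made for illustration.

```r
# Illustrative binomial distribution (parameter values are assumptions)
n_trials <- 10
p_success <- 0.3
outcomes <- 0:n_trials  # all C = 11 possible outcomes

# dbinom() evaluates the PMF at each possible outcome
pmf <- dbinom(outcomes, size = n_trials, prob = p_success)

# Every possible outcome has a positive probability
all_positive <- all(pmf > 0)

# The probabilities over all outcomes sum to 1 (up to floating-point error)
pmf_total <- sum(pmf)

print(all_positive)
print(all.equal(pmf_total, 1))
```

The comparison uses all.equal() rather than == because floating-point sums are rarely exact.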

Discrete probability distributions are vital in various fields, such as finance. They are commonly used for statistical analyses, including hypothesis testing, parameter estimation, and predictive modeling. We can use discrete probability distributions to quantify uncertainties, make predictions, and gain insights into the underlying data-generating process of the observed outcomes.

Let’s start with the most fundamental discrete distribution: the Bernoulli distribution.

The Bernoulli distribution...

Discovering common continuous probability distributions

Continuous probability distributions model the probability of random variables that can take any value within a specific continuous range. In other words, the underlying random variable is continuous instead of discrete. These distributions describe the probability of observing a value that falls within a continuous interval, rather than the probability of an individual outcome as in a discrete probability distribution. Specifically, in a continuous probability distribution, the probability that the random variable equals any specific value is zero, since the possible outcomes are uncountable. Instead, probabilities for continuous distributions are calculated for intervals or ranges of values.
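As a small illustration of interval probabilities, the following sketch uses the standard normal distribution and the interval from -1 to 1, both chosen here purely as an example. It computes the interval probability in two ways and shows that a zero-width interval carries zero probability.

```r
# Illustrative interval under the standard normal distribution
lower <- -1
upper <- 1

# P(lower <= X <= upper) as a difference of CDF values
prob_cdf <- pnorm(upper) - pnorm(lower)

# The same probability as the area under the density curve
prob_area <- integrate(dnorm, lower = lower, upper = upper)$value

# A zero-width interval has zero probability,
# even though the density at that point is positive
prob_point <- pnorm(0) - pnorm(0)
density_at_zero <- dnorm(0)

print(c(prob_cdf, prob_area, prob_point, density_at_zero))
```

Note that dnorm(0) returns a density, not a probability, which is why it can be positive while the probability of observing exactly 0 is zero.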

We can use a PDF to describe a continuous distribution. This corresponds to the PMF of a discrete probability distribution. The PDF defines the probability of observing a value within an infinitesimally small interval around a given...

Understanding common sampling distributions

A sampling distribution is the probability distribution of a sample statistic based on many samples drawn from a population. In other words, it is the distribution of a particular statistic (such as the mean, median, or proportion) calculated from many sets of samples from the same population, where each set has the same size. There are two things to take note of here. First, the sampling distribution is not the distribution of the individual random samples drawn from the PDF. Instead, it is the distribution of an aggregate statistic, where each value of the statistic is computed from one set of samples drawn from the PDF. Second, we need to sample from the PDF in multiple rounds to create the sampling distribution, where each round consists of multiple samples from the PDF.

Let’s look at an exercise in R to illustrate the concept of the sampling distribution using the sample mean as the statistic of interest. We will generate samples from a population whose distribution...
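The sketch below shows one way such a simulation could look. It is not necessarily the chapter's own exercise: the exponential population with rate 1, the number of rounds, and the sample size are all assumptions made for illustration.

```r
set.seed(42)  # for reproducibility

n_rounds <- 2000   # number of rounds (sets of samples)
sample_size <- 30  # observations per round, the same in every round

# In each round, draw one set of samples from the assumed population
# and reduce it to a single statistic: the sample mean
sample_means <- replicate(n_rounds, mean(rexp(sample_size, rate = 1)))

# The collected sample means form an empirical sampling distribution
mean_of_means <- mean(sample_means)  # should be near the population mean, 1
sd_of_means <- sd(sample_means)      # should be near 1 / sqrt(sample_size)

print(c(mean_of_means, sd_of_means, 1 / sqrt(sample_size)))
```

Calling hist(sample_means) on the result gives a quick visual check of the shape of the sampling distribution.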

Understanding order statistics

Order statistics are the values of a collection of samples when arranged in ascending or descending order. These ordered samples provide useful information about the distribution and characteristics of the sampled data. Usually, the kth order statistic is the kth smallest value in the sorted sample.

For example, for a collection of samples of size n, the order statistics are denoted as $X_1, X_2, \ldots, X_n$, where $X_1$ is the smallest value (the minimum), $X_n$ is the largest value (the maximum), and $X_k$ represents the kth smallest value in the sorted sample.

Let’s look at how to extract order statistics in R.

Extracting order statistics

Extracting the order statistics of a collection of samples could involve two types of tasks. We may be interested in collecting samples in an ordered fashion, which can be achieved using the sort() function. Alternatively, we may be interested in extracting...
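As a minimal sketch of the sort() approach, the code below orders a simulated sample and then indexes into the sorted vector to pick out the minimum, the maximum, and the kth smallest value. The standard normal sample of size 10 and the choice k = 3 are assumptions made for illustration.

```r
set.seed(1)  # for reproducibility

x <- rnorm(10)  # illustrative sample of size n = 10

# Arrange the samples in ascending and descending order
x_sorted <- sort(x)
x_sorted_desc <- sort(x, decreasing = TRUE)

# Index into the ascending vector to extract individual order statistics
k <- 3
kth_smallest <- x_sorted[k]
sample_min <- x_sorted[1]
sample_max <- x_sorted[length(x_sorted)]

print(x_sorted)
print(c(sample_min, kth_smallest, sample_max))
```

Sorting once and indexing afterward is convenient when several order statistics are needed from the same sample.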

Summary

In this chapter, we covered common probability distributions. We started by introducing discrete probability distributions, including the Bernoulli distribution, the binomial distribution, the Poisson distribution, and the geometric distribution. We then covered common continuous probability distributions, including the normal distribution, the exponential distribution, and the uniform distribution. Next, we introduced common sampling distributions and their use in statistical inference about the population. Finally, we covered order statistics and their use in calculating the value at risk (VaR) in the context of daily stock returns.

In the next chapter, we will cover statistical estimation procedures, including point estimation, the central limit theorem, and the confidence interval.
