You're reading from The Statistics and Machine Learning with R Workshop

Product type: Book
Published in: Oct 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781803240305
Edition: 1st Edition
Concepts: Machine Learning
Author: Liu Peng

Probability Basics

A probability distribution is an essential concept in statistics and machine learning. It describes how likely each potential outcome or event of an experiment or random process is. There are different types of probability distributions, depending on the specific domain and characteristics of the data. A well-chosen probability distribution helps us understand and model the behavior of random processes and events, and supports decision-making and prediction when developing data-driven predictive and optimization models.

By the end of this chapter, you will understand the common probability distributions and their parameters. You will also be able to use these distributions to perform common tasks in R, such as sampling and probability calculations, and to work with common sampling distributions and order statistics.

In this chapter, we will cover the following topics:

Introducing...

Technical requirements

To run the code in this chapter, you will need to have the latest versions of the following packages:

ggplot2, 3.4.0
dplyr, 1.0.10

Please note that the versions of the packages mentioned in the preceding list are the latest ones at the time of writing this chapter.

The code and data for this chapter are available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/blob/main/Chapter_10/working.R.

Introducing probability distribution

A probability distribution provides a framework for understanding and predicting the behavior of random variables. Once we know the underlying data-generating probability distribution, we can make more informed decisions about how things are likely to appear, in either a predictive or an optimization context. In other words, if the selected probability distribution models the observed data well, we have a powerful tool to predict potential future values, as well as the uncertainty around those values.

Here, a random variable is a variable whose value is not fixed and may assume multiple or infinitely many possible values, representing the outcomes (or realizations) of a random event. Probability distributions allow us to represent and analyze the probability of these outcomes, offering a comprehensive view of the underlying uncertainties in various scenarios. A probability distribution takes the random variable, denoted as x, and converts it...

Exploring common discrete probability distributions

Discrete probability distributions are characterized by their corresponding probability mass functions (PMFs), which assign a probability to each possible outcome of the random variable. The probabilities of all possible outcomes in a discrete distribution sum to 1, that is, $\sum_{i=1}^{C} f(x_i) = 1$, where $C$ is the number of possible outcomes. This means that exactly one of the outcomes must occur. Each probability is also non-negative, giving $f(x_i) \geq 0, \forall i = 1, \dots, C$.
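As a quick check of the properties above, the following sketch evaluates the PMF of a binomial distribution over all of its possible outcomes using base R's dbinom(). The choice of the binomial distribution and its parameter values (10 trials, success probability 0.3) are assumptions made only for illustration and are not taken from the chapter.

```r
# Illustrative parameters (assumptions, not from the chapter)
n_trials <- 10
p_success <- 0.3

# All possible outcomes of a binomial random variable: 0, 1, ..., n_trials
outcomes <- 0:n_trials

# PMF value f(x_i) for each possible outcome
pmf <- dbinom(outcomes, size = n_trials, prob = p_success)

# Every probability is non-negative
all_nonneg <- all(pmf >= 0)

# The probabilities sum to 1 (up to floating-point error)
sums_to_one <- isTRUE(all.equal(sum(pmf), 1))

print(all_nonneg)
print(sums_to_one)
```

The comparison uses all.equal() rather than == because the sum of floating-point probabilities may differ from 1 by a tiny rounding error.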

Discrete probability distributions are vital in various fields, such as finance. They are commonly used for statistical analyses, including hypothesis testing, parameter estimation, and predictive modeling. We can use discrete probability distributions to quantify uncertainties, make predictions, and gain insights into the underlying data-generating process of the observed outcomes.

Let’s start with the most fundamental discrete distribution: the Bernoulli distribution.

The Bernoulli distribution...
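As a minimal sketch, a Bernoulli random variable can be simulated in base R with rbinom() by setting size = 1, since a Bernoulli trial is a binomial experiment with a single trial. The success probability, the seed, and the number of draws below are illustrative assumptions rather than values from the chapter.

```r
set.seed(1)       # illustrative seed for reproducibility
p <- 0.3          # assumed probability of success
n_draws <- 10000  # assumed number of simulated trials

# Each draw is 1 (success) with probability p and 0 (failure) otherwise
draws <- rbinom(n_draws, size = 1, prob = p)

# PMF of the Bernoulli distribution evaluated at 0 and 1
pmf <- dbinom(c(0, 1), size = 1, prob = p)

# The empirical proportion of successes should be close to p
empirical_p <- mean(draws)

print(pmf)
print(empirical_p)
```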

Discovering common continuous probability distributions

Continuous probability distributions model random variables that can assume any value within a continuous range. In other words, the underlying random variable is continuous instead of discrete. These distributions describe the probability of observing a value that falls within a continuous interval, rather than the probability of each individual outcome as in a discrete probability distribution. Specifically, in a continuous probability distribution, the probability that the random variable equals any specific value is zero, since the possible outcomes are uncountable. Instead, probabilities are calculated for intervals or ranges of values.
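To make the interval idea concrete, the sketch below computes the probability that a normally distributed random variable falls within one standard deviation of its mean. It uses base R's pnorm() (the cumulative distribution function) and, as a cross-check, numerically integrates the density dnorm() with integrate(). The standard normal distribution and the interval [-1, 1] are assumptions chosen for illustration.

```r
# Standard normal distribution (mean 0, sd 1), chosen for illustration
mu <- 0
sigma <- 1

# P(-1 <= X <= 1) as a difference of two CDF values
prob_interval <- pnorm(1, mean = mu, sd = sigma) - pnorm(-1, mean = mu, sd = sigma)

# The same probability obtained by numerically integrating the density
prob_integral <- integrate(dnorm, lower = -1, upper = 1, mean = mu, sd = sigma)$value

# An interval of zero width carries zero probability
prob_point <- pnorm(1, mean = mu, sd = sigma) - pnorm(1, mean = mu, sd = sigma)

print(prob_interval)  # approximately 0.6827
print(prob_point)
```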

We can use a probability density function (PDF) to describe a continuous distribution. It is the continuous counterpart of the PMF of a discrete probability distribution. The PDF defines the probability of observing a value within an infinitesimally small interval around a given...

Understanding common sampling distributions

A sampling distribution is the probability distribution of a sample statistic based on many samples drawn from a population. In other words, it is the distribution of a particular statistic (such as the mean, median, or proportion) calculated from many sets of samples from the same population, where each set has the same size. There are two things to take note of here. First, the sampling distribution is not the distribution of the raw random samples drawn from the PDF. Instead, it is the distribution of an aggregate statistic, where each value of the statistic is computed from one set of samples drawn from the PDF. Second, we need to sample from the PDF in multiple rounds to create the sampling distribution, where each round consists of multiple samples from the PDF.
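The two points above can be sketched in a few lines of base R. The population distribution (normal with mean 5 and standard deviation 2), the number of rounds, the sample size per round, and the seed are all assumptions made for this illustration.

```r
set.seed(42)       # illustrative seed for reproducibility
n_rounds <- 1000   # number of rounds (sets of samples)
sample_size <- 30  # samples per round, identical across rounds
pop_mean <- 5      # assumed population mean
pop_sd <- 2        # assumed population standard deviation

# In each round, draw sample_size values from the population and
# reduce them to a single statistic: the sample mean
sample_means <- replicate(
  n_rounds,
  mean(rnorm(sample_size, mean = pop_mean, sd = pop_sd))
)

# sample_means is a simulated sampling distribution of the mean
print(mean(sample_means))  # close to pop_mean
print(sd(sample_means))    # close to pop_sd / sqrt(sample_size)
```

Each element of sample_means comes from one round, so the vector has one entry per round rather than one entry per raw sample.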

Let’s look at an exercise in R to illustrate the concept of the sampling distribution using the sample mean as the statistic of interest. We will generate samples from a population whose distribution...

Understanding order statistics

Order statistics are the values of a collection of samples arranged in ascending or descending order. These ordered samples provide useful information about the distribution and characteristics of the sampled data. Usually, the $k$th order statistic is the $k$th smallest value in the sorted sample.

For example, for a collection of samples of size $n$, the order statistics are denoted as $X_1, X_2, \dots, X_n$, where $X_1$ is the smallest value (the minimum), $X_n$ is the largest value (the maximum), and $X_k$ is the $k$th smallest value in the sorted sample.

Let’s look at how to extract order statistics in R.

Extracting order statistics

Extracting the order statistics of a collection of samples could involve two types of tasks. We may be interested in collecting samples in an ordered fashion, which can be achieved using the sort() function. Alternatively, we may be interested in extracting...
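A minimal sketch of the sort() approach mentioned above is shown here, together with one simple way to pick out an individual order statistic by indexing the sorted vector. The sample values and the choice of k are made up for illustration.

```r
# A small illustrative sample (made-up values)
x <- c(3.2, -1.5, 0.7, 4.8, 2.1)

# All order statistics in ascending order
x_sorted <- sort(x)

# Descending order, if preferred
x_desc <- sort(x, decreasing = TRUE)

# The kth order statistic is the kth smallest value
k <- 2
kth_smallest <- x_sorted[k]

# The first and last order statistics are the minimum and maximum
x_min <- x_sorted[1]
x_max <- x_sorted[length(x)]

print(x_sorted)
print(kth_smallest)
```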

Summary

In this chapter, we covered common probability distributions. We started with discrete probability distributions, including the Bernoulli, binomial, Poisson, and geometric distributions. We then covered common continuous probability distributions, including the normal, exponential, and uniform distributions. Next, we introduced common sampling distributions and their use in statistical inference. Finally, we covered order statistics and their use in calculating the value at risk (VaR) in the context of daily stock returns.

In the next chapter, we will cover statistical estimation procedures, including point estimation, the central limit theorem, and the confidence interval.


Author (1)

Liu Peng

Peng Liu is an Assistant Professor of Quantitative Finance (Practice) at Singapore Management University and an adjunct researcher at the National University of Singapore. He holds a Ph.D. in statistics from the National University of Singapore and has ten years of working experience as a data scientist across the banking, technology, and hospitality industries.