Statistical Estimation

In this chapter, we will introduce you to a range of statistical techniques that enable you to make inferences and estimations using both numerical and categorical data. We will explore key concepts and methods, such as hypothesis testing, confidence intervals, and estimation techniques, that empower us to make generalizations about populations from a given sample.

By the end of this chapter, you will grasp the core concepts of statistical inference and be able to perform hypothesis testing in different scenarios.

In this chapter, we will cover the following main topics:

  • Statistical inference for categorical data
  • Statistical inference for numerical data
  • Constructing the bootstrapped confidence interval
  • Introducing the central limit theorem and the t-distribution
  • Constructing the confidence interval for the population mean using the t-distribution
  • Performing hypothesis testing for two means
  • Introducing ANOVA

To run...

Statistical inference for categorical data

A categorical variable takes on distinct categories or levels rather than numerical values. Categorical data is common in our daily lives, with examples including gender (male or female, although a modern view may differ), the type of property sale (new property or resale), and industry. The ability to make sound inferences about such variables is thus essential for drawing meaningful conclusions and making well-informed decisions in diverse contexts.

A categorical variable often cannot be passed to a machine learning (ML) model without additional preprocessing. Take the industry variable, for example. Instead of passing the raw string values (such as "finance" or "technology") to the model, a common approach is to one-hot encode the variable into multiple columns, each corresponding to a specific industry and holding a binary value of 0 or 1, as sketched below.
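
The following minimal sketch illustrates one-hot encoding with base R's model.matrix(); the small industry data frame here is hypothetical and used for illustration only:

# Hypothetical data frame with a categorical industry variable
df <- data.frame(industry = c("finance", "technology", "finance", "hospitality"))

# One-hot encode with model.matrix(); the "- 1" drops the intercept
# so every industry level gets its own 0/1 indicator column
one_hot <- model.matrix(~ industry - 1, data = df)
one_hot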

In this section, we will explore various statistical...

Statistical inference for numerical data

In this section, we will turn to statistical inference using numerical data. We will cover two approaches. The first approach relies on the bootstrapping procedure, which resamples the original dataset with replacement to create additional artificial datasets that can then be used to derive confidence intervals. The second approach places a theoretical assumption on the sampling distribution and relies on the t-distribution to achieve the same result. We will learn how to perform a t-test, derive a confidence interval, and conduct an analysis of variance (ANOVA).

As discussed earlier, bootstrapping is a non-parametric resampling method that allows us to estimate the sampling distribution of a particular statistic, such as the mean, median, or proportion, as in the previous section. This is achieved by repeatedly drawing random samples with replacement from the original data. By doing so, we can calculate confidence intervals and...
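
To make the procedure concrete, here is a minimal bootstrap sketch in base R (illustrative, using the mtcars dataset that appears later in this chapter):

# Bootstrap the sampling distribution of the mean of mtcars$mpg:
# repeatedly resample with replacement and record each resample's mean
set.seed(42)
boot_means <- replicate(1000, mean(sample(mtcars$mpg, replace = TRUE)))
sd(boot_means)  # the bootstrap estimate of the standard error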

Constructing the bootstrapped confidence interval

We have looked at how to construct the bootstrapped confidence interval using the standard error method. This involves adding and subtracting the scaled standard error from the observed sample statistic. It turns out that there is another, simpler method that uses the percentiles of the bootstrap distribution directly to obtain the confidence interval.

Let us continue with the previous example. Say we would like to calculate the 95% confidence interval of the previous bootstrap distribution. We can achieve this by calculating the upper and lower quantiles (97.5% and 2.5%, respectively) of the bootstrap distribution. The following code achieves this:

>>> bs %>%
  summarize(
    l = quantile(stat, 0.025),
    u = quantile(stat, 0.975)
  )
# A tibble: 1 × 2
      l     u
  <dbl> <dbl...
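
For comparison, the standard error method mentioned at the start of this section could be sketched as follows (hypothetical names: bs$stat holds the bootstrap statistics, and obs_stat stands in for the observed sample statistic):

# 95% CI via the standard error method: observed statistic plus or
# minus 1.96 bootstrap standard errors (obs_stat is hypothetical here)
se <- sd(bs$stat)
c(l = obs_stat - 1.96 * se, u = obs_stat + 1.96 * se)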

Introducing the central limit theorem and the t-distribution

The central limit theorem (CLT) states that the sum (or average) of many independent and identically distributed random variables approaches a normal distribution, regardless of the underlying distribution of the individual variables. Due to the CLT, the normal distribution is often used to approximate the sampling distribution of various statistics, such as the sample mean and the sample proportion.
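
A quick simulation sketch (illustrative, not from the chapter's own code) shows the CLT in action: sample means drawn from a heavily skewed exponential distribution still look approximately normal:

# Means of 5,000 samples (each of size 30) from an exponential
# distribution; their histogram is approximately bell-shaped
set.seed(1)
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))
hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean")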

The t-distribution is related to the CLT in the context of statistical inference. When we estimate a population mean from a sample, we often have no access to the true standard deviation of the population and resort to the sample standard deviation as an estimate. In this case, the standardized sample mean follows a t-distribution rather than a standard normal distribution. In other words, when we extract the sample mean from a set of observed samples, and we are unsure of the population...
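
A quick way to see how closely the t-distribution approaches the normal as the sample size grows is to compare its critical values with the normal value of roughly 1.96:

# 97.5% quantiles of the t-distribution shrink toward the normal
# quantile qnorm(0.975) ~ 1.96 as the degrees of freedom increase
qt(0.975, df = c(5, 30, 1000))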

Constructing the confidence interval for the population mean using the t-distribution

Let us review the process of statistical inference for the population mean. We start with a limited sample, from which we can derive the sample mean. Since we want to estimate the population mean, we would like to perform statistical inference based on the observed sample mean and quantify the range in which the population statistic may lie.

For example, the average miles per gallon, shown in the following code, is around 20 in the mtcars dataset:

>>> mean(mtcars$mpg)
[1] 20.09062

Given this result, we won't be surprised to encounter another similar dataset with an average mpg of 19 or 21. However, we would be surprised if the value were 5, 50, or even 100. When assessing a new collection of samples, we need a way to quantify the variability of the sample mean across multiple samples. We have learned two ways to do this: use the bootstrap approach to simulate artificial samples or...
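
As a preview of where this section is heading, base R's t.test() computes the t-based interval in a single call; a minimal sketch on the same mtcars data:

# A t-based 95% confidence interval for the population mean of mpg
t.test(mtcars$mpg)$conf.int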

Performing hypothesis testing for two means

In this section, we will explore the process of comparing two sample means using hypothesis testing. When comparing two sample means, we want to determine whether a significant difference exists between the means of two distinct populations or groups.

Suppose we now have two groups of samples. These two groups could represent, for example, a measured value before and after a treatment. Our objective is thus to compare the sample statistics of these two groups, such as the sample mean, and determine whether the treatment has an effect. To do this, we can perform a hypothesis test to compare the mean values from the two independent distributions using either bootstrap simulation or t-test approximation.

The two-sample t-test assumes that the data in each group follow a normal distribution and that the variances of the two populations are equal. However, in cases...
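
As an illustrative sketch using base R's t.test() (which defaults to the Welch variant that drops the equal-variance assumption; set var.equal = TRUE for the classic two-sample test), we can compare mpg between the two transmission types in mtcars:

# Compare mean mpg between automatic (am == 0) and manual (am == 1)
# cars; the formula interface splits mpg by group
t.test(mpg ~ am, data = mtcars, var.equal = TRUE)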

Introducing ANOVA

ANOVA is a statistical hypothesis testing method used to compare the means of more than two groups, extending the two-sample t-test discussed in the previous section. The goal of ANOVA is to test for significant differences among the group means (the between-group variability) while accounting for the variability within each group (the within-group variability).

ANOVA relies on the F-statistic in hypothesis testing. The F-statistic is a ratio of two estimates of variance: the between-group variance and the within-group variance. The between-group variance measures the differences among the group means, while the within-group variance represents the variability within each group. The F-statistic is calculated as the ratio of these two variance estimates, as sketched below.
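
A minimal, hand-rolled sketch of that calculation in base R (illustrative only; it computes the one-way F-statistic for mpg across cylinder counts in mtcars):

# Between-group and within-group variance estimates for mpg by cyl
groups <- split(mtcars$mpg, mtcars$cyl)
k <- length(groups)        # number of groups
n <- nrow(mtcars)          # total number of observations
grand <- mean(mtcars$mpg)  # grand mean across all groups

ssb <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand)^2))
ssw <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

# F = (between-group mean square) / (within-group mean square)
F_stat <- (ssb / (k - 1)) / (ssw / (n - k))
F_stat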

In hypothesis testing, the null hypothesis for ANOVA states that all group means are equal, and any observed differences are due to chance. The alternative hypothesis, on the other hand, suggests that at...
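
In practice, this test can be run with base R's aov(); an illustrative sketch on the same mtcars example (it should report the same F-statistic as the hand computation above):

# One-way ANOVA: does mean mpg differ across cylinder counts?
fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit)  # reports the F-statistic and its p-value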

Summary

In this chapter, we covered different types of statistical inference for hypothesis testing, targeting both numerical and categorical data. We introduced inference methods for a single variable, two variables, and multiple variables, using either the proportion (for categorical variables) or the mean (for numerical variables) as the sample statistic. The hypothesis testing procedure, including both the parametric approach using model-based approximation and the non-parametric approach using bootstrap-based simulation, offers valuable tools such as the confidence interval and the p-value. These tools allow us to decide whether we can reject the null hypothesis in favor of the alternative hypothesis. Such a decision also relates to Type I and Type II errors.

In the next chapter, we will cover one of the most widely used statistical and ML models: linear regression.
