Statistical Estimation

In this chapter, we will introduce you to a range of statistical techniques that enable you to make inferences and estimations using both numerical and categorical data. We will explore key concepts and methods, such as hypothesis testing, confidence intervals, and estimation techniques, that empower us to make generalizations about populations from a given sample.

By the end of this chapter, you will grasp the core concepts of statistical inference and be able to perform hypothesis testing in different scenarios.

In this chapter, we will cover the following main topics:

  • Statistical inference for categorical data
  • Statistical inference for numerical data
  • Constructing the bootstrapped confidence interval
  • Introducing the central limit theorem and the t-distribution
  • Constructing the confidence interval for the population mean using the t-distribution
  • Performing hypothesis testing for two means
  • Introducing ANOVA

To run...

Statistical inference for categorical data

A categorical variable takes on distinct categories or levels rather than numerical values. Categorical data is common in our daily lives, with examples including gender (male or female, although a modern view may differ), the type of property sale (new property or resale), and industry. The ability to make sound inferences about such variables is thus essential for drawing meaningful conclusions and making well-informed decisions in diverse contexts.

A categorical variable often cannot be passed to a machine learning (ML) model without additional preprocessing. Take the industry variable, for example. Instead of passing the raw string values (such as "finance" or "technology") to the model, a common approach is to one-hot encode the variable into multiple columns, each corresponding to a specific industry and holding a binary value of 0 or 1, as sketched below.
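
The following minimal sketch illustrates one-hot encoding with base R's model.matrix(); the small industry data frame here is hypothetical and used for illustration only:

# Hypothetical data frame with a categorical industry variable
df <- data.frame(industry = c("finance", "technology", "finance", "hospitality"))

# One-hot encode with model.matrix(); the "- 1" drops the intercept
# so every industry level gets its own 0/1 indicator column
one_hot <- model.matrix(~ industry - 1, data = df)
one_hot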

In this section, we will explore various statistical...

Statistical inference for numerical data

In this section, we will turn to statistical inference using numerical data. We will cover two approaches. The first approach relies on the bootstrapping procedure, which resamples the original dataset with replacement to create additional artificial datasets that can then be used to derive confidence intervals. The second approach places a theoretical assumption on the sampling distribution and relies on the t-distribution to achieve the same result. We will learn how to perform a t-test, derive a confidence interval, and conduct an analysis of variance (ANOVA).

As discussed earlier, bootstrapping is a non-parametric resampling method that allows us to estimate the sampling distribution of a particular statistic, such as the mean, median, or proportion, as in the previous section. This is achieved by repeatedly drawing random samples with replacement from the original data. By doing so, we can calculate confidence intervals and...
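
To make the procedure concrete, here is a minimal bootstrap sketch in base R (illustrative, using the mtcars dataset that appears later in this chapter):

# Bootstrap the sampling distribution of the mean of mtcars$mpg:
# repeatedly resample with replacement and record each resample's mean
set.seed(42)
boot_means <- replicate(1000, mean(sample(mtcars$mpg, replace = TRUE)))
sd(boot_means)  # the bootstrap estimate of the standard error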

Constructing the bootstrapped confidence interval

We have looked at how to construct the bootstrapped confidence interval using the standard error method. This involves adding and subtracting the scaled standard error from the observed sample statistic. It turns out that there is another, simpler method that uses the percentiles of the bootstrap distribution directly to obtain the confidence interval.

Let us continue with the previous example. Say we would like to calculate the 95% confidence interval of the previous bootstrap distribution. We can achieve this by calculating the upper and lower quantiles (97.5% and 2.5%, respectively) of the bootstrap distribution. The following code achieves this:

>>> bs %>%
  summarize(
    l = quantile(stat, 0.025),
    u = quantile(stat, 0.975)
  )
# A tibble: 1 × 2
      l     u
  <dbl> <dbl...
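
For comparison, the standard error method mentioned at the start of this section could be sketched as follows (hypothetical names: bs$stat holds the bootstrap statistics, and obs_stat stands in for the observed sample statistic):

# 95% CI via the standard error method: observed statistic plus or
# minus 1.96 bootstrap standard errors (obs_stat is hypothetical here)
se <- sd(bs$stat)
c(l = obs_stat - 1.96 * se, u = obs_stat + 1.96 * se)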

Introducing the central limit theorem and the t-distribution

The central limit theorem (CLT) states that the sum (or average) of many independent and identically distributed random variables approaches a normal distribution, regardless of the underlying distribution of the individual variables. Due to the CLT, the normal distribution is often used to approximate the sampling distribution of various statistics, such as the sample mean and the sample proportion.
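
A quick simulation sketch (illustrative, not from the chapter's own code) shows the CLT in action: sample means drawn from a heavily skewed exponential distribution still look approximately normal:

# Means of 5,000 samples (each of size 30) from an exponential
# distribution; their histogram is approximately bell-shaped
set.seed(1)
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))
hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean")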

The t-distribution is related to the CLT in the context of statistical inference. When we estimate a population mean from a sample, we often have no access to the true standard deviation of the population and resort to the sample standard deviation as an estimate. In this case, the standardized sample mean follows a t-distribution rather than a standard normal distribution. In other words, when we extract the sample mean from a set of observed samples, and we are unsure of the population...
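
A quick way to see how closely the t-distribution approaches the normal as the sample size grows is to compare its critical values with the normal value of roughly 1.96:

# 97.5% quantiles of the t-distribution shrink toward the normal
# quantile qnorm(0.975) ~ 1.96 as the degrees of freedom increase
qt(0.975, df = c(5, 30, 1000))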

Constructing the confidence interval for the population mean using the t-distribution

Let us review the process of statistical inference for the population mean. We start with a limited sample, from which we can derive the sample mean. Since we want to estimate the population mean, we would like to perform statistical inference based on the observed sample mean and quantify the range in which the population statistic may lie.

For example, the average miles per gallon, shown in the following code, is around 20 in the mtcars dataset:

>>> mean(mtcars$mpg)
[1] 20.09062

Given this result, we won't be surprised to encounter another similar dataset with an average mpg of 19 or 21. However, we would be surprised if the value were 5, 50, or even 100. When assessing a new collection of samples, we need a way to quantify the variability of the sample mean across multiple samples. We have learned two ways to do this: use the bootstrap approach to simulate artificial samples or...
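
As a preview of where this section is heading, base R's t.test() computes the t-based interval in a single call; a minimal sketch on the same mtcars data:

# A t-based 95% confidence interval for the population mean of mpg
t.test(mtcars$mpg)$conf.int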

Performing hypothesis testing for two means

In this section, we will explore the process of comparing two sample means using hypothesis testing. When comparing two sample means, we want to determine whether a significant difference exists between the means of two distinct populations or groups.

Suppose we now have two groups of samples. These two groups could represent, for example, a measured value before and after a treatment. Our objective is thus to compare the sample statistics of these two groups, such as the sample mean, and determine whether the treatment has an effect. To do this, we can perform a hypothesis test to compare the mean values from the two independent distributions using either bootstrap simulation or t-test approximation.

The two-sample t-test assumes that the data in each group follow a normal distribution and that the variances of the two populations are equal. However, in cases...
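
As an illustrative sketch using base R's t.test() (which defaults to the Welch variant that drops the equal-variance assumption; set var.equal = TRUE for the classic two-sample test), we can compare mpg between the two transmission types in mtcars:

# Compare mean mpg between automatic (am == 0) and manual (am == 1)
# cars; the formula interface splits mpg by group
t.test(mpg ~ am, data = mtcars, var.equal = TRUE)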

Introducing ANOVA

ANOVA is a statistical hypothesis testing method used to compare the means of more than two groups, extending the two-sample t-test discussed in the previous section. The goal of ANOVA is to test for significant differences among the group means (the between-group variability) while accounting for the variability within each group (the within-group variability).

ANOVA relies on the F-statistic in hypothesis testing. The F-statistic is a ratio of two estimates of variance: the between-group variance and the within-group variance. The between-group variance measures the differences among the group means, while the within-group variance represents the variability within each group. The F-statistic is calculated as the ratio of these two variance estimates, as sketched below.
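
A minimal, hand-rolled sketch of that calculation in base R (illustrative only; it computes the one-way F-statistic for mpg across cylinder counts in mtcars):

# Between-group and within-group variance estimates for mpg by cyl
groups <- split(mtcars$mpg, mtcars$cyl)
k <- length(groups)        # number of groups
n <- nrow(mtcars)          # total number of observations
grand <- mean(mtcars$mpg)  # grand mean across all groups

ssb <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand)^2))
ssw <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

# F = (between-group mean square) / (within-group mean square)
F_stat <- (ssb / (k - 1)) / (ssw / (n - k))
F_stat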

In hypothesis testing, the null hypothesis for ANOVA states that all group means are equal, and any observed differences are due to chance. The alternative hypothesis, on the other hand, suggests that at...
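
In practice, this test can be run with base R's aov(); an illustrative sketch on the same mtcars example (it should report the same F-statistic as the hand computation above):

# One-way ANOVA: does mean mpg differ across cylinder counts?
fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit)  # reports the F-statistic and its p-value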

Summary

In this chapter, we covered different types of statistical inference for hypothesis testing, targeting both numerical and categorical data. We introduced inference methods for a single variable, two variables, and multiple variables, using either the proportion (for categorical variables) or the mean (for numerical variables) as the sample statistic. The hypothesis testing procedure, including both the parametric approach using model-based approximation and the non-parametric approach using bootstrap-based simulation, offers valuable tools such as the confidence interval and the p-value. These tools allow us to decide whether we can reject the null hypothesis in favor of the alternative hypothesis. Such a decision also relates to Type I and Type II errors.

In the next chapter, we will cover one of the most widely used statistical and ML models: linear regression.
