
You're reading from Building Statistical Models in Python

Product type: Book
Published in: Aug 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781804614280
Edition: 1st Edition
Authors (3):
Huy Hoang Nguyen

Huy Hoang Nguyen is a mathematician and data scientist with far-ranging experience spanning advanced mathematics, strategic leadership, and applied machine learning research. He holds a Master's in Data Science and a PhD in Mathematics. His previous work concerned partial differential equations, functional analysis, and their applications in fluid mechanics. He transitioned from academia to the healthcare industry and has delivered data science projects ranging from traditional machine learning to deep learning.

Paul N Adams

Paul Adams is a data scientist with a background primarily in the healthcare industry. He applies statistics and machine learning across multiple areas of industry, focusing on projects in process engineering, process improvement, metrics and business rules development, anomaly detection, forecasting, clustering, and classification. Paul holds a Master of Science in Data Science from Southern Methodist University.

Stuart J Miller

Stuart Miller is a machine learning engineer with degrees in Data Science, Electrical Engineering, and Engineering Physics. He has worked at several Fortune 500 companies, including Texas Instruments and StateFarm, where he built software that utilized statistical and machine learning techniques. Stuart is currently an engineer at Toyota Connected, helping to build a more modern cockpit experience for drivers using machine learning.


Population inference from samples

When using a statistical model to draw inferential conclusions about a population from a sample of that population, the study design must account for the same sources of uncertainty in its variables as exist in the population. This is the variation mentioned earlier in this chapter. To appropriately draw inferential conclusions about a population, a statistical model must be structured around a chance mechanism. Studies structured around chance mechanisms are called randomized experiments, and they provide an understanding of both correlation and causation.

Randomized experiments

There are two primary characteristics of a randomized experiment:

  • Random sampling, colloquially referred to as random selection
  • Random assignment of treatments, which is the nature of the study

Random sampling

Random sampling (also called random selection) is designed to create a sample that is representative of the overall population, so that statistical models generalize from the sample to the population well enough to support inferential conclusions. For random sampling to be successful, the population of interest must be well defined, and every member of the population must have a chance of being selected. Consider the example of polling voters: all voters must be willing to be polled. Once all voters are entered into a lottery, random sampling can be used to subset voters for modeling. Sampling only from the voters who happen to be willing to be polled introduces sampling bias into statistical modeling, which can lead to skewed results. The sampling method in the scenario where only some voters are willing to participate is called self-selection. Any information obtained and modeled from self-selected samples, or any other non-random samples, cannot be used for inference.
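The mechanics of a simple random sample can be sketched with Python's standard `random` module. The voter list and sample size below are hypothetical; the key point is that every member of the sampling frame has an equal chance of being drawn:

```python
import random

# Hypothetical sampling frame: every voter must appear in this list
# so that each one has a chance of being selected.
population = [f"voter_{i}" for i in range(10_000)]

random.seed(42)  # fixed seed for reproducibility

# Simple random sample of 500 voters, drawn without replacement
sample = random.sample(population, k=500)

print(len(sample))       # 500
print(len(set(sample)))  # 500, confirming no voter was drawn twice
```

Note that `random.sample` draws without replacement, so no voter can appear twice, which matches the lottery analogy above.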

Random assignment of treatments

The random assignment of treatments refers to two motivators:

  • The first motivator is to gain an understanding of specific input variables and their influence on the response – for example, understanding whether assigning treatment A to a specific individual may produce more favorable outcomes than a placebo.
  • The second motivator is to remove the impact of external variables on the outcomes of a study. These external variables, called confounding variables (or confounders), are important to remove because they often prove difficult to control; they may take unpredictable values or even be unknown to the researcher. If confounders are left uncontrolled, the outcomes of a study may not be replicable, which can be costly. Confounders can influence not only the outcomes but also the input variables, as well as the relationships between those variables.

Referring back to the example in the earlier section, Population versus sample, consider a farmer who decides to start using pesticides on his crops and wants to test two different brands. The farmer knows there are three distinct areas of the land: plot A, plot B, and plot C. To determine the success of the pesticides and prevent damage to the crops, the farmer randomly chooses 60 plants from each plot for testing (this is called stratified random sampling, because random sampling is performed separately within each plot, or stratum). This selection is representative of the population of plants. From this selection, the farmer labels his plants (the labeling itself doesn't need to be random). For each plot, the farmer shuffles the labels in a bag to randomize them and then draws them out: the first 30 plants drawn receive one of the two treatments, and the remaining 30 receive the other. This is random assignment of treatment. Assuming the three separate plots represent distinct sets of confounding influences on crop yield, the farmer will have enough information to draw an inference about the crop yield for each pesticide brand.
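The farmer's procedure can be sketched in Python. The plot sizes, plant labels, and brand names below are hypothetical; the sketch shows both stages, stratified random sampling (60 plants per plot) followed by random assignment of treatment (30 plants per brand within each plot):

```python
import random

random.seed(7)  # fixed seed for reproducibility

# Hypothetical strata: three plots, each with 200 labeled plants
plots = {plot: [f"{plot}-plant_{i}" for i in range(200)]
         for plot in ("A", "B", "C")}

assignments = {}
for plot, plants in plots.items():
    # Stage 1: stratified random sampling, 60 plants from this stratum
    chosen = random.sample(plants, k=60)

    # Stage 2: random assignment of treatment, like shuffling labels in a bag
    random.shuffle(chosen)
    for plant in chosen[:30]:
        assignments[plant] = "brand_1"  # first 30 drawn get one pesticide
    for plant in chosen[30:]:
        assignments[plant] = "brand_2"  # remaining 30 get the other

print(len(assignments))  # 180 plants assigned in total
```

Because both the sampling and the assignment use a chance mechanism, the resulting comparison of yields between brands supports causal inference within each plot.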

Observational study

The other type of statistical study often performed is an observational study, in which the researcher seeks to learn by observing data that already exists. An observational study can aid in understanding input variables and their relationships to both the target and each other, but it cannot provide the cause-and-effect understanding that a randomized experiment can. An observational study may have one of the two components of a randomized experiment, either random sampling or random assignment of treatment, but without both components it will not directly yield inference. There are many reasons why an observational study may be performed instead of a randomized experiment, such as the following:

  • A randomized experiment being too costly
  • Ethical constraints for an experiment (for example, an experiment to determine the rate of birth defects caused by smoking while pregnant)
  • Using data from prior randomized experiments, which thus removes the need for another experiment

One method for deriving some causality from an observational study is to perform random sampling and repeated analysis. Repeated random sampling and analysis can help minimize the impact of confounding variables over time. This concept plays a huge role in the usefulness of big data and machine learning, which have gained enormous importance across many industries this century. While almost any tool that can be used for observational analysis can also be used for a randomized experiment, this book focuses primarily on tools for observational analysis, as these are more common in most industries.

It can be said that statistics is a science for making the best decisions in the presence of quantifiable uncertainty. All statistical tests contain a null hypothesis and an alternative hypothesis: the assumption that there is no statistically significant difference in the data (the null hypothesis) and the assumption that there is a statistically significant difference (the alternative hypothesis). The term statistically significant difference implies the existence of a benchmark, or threshold, beyond which a measurement indicates significance. This benchmark is called the critical value.

The measure that is compared against this critical value is called the test statistic. The critical value is a fixed value quantified from the behavior of the data, such as its average and variation, and from the hypothesis being tested. If there are two possible routes by which the null hypothesis may be rejected (for example, we believe some output is either less than or greater than the average), there will be two critical values; this is called a two-tailed hypothesis test. If there is only one argument against the null hypothesis, there will be only one critical value; this is called a one-tailed hypothesis test. Regardless of the number of critical values, there will always be only one test statistic measurement for each group within a given hypothesis test. If the test statistic exceeds the critical value, there is statistically significant support for rejecting the null hypothesis and concluding that there is a statistically significant difference in the data.
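The distinction between one and two critical values can be sketched with SciPy, assuming a t-distribution; the significance level and degrees of freedom below are hypothetical choices for illustration:

```python
from scipy import stats

alpha = 0.05  # hypothetical level of significance
df = 29       # degrees of freedom for a hypothetical sample of 30

# Two-tailed test: alpha is split across both tails,
# producing two critical values, one in each tail
two_tailed = stats.t.ppf([alpha / 2, 1 - alpha / 2], df)

# One-tailed test: all of alpha sits in a single tail,
# producing one critical value
one_tailed = stats.t.ppf(1 - alpha, df)

print(two_tailed)  # a symmetric pair, roughly -2.045 and +2.045
print(one_tailed)  # roughly 1.699
```

Notice that the one-tailed critical value is smaller in magnitude than the two-tailed pair: concentrating alpha in a single tail makes that tail's threshold easier to exceed.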

It is useful to understand that a hypothesis test can test the following:

  • One variable against another (such as in a t-test)
  • Multiple variables against one variable (for example, linear regression)
  • Multiple variables against multiple variables (for example, MANOVA)

In the following figure, we can see visually the relationship between the test statistic and critical values in a two-tailed hypothesis test.

Figure 1.3 – Critical values versus a test statistic in a two-tailed hypothesis test


Based on the figure, we now have a visual idea of how a test statistic exceeding the critical value suggests rejecting the null hypothesis.

One concern with relying only on comparing test statistics against critical values, however, is that a test statistic can be impractically large. This is likely to occur when a wide range of results would not be considered to fall within the bounds of a treatment effect, and it remains unclear how likely a result as extreme as, or more extreme than, the observed test statistic actually is. To prevent misleadingly rejecting the null hypothesis, a p-value is used. The p-value represents the probability that chance alone produced a value as extreme as the one observed (the one that suggests rejecting the null hypothesis). If the p-value is low relative to the level of significance, the null hypothesis can be rejected. Common levels of significance are 0.01, 0.05, and 0.10. Before making a decision on a hypothesis, it is beneficial to assess both the test statistic's relationship to the critical value and the p-value. More will be discussed in Chapter 3, Hypothesis Testing.
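The decision rule described above can be sketched with a two-sample t-test in SciPy. The two groups below are simulated, hypothetical data drawn from the same distribution, so the null hypothesis of no difference in means is true by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical groups simulated from the same distribution,
# so any apparent difference in their means is due to chance alone
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=50, scale=5, size=30)

# Two-sample t-test; ttest_ind is two-tailed by default
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # hypothetical level of significance
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Because the p-value already accounts for how extreme the test statistic is under the null hypothesis, comparing it against the level of significance guards against the misleading rejections described above.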
