You're reading from Building Statistical Models in Python

Product typeBook

Published inAug 2023

Reading LevelIntermediate

PublisherPackt

ISBN-139781804614280

Edition1st Edition

Languages

Python

Concepts

Statistics

Authors (3):

Huy Hoang Nguyen

Paul N Adams

Stuart J Miller

View More author details

Population versus sample

In general, the goal of statistical modeling is to answer a question about a group by making an inference about that group. The group we are making an inference on could be machines in a production factory, people voting in an election, or plants on different plots of land. The entire group, every individual item or entity, is referred to as the population. In most cases, the population of interest is so large that it is not practical or even possible to collect data on every entity in the population. For instance, using the voting example, it would probably not be possible to poll every person that voted in an election. Even if it was possible to reach all the voters for the election of interest, many voters may not consent to polling, which would prevent collection on the entire population. An additional consideration would be the expense of polling such a large group. These factors make it practically impossible to collect population statistics in our example of vote polling. These types of prohibitive factors exist in many cases where we may want to assess a population-level attribute. Fortunately, we do not need to collect data on the entire population of interest. Inferences about a population can be made using a subset of the population. This subset of the population is called a sample. This is the main idea of statistical modeling. A model will be created using a sample and inferences will be made about the population.

In order to make valid inferences about the population of interest using a sample, the sample must be representative of the population of interest, meaning that the sample should contain the variation found in the population. For example, if we were interested in making an inference about plants in a field, it is unlikely that samples from one corner of the field would be sufficient for inferences about the larger population. There would likely be variations in plant characteristics over the entire field. We could think of various reasons why there might be variation. For this example, we will consider some examples from Figure 1.2.

Figure 1.2 – Field of plants

The figure shows that Sample A is near a forest. This sample area may be affected by the presence of the forest; for example, some of the plants in that sample may receive less sunlight than plants in the other sample. Sample B is shown to be in between the main irrigation lines. It’s conceivable that this sample receives more water on average than the other two samples, which may have an effect on the plants in this sample. The final Sample C is near a road. This sample may see other effects that are not seen in Sample A or B.

If samples were only taken from one of those sections, the inferences from those samples would be biased and would not provide valid references about the population. Thus, samples would need to be taken from across the entire field to create a sample that is more likely to be representative of the population of plants. When taking samples from populations, it is critical to ensure the sampling method is robust to possible issues, such as the influence of irrigation and shade in the previous example. Whenever taking a sample from a population, it’s important to identify and mitigate possible influences of bias because biases in data will affect your model and skew your conclusions.

In the next section, various methods for sampling from a dataset will be discussed. An additional consideration is the sample size. The sample size impacts the type of statistical tools we can use, the distributional assumptions that can be made about the sample, and the confidence of inferences and predictions. The impact of sample size will be explored in depth in Chapter 2, Distributions of Data and Chapter 3, Hypothesis Testing.

You have been reading a chapter from

Building Statistical Models in Python

Published in: Aug 2023Publisher: PacktISBN-13: 9781804614280

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Authors (3)

Huy Hoang Nguyen

Huy Hoang Nguyen is a Mathematician and a Data Scientist with far-ranging experience, championing advanced mathematics and strategic leadership, and applied machine learning research. He holds a Master's in Data Science and a PhD in Mathematics. His previous work was related to Partial Differential Equations, Functional Analysis and their applications in Fluid Mechanics. He transitioned from academia to the healthcare industry and has performed different Data Science projects from traditional Machine Learning to Deep Learning.
Read more about Huy Hoang Nguyen

Paul N Adams

Paul Adams is a Data Scientist with a background primarily in the healthcare industry. Paul applies statistics and machine learning in multiple areas of industry, focusing on projects in process engineering, process improvement, metrics and business rules development, anomaly detection, forecasting, clustering and classification. Paul holds a Master of Science in Data Science from Southern Methodist University.
Read more about Paul N Adams

Stuart J Miller

Stuart Miller is a Machine Learning Engineer with degrees in Data Science, Electrical Engineering, and Engineering Physics. Stuart has worked at several Fortune 500 companies, including Texas Instruments and StateFarm, where he built software that utilized statistical and machine learning techniques. Stuart is currently an engineer at Toyota Connected helping to build a more modern cockpit experience for drivers using machine learning.
Read more about Stuart J Miller

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages