# Sampling and Generalization

In this chapter, we will describe the concept of populations and sampling from populations, including some common strategies for sampling. The discussion of sampling will lead to a section that will describe generalization. Generalization will be discussed as it relates to using samples to make conclusions about their respective populations. When modeling for statistical inference, it is necessary to ensure that samples can be generalized to populations. We will provide an in-depth overview of this bridge through the subjects in this chapter.

We will cover the following main topics:

- Software and environment setup
- Population versus sample
- Population inference from samples
- Sampling strategies – random, systematic, stratified, and clustering

# Software and environment setup

**Python** is one of the most popular programming languages for data science and machine learning thanks to the large open source community that has driven the development of its libraries. Python’s ease of use and flexible nature make it a prime candidate in the data science world, where experimentation and iteration are key features of the development cycle. While there are new languages in development for data science applications, such as **Julia**, Python currently remains the key language for data science due to its wide breadth of open source projects, supporting applications from statistical modeling to deep learning. We have chosen to use Python in this book due to its positioning as an important language for data science and its demand in the job market.

Python is available for all major operating systems: Microsoft Windows, macOS, and Linux. Additionally, the installer and documentation can be found at the official website: https://www.python.org/.

This book is written for Python version 3.8 (or higher). We recommend using the most recent version of Python available to you. The code in this book is unlikely to be compatible with Python 2.7, and most actively maintained libraries have dropped support for Python 2.7 since its official support ended in 2020.

The libraries used in this book can be installed with the Python package manager, `pip`, which is included with contemporary versions of Python. More information about `pip` can be found here: https://docs.python.org/3/installing/index.html. Once `pip` is available, packages can be installed from the command line. Here is basic usage at a glance:

Install a new package using the latest version:

```shell
pip install SomePackage
```

Install a specific version of the package (version `2.1` in this example):

```shell
pip install SomePackage==2.1
```

Upgrade a package that is already installed with the `--upgrade` flag:

```shell
pip install SomePackage --upgrade
```

In general, it is recommended to use a separate Python virtual environment for each project and to keep project dependencies separate from system directories. Python provides a virtual environment utility, `venv`, which is part of the standard library in contemporary versions of Python. Virtual environments allow you to create isolated Python environments, each with its own interpreter and its own set of installed dependencies. Using virtual environments can prevent package version issues and conflicts when working on multiple Python projects. Details on setting up and using virtual environments can be found here: https://docs.python.org/3/library/venv.html.

While we recommend the use of Python and Python’s virtual environments for environment setups, a highly recommended alternative is **Anaconda**. Anaconda is a free (enterprise-ready) analytics-focused distribution of Python by Anaconda Inc. (previously Continuum Analytics). Anaconda distributions come with many of the core data science packages, common IDEs (such as **Jupyter** and **Visual Studio Code**), and a graphical user interface for managing environments. Anaconda can be installed using the installer found at the Anaconda website here: https://www.anaconda.com/products/distribution.

Anaconda comes with its own package manager, `conda`, which can be used to install new packages similarly to `pip`.

Install a new package using the latest version:

```shell
conda install SomePackage
```

Upgrade a package that is already installed:

```shell
conda upgrade SomePackage
```

Throughout this book, we will make use of several core libraries in the Python data science ecosystem, such as `NumPy` for array manipulations, `pandas` for higher-level data manipulations, and `matplotlib` for data visualization. The package versions used for this book are contained in the following list. Please ensure that the versions installed in your environment are equal to or greater than the versions listed. This will help ensure that the code examples run correctly:

- `statsmodels 0.13.2`
- `Matplotlib 3.5.2`
- `NumPy 1.23.0`
- `SciPy 1.8.1`
- `scikit-learn 1.1.1`
- `pandas 1.4.3`

The package versions used for the code in this book are shown in *Figure 1.1*. The `__version__` attribute of each package can be used to print the package version in code.

Figure 1.1 – Package versions used in this book
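A listing like the one in Figure 1.1 could be produced with a short script such as the following. This is a minimal sketch and not necessarily the code behind the figure: it uses `importlib.metadata` from the standard library (available from Python 3.8) instead of each package's `__version__` attribute, so a package that is missing is reported rather than raising an import error. The distribution names are assumed to match the names used with `pip`.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used with pip (assumed to match the list above).
packages = ["statsmodels", "matplotlib", "numpy", "scipy", "scikit-learn", "pandas"]

installed = {}
for name in packages:
    try:
        installed[name] = version(name)
    except PackageNotFoundError:
        installed[name] = None  # package is not installed in this environment

for name, ver in installed.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Comparing the printed versions against the list above is a quick way to confirm that an environment meets the minimum versions.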

Having set up the technical environment for the book, let’s get into the statistics. In the next sections, we will discuss the concepts of population and sampling. We will demonstrate sampling strategies with code implementations.

# Population versus sample

In general, the goal of statistical modeling is to answer a question about a group by making an inference about that group. The group we are making an inference on could be machines in a production factory, people voting in an election, or plants on different plots of land. The entire group, every individual item or entity, is referred to as the **population**. In most cases, the population of interest is so large that it is not practical or even possible to collect data on every entity in the population. For instance, using the voting example, it would probably not be possible to poll every person that voted in an election. Even if it was possible to reach all the voters for the election of interest, many voters may not consent to polling, which would prevent collection on the entire population. An additional consideration would be the expense of polling such a large group. These factors make it practically impossible to collect population statistics in our example of vote polling. These types of prohibitive factors exist in many cases where we may want to assess a population-level attribute. Fortunately, we do not need to collect data on the entire population of interest. Inferences about a population can be made using a subset of the population. This subset of the population is called a **sample**. This is the main idea of statistical modeling. A model will be created using a sample and inferences will be made about the population.

In order to make valid inferences about the population of interest using a sample, the sample must be *representative* of the population of interest, meaning that the sample should contain the variation found in the population. For example, if we were interested in making an inference about plants in a field, it is unlikely that samples from one corner of the field would be sufficient for inferences about the larger population. There would likely be variations in plant characteristics over the entire field. We could think of various reasons why there might be variation. For this example, we will consider the sampling areas shown in *Figure 1.2*.

Figure 1.2 – Field of plants

The figure shows that **Sample A** is near a forest. This sample area may be affected by the presence of the forest; for example, some of the plants in that sample may receive less sunlight than plants in the other samples. **Sample B** is shown to be in between the main irrigation lines. It’s conceivable that this sample receives more water on average than the other two samples, which may have an effect on the plants in this sample. The final sample, **Sample C**, is near a road. This sample may see other effects that are not seen in **Sample A** or **B**.

If samples were only taken from one of those sections, the inferences from those samples would be *biased* and would not provide valid inferences about the population. Thus, samples would need to be taken from across the entire field to create a sample that is more likely to be representative of the population of plants. When taking samples from populations, it is critical to ensure the sampling method is robust to possible issues, such as the influence of irrigation and shade in the previous example. Whenever taking a sample from a population, it’s important to identify and mitigate possible sources of bias because biases in data will affect your model and skew your conclusions.
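The effect of sampling from a single section can be illustrated with a small simulation. The following is a sketch under invented assumptions: the plant heights, the per-area averages, and the sample sizes are hypothetical and chosen only to make the contrast visible. It uses the standard library so that it runs without additional packages.

```python
import random
import statistics

rng = random.Random(0)  # seeded for reproducibility

# Hypothetical plant heights (cm); each area has a different typical height.
area_means = {"A": 40.0, "B": 60.0, "C": 50.0}
field = {
    area: [rng.gauss(mean, 2.0) for _ in range(1000)]
    for area, mean in area_means.items()
}
all_plants = [h for heights in field.values() for h in heights]
population_mean = statistics.mean(all_plants)

# A sample taken from one area versus a sample drawn across the whole field.
sample_from_a = rng.sample(field["A"], 100)
sample_from_field = rng.sample(all_plants, 100)

error_single_area = abs(statistics.mean(sample_from_a) - population_mean)
error_whole_field = abs(statistics.mean(sample_from_field) - population_mean)

print(f"population mean:    {population_mean:.1f}")
print(f"error, area A only: {error_single_area:.1f}")
print(f"error, whole field: {error_whole_field:.1f}")
```

Under these assumptions, the estimate based only on area A misses the population mean by roughly the gap between area A and the field average, while the sample drawn across the field lands much closer.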

In the next section, various methods for sampling from a dataset will be discussed. An additional consideration is the sample size. The sample size impacts the type of statistical tools we can use, the distributional assumptions that can be made about the sample, and the confidence of inferences and predictions. The impact of sample size will be explored in depth in *Chapter 2, Distributions of Data*, and *Chapter 3, Hypothesis Testing*.

# Population inference from samples

When using a statistical model to make inferential conclusions about a population from a sample subset of that population, the study design must account for similar degrees of uncertainty in its variables as those in the population. This is the variation mentioned earlier in this chapter. To appropriately draw inferential conclusions about a population, any statistical model must be structured around a chance mechanism. Studies structured around these chance mechanisms are called **randomized experiments** and provide an understanding of both correlation and causation.

## Randomized experiments

There are two primary characteristics of a randomized experiment:

- Random sampling, colloquially referred to as random selection
- Random assignment of treatments, which is the nature of the study

### Random sampling

Random sampling (also called random selection) is designed with the intent of creating a sample representative of the overall population so that statistical models generalize to the population well enough to support cause-and-effect conclusions. In order for random sampling to be successful, the population of interest must be well defined. All members of the population must have a chance of being selected. In considering the example of polling voters, all voters must be willing to be polled. Once all voters are entered into a lottery, random sampling can be used to subset voters for modeling. Sampling from only voters who are willing to be polled introduces sampling bias into statistical modeling, which can lead to skewed results. The sampling method in the scenario where only some voters are willing to participate is called **self-selection**. Any information obtained and modeled from self-selected samples – or any non-random samples – cannot be used for inference.
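To make the effect of self-selection concrete, the following sketch compares a random sample of all voters with a sample drawn only from voters who are willing to be polled. The electorate is entirely hypothetical: the counts are invented, and it is assumed that supporters of a candidate are more willing to respond than non-supporters.

```python
import random

rng = random.Random(1)  # seeded for reproducibility

# Hypothetical electorate: 1 = supports the candidate, 0 = does not.
# Supporters are assumed to be more willing to answer a poll.
willing = [1] * 4000 + [0] * 1500    # voters who agree to be polled
unwilling = [1] * 1000 + [0] * 3500  # voters who decline
population = willing + unwilling

true_support = sum(population) / len(population)  # 5000 / 10000 = 0.5

random_sample = rng.sample(population, 500)  # every voter can be drawn
self_selected = rng.sample(willing, 500)     # only willing voters can be drawn

random_estimate = sum(random_sample) / len(random_sample)
self_selected_estimate = sum(self_selected) / len(self_selected)

print(f"true support:           {true_support:.2f}")
print(f"random sample estimate: {random_estimate:.2f}")
print(f"self-selected estimate: {self_selected_estimate:.2f}")
```

Because willingness to respond is tied to the quantity being measured in this setup, the self-selected estimate drifts toward the support level among willing voters (4000 of 5500) rather than the true level of 0.5.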

### Random assignment of treatments

The random assignment of treatments refers to two motivators:

- The first motivator is to gain an understanding of specific input variables and their influence on the response – for example, understanding whether assigning treatment A to a specific individual may produce more favorable outcomes than a placebo.
- The second motivator is to remove the impact of external variables on the outcomes of a study. These external variables, called **confounding variables** (or **confounders**), are important to remove as they often prove difficult to control. They may have unpredictable values or even be unknown to the researcher. The consequence of including confounders is that the outcomes of a study may not be replicable, which can be costly. While confounders can influence outcomes, they can also influence input variables, as well as the relationships between those variables.

Referring back to the example in the earlier section, *Population versus sample*, consider a farmer who decides to start using pesticides on his crops and wants to test two different brands. The farmer knows there are three distinct areas of the land: plot A, plot B, and plot C. To determine the success of the pesticides and prevent damage to the crops, the farmer randomly chooses 60 plants from each plot (this is called **stratified random sampling**, where random sampling is stratified across each plot) for testing. This selection is representative of the population of plants. From this selection, the farmer labels his plants (labeling doesn’t need to be random). For each plot, the farmer shuffles the labels in a bag to randomize them and draws 30 labels. The 30 plants drawn receive one of the two treatments and the remaining 30 receive the other treatment. This is a *random assignment of treatment*. Assuming the three separate plots represent a distinct set of confounding variables on crop yield, the farmer will have enough information to obtain an inference about the crop yield for each pesticide brand.
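The farmer's procedure can be sketched in a few lines. This is an illustrative sketch only: the plant identifiers, the number of plants per plot (200), and the treatment names are hypothetical, while the 60 plants per plot and the 30/30 split follow the description above.

```python
import random

rng = random.Random(42)  # seeded for reproducibility

# Hypothetical plant IDs: 200 plants in each of the three plots.
plots = {name: [f"{name}-{i}" for i in range(200)] for name in ("A", "B", "C")}

assignment = {}
for name, plants in plots.items():
    # Stratified random sampling: draw 60 plants from this plot.
    selected = rng.sample(plants, 60)
    # Random assignment: shuffle the selected plants, then split them 30/30.
    rng.shuffle(selected)
    assignment[name] = {
        "pesticide_1": selected[:30],
        "pesticide_2": selected[30:],
    }

for name, groups in assignment.items():
    print(name, {treatment: len(ids) for treatment, ids in groups.items()})
```

Keeping the selection step and the assignment step separate mirrors the two components of a randomized experiment: which plants enter the study, and which treatment each of them receives.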

## Observational study

The other type of statistical study often performed is an **observational study**, in which the researcher seeks to learn through observing data that already exists. An observational study can aid in the understanding of input variables and their relationships to both the target and each other, but cannot provide cause-and-effect understanding as a randomized experiment can. An observational study may have one of the two components of a randomized experiment – either random sampling or random assignment of treatment – but without both components, will not directly yield inference. There are many reasons why an observational study may be performed versus a randomized experiment, such as the following:

- A randomized experiment being too costly
- Ethical constraints for an experiment (for example, an experiment to determine the rate of birth defects caused by smoking while pregnant)
- Using data from prior randomized experiments, which thus removes the need for another experiment

One method for deriving some causality from an observational study is to perform random sampling and repeated analysis. Repeated random sampling and analysis can help minimize the impact of confounding variables over time. This concept plays a huge role in the usefulness of *big data* and *machine learning*, which has gained a lot of importance in many industries within this century. While almost any tool that can be used for observational analysis can also be used for a randomized experiment, this book focuses primarily on tools for observational analysis, as this is more common in most industries.

It can be said that statistics is a science for helping make the best decisions when there are quantifiable uncertainties. All statistical tests contain a null hypothesis and an alternative hypothesis. That is to say, an assumption that there is no statistically significant difference between data (the null hypothesis) or that there is a statistically significant difference between data (the alternative hypothesis). The term statistically significant difference implies the existence of a benchmark – or threshold – beyond which a measure takes place and indicates significance. This benchmark is called the **critical value**.

The measure that is applied against this critical value is called the **test statistic**. The critical value is a static value quantified based on behavior in the data, such as the average and variation, and is based on the hypothesis. If there are two possible routes by which a null hypothesis may be rejected – for example, we believe some output is either less than or more than the average – there will be two critical values (this test is called a **two-tailed** hypothesis test), but if there is only one argument against the null hypothesis, there will be only one critical value (this is called a **one-tailed** hypothesis test). Regardless of the number of critical values, there will always only be one test statistic measurement for each group within a given hypothesis test. If the test statistic exceeds the critical value, there is a statistically significant reason to support rejecting the null hypothesis and concluding there is a statistically significant difference in the data.

It is useful to understand that a hypothesis test can test the following:

- One variable against another (such as in a t-test)
- Multiple variables against one variable (for example, linear regression)
- Multiple variables against multiple variables (for example, MANOVA)

In the following figure, we can see visually the relationship between the test statistic and critical values in a two-tailed hypothesis test.

Figure 1.3 – Critical values versus a test statistic in a two-tailed hypothesis test

Based on the figure, we now have a visual idea of how a test statistic exceeding the critical value suggests rejecting the null hypothesis.

One concern with relying only on comparing test statistics against critical values, however, is that test statistics can be impractically large. This is likely to occur when there is a wide range of results that are not considered to fall within the bounds of a treatment effect. In such cases, it is unclear how likely a result as extreme as, or more extreme than, the test statistic actually is. To prevent misleadingly rejecting the null hypothesis, a **p-value** is used. The p-value represents the probability of obtaining, by chance alone, a value at least as extreme as the one observed, assuming the null hypothesis is true. If the p-value is lower than the level of significance, the null hypothesis can be rejected. Common levels of significance are 0.01, 0.05, and 0.10. Before making a decision on a hypothesis, it is beneficial to assess both the test statistic’s relationship to the critical value and the p-value. More will be discussed in *Chapter 3, Hypothesis Testing*.
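The relationship between the test statistic, the critical values, and the p-value can be sketched for one simple case, a two-tailed one-sample z-test. All input numbers below are hypothetical, and the z-test itself is chosen here only because it can be computed with the standard library's `statistics.NormalDist`; it is one of many possible tests.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs for a two-tailed one-sample z-test.
hypothesized_mean = 50.0  # value stated by the null hypothesis
sample_mean = 52.5
population_sd = 10.0      # assumed known
n = 100
alpha = 0.05              # level of significance

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Test statistic: distance of the sample mean from the null value, in standard errors.
z = (sample_mean - hypothesized_mean) / (population_sd / sqrt(n))

# Two-tailed test: one critical value in each tail.
upper_critical = standard_normal.inv_cdf(1 - alpha / 2)
lower_critical = -upper_critical

# p-value: probability, under the null hypothesis, of a statistic at least this extreme.
p_value = 2 * (1 - standard_normal.cdf(abs(z)))

reject_null = p_value < alpha
print(f"z = {z:.2f}, critical values = ±{upper_critical:.2f}, p = {p_value:.4f}")
```

With these inputs, the test statistic of 2.5 lies beyond the upper critical value of about 1.96 and the p-value is about 0.012, so both criteria point toward rejecting the null hypothesis at the 0.05 level.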

# Sampling strategies – random, systematic, stratified, and clustering

In this section, we will discuss the different sampling methods used in research. Broadly speaking, in the real world, it is not easy or possible to get the whole population data for many reasons. For instance, the costs of gathering data are expensive in terms of money and time. Collecting all the data is impractical in many cases and ethical issues are also considered. Taking samples from the population can help us overcome these problems and is a more efficient way to collect data. By collecting an appropriate sample for a study, we can draw statistical conclusions or statistical inferences about the population properties. Inferential statistical analysis is a fundamental aspect of statistical thinking. Different sampling methods from probability strategies to non-probability strategies used in research and industry will be discussed in this section.

There are essentially two types of sampling methods:

- Probability sampling
- Non-probability sampling

## Probability sampling

In *probability sampling*, a sample is chosen from a population based on the theory of probability, or it is chosen randomly using random selection. In *random selection*, the chance of each member in a population being selected is equal. For example, consider a game with 10 similar pieces of paper. We write numbers 1 through 10, with a separate piece of paper for each number. The numbers are then shuffled in a box. The game requires picking three of these ten pieces of paper randomly. Because the pieces of paper have been prepared using the same process, the chance of any piece of paper being selected (or the numbers one through ten) is equal for each piece. Collectively, the 10 pieces of paper are considered a population and the 3 selected pieces of paper constitute a random sample. This example is one approach to the probability sampling methods we will discuss in this chapter.

Figure 1.4 – A random sampling example

We can implement the sampling method described before (and shown in *Figure 1.4*) with `numpy`. We will use the `choice` method to select three samples from the given population. Notice that `replace=False` is used in the call to `choice`. This means that once a sample is chosen, it will not be considered again. Note that a seeded random generator is used in the following code for reproducibility:

```python
import numpy as np

# set up generator for reproducibility
random_generator = np.random.default_rng(2020)
population = np.arange(1, 10 + 1)
sample = random_generator.choice(
    population,     # sample from population
    size=3,         # number of samples to take
    replace=False   # only allow to sample individuals once
)
print(sample)
# array([1, 8, 5])
```

The purpose of random selection is to avoid a biased result when some units of a population have a lower or higher probability of being chosen in a sample than others. Nowadays, a random selection process can be done by using computer randomization programs.

Four main types of the probability sampling methods that will be discussed here are as follows:

- Simple random sampling
- Systematic sampling
- Stratified sampling
- Cluster sampling

Let’s look at each one of them.

### Simple random sampling

First, simple random sampling is a method to select a sample randomly from a population. Every member of the population has an equal chance of being chosen through an unbiased selection method. This method is used when all members of a population have similar properties related to important variables (important features), and it is the most direct approach to probability sampling. The advantages of this method are that it minimizes bias and maximizes representativeness. However, it has some limitations. When the population is very large, simple random sampling can be costly and time-consuming. Sampling error also needs to be considered: a random sample can, by chance, fail to be representative of the population, in which case the sampling process must be repeated. In addition, not every member of a population is willing to participate in a study, which makes it a challenge to obtain information representative of a large population. The previous example of choosing 3 pieces of paper from 10 pieces of paper is a simple random sample.

### Systematic sampling

Here, members of a population are selected from a random starting point with a fixed sampling interval. We first choose a fixed sampling interval by dividing the number of members in the population by the number of members required in the sample. Then, a random starting point is selected, commonly between one and the sampling interval. Finally, we choose subsequent members by repeatedly adding the sampling interval until enough samples have been collected. This method is faster than, and preferable to, simple random sampling when cost and time are the main factors to be considered in the study. On the other hand, while in simple random sampling each member of a population has an equal chance of being selected, in systematic sampling, a sampling interval rule is used to choose members from a population. It can be said that systematic sampling is less random than simple random sampling. As in simple random sampling, it is assumed that the members of the population have similar properties related to important variables/features.

Let us discuss how we perform systematic sampling through the following example. In a class at one high school in Dallas, there are 50 students but only 10 books to give to these students. The sampling interval is fixed by dividing the number of students in the class by the number of books (50/10 = 5). In this example, we generate the random starting point between 1 and 50 and wrap around to the beginning of the list once we pass the last student. For example, take the number 18. Hence, the 10 students selected to get the books will be as follows:

18, 23, 28, 33, 38, 43, 48, 3, 8, 13
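The selection above can be reproduced with a short function. This is a minimal sketch: the function name and parameters are illustrative, positions are 1-based, and the wrap-around past the last student follows the example. In practice, the starting point would be drawn at random, for instance with `random.randint(1, 50)`; it is fixed at 18 here to match the text.

```python
def systematic_sample(population_size, sample_size, start):
    """Return 1-based positions chosen with a fixed interval, wrapping past the end."""
    interval = population_size // sample_size
    return [
        (start - 1 + i * interval) % population_size + 1
        for i in range(sample_size)
    ]


students = systematic_sample(population_size=50, sample_size=10, start=18)
print(students)  # [18, 23, 28, 33, 38, 43, 48, 3, 8, 13]
```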

A natural question is what to do when the sampling interval is a fraction. For example, if we have 13 books, then the sampling interval will be 50/13 ≈ 3.846. However, we cannot use this fractional number as a sampling interval, since the interval must correspond to a whole number of students. In this situation, we could alternate between 3 and 4 as the sampling interval (we could also choose either 3 or 4 as a fixed sampling interval). Let us assume that the random starting point generated is 17 and that we alternate between intervals of 3 and 4. Then, the 13 selected students are these:

17, 20, 24, 27, 31, 34, 38, 41, 45, 48, 2, 5, 9

Observing the preceding series of numbers, after reaching the number 48, since adding 4 will produce a number greater than the count of students (50 students), the sequence restarts at 2 (48 + 4 = 52, but since 50 is the maximum, we restart at 2). Therefore, the last three numbers in the sequence are 2, 5, and 9, with the sampling intervals 4, 3, and 4, respectively (passing the number 50 and back to the number 1 until we have 13 selected students for the systematic sample).
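The alternating-interval sequence can be generated as follows. This is a sketch of one way to do it: the function name is illustrative, the intervals are cycled in the order 3, 4, and positions wrap around after 50 as described above.

```python
from itertools import cycle


def systematic_sample_alternating(population_size, sample_size, start, intervals):
    """Return 1-based positions, cycling through `intervals` and wrapping past the end."""
    positions = [start]
    steps = cycle(intervals)
    while len(positions) < sample_size:
        next_position = (positions[-1] - 1 + next(steps)) % population_size + 1
        positions.append(next_position)
    return positions


students = systematic_sample_alternating(50, 13, start=17, intervals=(3, 4))
print(students)  # [17, 20, 24, 27, 31, 34, 38, 41, 45, 48, 2, 5, 9]
```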

With systematic sampling, there is a risk of bias when the list of members of a population is organized in a pattern that matches the sampling interval. For example, going back to the case of 50 students, suppose researchers want to know how students feel about mathematics classes. If the best students in math correspond to numbers 2, 12, 22, 32, and 42, then the survey could be biased if it is conducted with a random starting point of 2 and a sampling interval of 10.
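The following sketch shows how such an alignment distorts the result. The attitude scores are invented for illustration (95 for the five strongest students, 60 for everyone else); only the positions and the interval come from the example above.

```python
from statistics import mean

# Hypothetical attitude scores for students 1..50: the strongest students
# (positions 2, 12, 22, 32, 42) rate mathematics classes much higher.
scores = {
    position: 95 if position % 10 == 2 else 60
    for position in range(1, 51)
}


def systematic_positions(start, interval, population_size):
    """1-based positions from `start` to the end of the list, `interval` apart."""
    return list(range(start, population_size + 1, interval))


aligned = systematic_positions(start=2, interval=10, population_size=50)
offset = systematic_positions(start=3, interval=10, population_size=50)

population_mean = mean(scores.values())          # 63.5
aligned_mean = mean(scores[p] for p in aligned)  # 95
offset_mean = mean(scores[p] for p in offset)    # 60

print(population_mean, aligned_mean, offset_mean)
```

When the starting point lines up with the pattern in the list, the sample contains only the strongest students and overstates the class average; any other starting point misses them entirely and understates it.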

### Stratified sampling

Stratified sampling is a probability sampling method based on dividing a population into homogeneous subpopulations called **strata**. The strata are defined by distinctly different properties, such as gender, age, color, and so on. These subpopulations must be distinct (non-overlapping) so that every member in each stratum has an equal chance of being selected by using simple random sampling. *Figure 1.5* illustrates how stratified sampling is performed to select samples from two subpopulations (a set of numbers and a set of letters):

Figure 1.5 – A stratified sample example

The following code sample shows how to implement stratified sampling with `numpy` using the example shown in *Figure 1.5*. First, the instances are split into the respective strata: numbers and letters. Then, we use `numpy` to take random samples from each stratum. Like in the previous code example, we utilize the `choice` method to take the random sample, but the sample size for each stratum is based on the total number of instances in each stratum rather than the total number of instances in the entire population; for example, sampling 50% of the numbers and 50% of the letters:

```python
import numpy as np

# set up generator for reproducibility
random_generator = np.random.default_rng(2020)
population = [
    1, "A", 3, 4,
    5, 2, "D", 8,
    "C", 7, 6, "B"
]

# group strata
strata = {
    'number': [],
    'string': [],
}
for item in population:
    if isinstance(item, int):
        strata['number'].append(item)
    else:
        strata['string'].append(item)

# fraction of population to sample
sample_fraction = 0.5

# random sample from strata
sampled_strata = {}
for group in strata:
    sample_size = int(
        sample_fraction * len(strata[group])
    )
    sampled_strata[group] = random_generator.choice(
        strata[group],
        size=sample_size,
        replace=False
    )
print(sampled_strata)
# {'number': array([2, 8, 5, 1]), 'string': array(['D', 'C'], dtype='<U1')}
```

The main advantage of this method is that the key characteristics of the population are better represented in the sample, in proportion to their share of the overall population. This method helps to reduce sample selection bias. On the other hand, when it is not obvious how to classify each member of a population into distinct subpopulations, this method becomes unusable.

### Cluster sampling

Here, a population is divided into different subgroups called clusters. The clusters should have similar characteristics to one another, so that each cluster resembles the population on a smaller scale. Instead of randomly selecting individual members in each cluster, entire clusters are randomly chosen, and each of these clusters has an equal chance of being selected as part of a sample. If clusters are large, then we can conduct **multistage sampling** by using one of the previous sampling methods to select individual members within each selected cluster.

Consider the following cluster sampling example. A local pizzeria plans to expand its business in the neighborhood. The owner wants to know how many people order pizzas from the pizzeria and what the preferred pizzas are. The owner splits the neighborhood into different areas and randomly selects some of these areas to form the cluster sample. A survey is then sent to the clients in the selected areas. Another example is related to multistage cluster sampling. A retail chain conducts a study to assess the performance of each store in the chain. The stores are divided into subgroups based on location, some of these subgroups are randomly selected as clusters, and stores sampled from the selected clusters are used to study the performance of its stores. This method is easy and convenient. However, the sampled clusters are not guaranteed to be representative of the whole population.
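Single-stage and multistage cluster sampling can be sketched as follows for the retail chain example. The store names, the number of locations, and the sample sizes are hypothetical; the point is only that whole clusters are drawn first and, optionally, members are then sampled within the chosen clusters.

```python
import random

rng = random.Random(7)  # seeded for reproducibility

# Hypothetical chain: 6 locations (clusters) with 5 stores each.
clusters = {
    f"location_{c}": [f"store_{c}_{s}" for s in range(5)]
    for c in range(6)
}

# Stage 1: randomly choose whole clusters; each has an equal chance.
chosen_clusters = rng.sample(sorted(clusters), 2)

# Single-stage cluster sample: every store in the chosen clusters.
cluster_sample = [store for c in chosen_clusters for store in clusters[c]]

# Stage 2 (multistage sampling): a simple random sample within each chosen cluster.
multistage_sample = [
    store for c in chosen_clusters for store in rng.sample(clusters[c], 3)
]

print(chosen_clusters)
print(cluster_sample)
print(multistage_sample)
```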

## Non-probability sampling

The other type of sampling method is non-probability sampling, where some or all members of a population do not have an equal chance of being selected as a sample to participate in the study. This method is used when random probability sampling is impossible to conduct, and it is faster and easier to obtain data with it than with probability sampling. One of the reasons to use this method is cost and time considerations. It allows us to collect data easily by using a non-random selection based on convenience or certain criteria. This method carries a higher risk of bias than probability sampling. The method is often used in exploratory and qualitative research. For example, if a group of researchers wants to understand clients’ opinions of a company related to one of its products, they send a survey to the clients who bought and used the product. It is a convenient way to get opinions, but these opinions are only from clients who already used the product. Therefore, the sample data is only representative of one group of clients and cannot be generalized as the opinions of all the clients of the company.

Figure 1.6 – A survey study example

The previous example illustrates one of the two types of non-probability sampling methods that we want to discuss here: **convenience sampling**. In convenience sampling, researchers choose the members of a population who are most accessible to them to form a sample. This method is easy and inexpensive, but generalizing the results obtained to the whole population is questionable.

**Quota sampling** is another type of non-probability sampling where a sample group is selected to be representative of a larger population in a non-random way. For example, recruiters with limited time can use the quota sampling method to search for potential candidates from professional social networks (LinkedIn, Indeed.com, etc.) and interview them. This method is cost-effective and saves time but presents bias during the selection process.
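The recruiter example can be sketched in code to show where the non-random element enters. Everything here is hypothetical: the candidate names, roles, and quotas are invented, and the rule of accepting the first candidates encountered is one simple way quota sampling might be carried out.

```python
# Hypothetical candidates in the order a recruiter happens to find them.
candidates = [
    ("Ana", "engineer"), ("Ben", "engineer"), ("Cy", "analyst"),
    ("Dee", "engineer"), ("Eli", "analyst"), ("Fay", "designer"),
    ("Gus", "analyst"), ("Hal", "designer"), ("Ivy", "engineer"),
]

# Quotas chosen to mirror the assumed makeup of the larger population.
quotas = {"engineer": 2, "analyst": 2, "designer": 1}

selected = []
remaining = dict(quotas)
for name, role in candidates:
    # Non-random selection: accept the first candidates who fit an open quota.
    if remaining.get(role, 0) > 0:
        selected.append((name, role))
        remaining[role] -= 1

print(selected)
```

The sample matches the quotas exactly, but who fills each quota depends entirely on the order in which candidates were found, which is the source of selection bias in this method.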

In this section, we provided an overview of probability and non-probability sampling. Each strategy has advantages and disadvantages, but they help us to minimize risks, such as bias. A well-planned sampling strategy will also help reduce errors in predictive modeling.

# Summary

In this chapter, we discussed installing and setting up the Python environment to run the `statsmodels` API and other requisite open-source packages. We also discussed populations versus samples and the requirements to gain inference from samples. Finally, we explained several different common sampling methods used in statistical and machine learning models.

In the next chapter, we will begin a discussion on statistical distributions and their implications for building statistical models. In *Chapter 3, Hypothesis Testing*, we will begin discussing hypothesis testing in depth, expanding on the concepts discussed in the *Observational study* section of this chapter. We will also discuss power analysis, which is a useful tool for determining the sample size based on existing sample data parameters and the desired levels of statistical significance.