Chapter 5. Using Data To Reason About The World
In Chapter 4, Probability, we mentioned that the mean height of US females is 65 inches. Now pretend we didn't know this fact—how could we find out what the average height is?
We could measure every US female, but that's infeasible; we would run out of money, resources, and time before we even finished with a single small city!
Inferential statistics gives us the power to answer this question using a very small sample of all US women. We can use the sample to tell us something about the population we drew it from. We can use observed data to make inferences about unobserved data. By the end of this chapter, you too will be able to go out and collect a small amount of data and use it to reason about the world!
In the example that spans this entire chapter, we will examine how to estimate the mean height of all US women using only samples. Specifically, we will estimate this population parameter using the sample mean as an estimator.
I am going to use the vector all.us.women to represent the population. For simplicity's sake, let's say there are only 10,000 US women:
> # setting seed will make random number generation reproducible
> set.seed(1)
> all.us.women <- rnorm(10000, mean=65, sd=3.5)
We have just created a vector of 10,000 normally distributed random values with the same parameters as our population of interest using the rnorm function. Of course, at this point, we could just call mean on this vector and call it a day, but that's cheating! We are going to see that we can get remarkably close to the population mean without actually using the entire population.
Now, let's take a random sample of ten from this population using...
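In case you want to follow along outside the book's console transcript, a self-contained sketch of such a draw using base R's sample function (rebuilding the population vector from earlier) might look like this:

```r
# recreate the population from earlier in the chapter
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

# draw a simple random sample of ten heights (without replacement)
our.sample <- sample(all.us.women, 10)

# the mean of this sample is our point estimate of the population mean
mean(our.sample)
```

Every time you draw a different sample, you will get a slightly different estimate; that variability is exactly what the rest of this chapter is about.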
The sampling distribution
So, we have estimated that the true population mean is about 65.2; we know the population mean isn't exactly 65.19704, but by just how much might our estimate be off?
To answer this question, let's take repeated samples from the population again. This time, we're going to take samples of size 40 from the population 10,000 times and plot a frequency distribution of the means:
> means.of.our.samples <- numeric(10000)
> for(i in 1:10000){
+ a.sample <- sample(all.us.women, 40)
+ means.of.our.samples[i] <- mean(a.sample)
+ }
We get the distribution as follows:
Figure 5.3: The sampling distribution of sample means
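If you'd like to reproduce a figure like this yourself, one sketch using base R's hist function (the simulation is rebuilt here so the snippet stands on its own) is:

```r
# rebuild the population and the simulation from above
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)
means.of.our.samples <- numeric(10000)
for(i in 1:10000){
  a.sample <- sample(all.us.women, 40)
  means.of.our.samples[i] <- mean(a.sample)
}

# plot the frequency distribution of the 10,000 sample means
hist(means.of.our.samples, breaks=50,
     main="Sampling distribution of sample means",
     xlab="sample mean height (inches)")
```

The exact look of your histogram may differ from Figure 5.3 depending on the number of breaks, but the bell shape centered near 65 should be unmistakable.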
This frequency distribution is called a sampling distribution. In particular, as we used sample means as the value of interest, this is called the sampling distribution of the sample means (whew!). You can create a sampling distribution of any statistic (median, variance, and so on), but when we refer to sampling distributions throughout...
Again, we care about the standard error (the standard deviation of the sampling distribution of sample means) because it expresses the degree of uncertainty we have in our estimate. For this reason, it's not uncommon for statisticians to report the standard error along with their estimate.
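To convince ourselves that the standard error behaves as advertised, we can compare the standard deviation of the simulated sample means against the textbook formula s divided by the square root of n, computed from just one sample. A self-contained sketch:

```r
# rebuild the population and the simulated sampling distribution
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)
means.of.our.samples <- numeric(10000)
for(i in 1:10000){
  means.of.our.samples[i] <- mean(sample(all.us.women, 40))
}

# empirical standard error: the sd of the 10,000 sample means
sd(means.of.our.samples)

# analytic approximation from a single sample: s / sqrt(n)
a.sample <- sample(all.us.women, 40)
sd(a.sample) / sqrt(length(a.sample))
```

Both numbers should land close to 3.5/sqrt(40), roughly 0.55, which is reassuring: we can estimate the standard error from one sample without ever simulating the sampling distribution.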
What's more common, though, is for statisticians to report a range of numbers to describe their estimates; this is called interval estimation. In contrast, when we were just providing the sample mean as our estimate of the population mean, we were engaging in point estimation.
One common approach to interval estimation is to use confidence intervals. A confidence interval gives us a range over which a significant proportion of the sample means would fall when samples are repeatedly drawn from a population and their means are calculated. Concretely, a 95 percent confidence interval is the range that would contain 95 percent of the sample means if multiple samples were taken from the same...
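As a rough sketch of how such an interval could be computed from a single sample, using the normal critical value qnorm(.975), which is about 1.96:

```r
# rebuild the population and draw one sample of 40
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)
our.sample <- sample(all.us.women, 40)

# estimated standard error from the one sample we have
err <- sd(our.sample) / sqrt(length(our.sample))

# 95 percent confidence interval: mean plus/minus ~1.96 standard errors
ci <- mean(our.sample) + c(-1, 1) * qnorm(.975) * err
ci
```

The result is a lower and upper bound bracketing our point estimate; with a different sample, the interval would shift, but in the long run about 95 percent of such intervals would capture the true population mean.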
Remember when I said that the sampling distribution of sample means is approximately normal for a large enough sample size? This caveat means that for smaller sample sizes (usually considered to be below 30), the sampling distribution of the sample means is not well approximated by a normal distribution. It is, however, well approximated by another distribution: the t-distribution.
Note
A bit of history...
The t-distribution is also known as the Student's t-distribution. It gets its name from the 1908 paper that introduced it, by William Sealy Gosset writing under the pen name Student. Gosset worked as a statistician at the Guinness Brewery and used the t-distribution and the related t-test to study the quality of the beer's raw constituents from small samples. He is thought to have used a pen name at the request of Guinness so that competitors wouldn't know that they were using the t statistic to their advantage.
The t-distribution has two parameters, the mean and the degrees...
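One quick way to see how the t-distribution differs from the normal is to compare their critical values using base R's qt and qnorm functions:

```r
# critical values for a 95 percent interval
qnorm(.975)        # normal: about 1.96
qt(.975, df=9)     # t with 9 degrees of freedom: about 2.26
qt(.975, df=39)    # t with 39 degrees of freedom: about 2.02
```

The t-distribution has fatter tails than the normal, so its critical values are larger; for a small sample, this widens the confidence interval to reflect the extra uncertainty. As the degrees of freedom grow, the t critical value shrinks toward the normal's 1.96.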
Practice the following exercises to revise the concepts learned in this chapter:
- Write a function that takes a vector and returns the 95 percent confidence interval for that vector. You can return the interval as a vector of length two: the lower bound and the upper bound. Then, parameterize the confidence coefficient by letting the user of your function choose their own confidence level, but keep 95 percent as the default. Hint: the first line will look like this:
conf.int <- function(data.vector, conf.coeff=.95){
- Back when we introduced the central limit theorem, I said that the sampling distribution of sample means from any distribution would be approximately normal. Don't take my word for it! Create a population that is uniformly distributed using the runif function and plot a histogram of the sampling distribution using the code from this chapter and the histogram-plotting code from Chapter 2, The Shape of Data. Repeat the process using the beta distribution with parameters (a=0.5, b=0.5...
The central idea of this chapter is that making the leap from sample to population carries a certain amount of uncertainty with it. In order to be good, honest analysts, we need to be able to express and quantify this uncertainty.
The example we chose to illustrate this principle was estimating population mean from a sample's mean. You learned that the uncertainty associated with inferring the population mean from sample means is modeled by the sampling distribution of the sample means. The central limit theorem tells us the parameters we can expect of this sampling distribution. You learned that we could use these parameters on their own, or in the construction of confidence intervals, to express our level of uncertainty about our estimate.
I want to congratulate you for getting this far. The topics introduced in this chapter are very often considered the most difficult to grasp in all of introductory data analysis.
Your tenacity will be greatly rewarded, though; we have laid enough...