Data Analysis with R, Second Edition


Product type Book
Published in Mar 2018
Publisher Packt
ISBN-13 9781788393720
Pages 570 pages
Edition 2nd Edition


Chapter 5. Using Data To Reason About The World

In Chapter 4, Probability, we mentioned that the mean height of US females is 65 inches. Now pretend we didn't know this fact—how could we find out what the average height is?

We could measure every US female, but that's impractical; we would run out of money, resources, and time before we even finished with a small city!

Inferential statistics gives us the power to answer this question using a very small sample of all US women. We can use the sample to tell us something about the population we drew it from. We can use observed data to make inferences about unobserved data. By the end of this chapter, you too will be able to go out and collect a small amount of data and use it to reason about the world!

Estimating means


In the example that spans this entire chapter, we will examine how to estimate the mean height of all US women using only samples. Specifically, we will estimate a population parameter (the mean) using sample means as the estimator.

I am going to use the vector all.us.women to represent the population. For simplicity's sake, let's say there are only 10,000 US women:

 > # setting seed will make random number generation reproducible 
 > set.seed(1) 
 > all.us.women <- rnorm(10000, mean=65, sd=3.5) 

Using the rnorm function, we have just created a vector of 10,000 normally distributed random values with the same parameters as our population of interest. Of course, at this point, we could just call mean on this vector and call it a day, but that's cheating! We are going to see that we can get very close to the population mean without actually using the entire population.

Now, let's take a random sample of ten from this population using...
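A minimal sketch of drawing such a sample, assuming base R's sample function (the same function the sampling-distribution code later in this chapter uses). The name our.sample is illustrative and not necessarily the one the book uses; the population is rebuilt so the snippet runs on its own:

```r
# recreate the simulated population from above
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

# draw ten heights at random, without replacement
our.sample <- sample(all.us.women, 10)

# the sample mean is our point estimate of the population mean
mean(our.sample)

# for comparison only: the value we are pretending not to know
mean(all.us.women)
```

With only ten observations the estimate can land noticeably above or below the population mean; rerunning the sample call without resetting the seed shows how much it moves from one sample to the next.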

The sampling distribution


So, we have estimated that the true population mean is about 65.2; we know the population mean isn't exactly 65.19704—but by just how much might our estimate be off?

To answer this question, let's take repeated samples from the population again. This time, we're going to take samples of size 40 from the population 10,000 times and plot a frequency distribution of the means:

 > means.of.our.samples <- numeric(10000) 
 > for(i in 1:10000){ 
 +   a.sample <- sample(all.us.women, 40) 
 +   means.of.our.samples[i] <- mean(a.sample) 
 + } 

We get the distribution as follows:

Figure 5.3: The sampling distribution of sample means

This frequency distribution is called a sampling distribution. In particular, as we used sample means as the value of interest, this is called the sampling distribution of the sample means (whew!). You can create a sampling distribution of any statistic (median, variance, and so on), but when we refer to sampling distributions throughout...
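As a sketch of how the sampling distribution above could be plotted and summarized, assuming the base R hist function (the book's own plotting code is not shown in this excerpt) and using replicate as a compact stand-in for the for loop. The population is rebuilt so the snippet runs on its own:

```r
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

# 10,000 sample means, each from a sample of size 40
means.of.our.samples <- replicate(10000, mean(sample(all.us.women, 40)))

# frequency distribution of the sample means
hist(means.of.our.samples, breaks=50,
     main="Sampling distribution of sample means",
     xlab="mean of a sample of 40")

# center and spread of the sampling distribution
mean(means.of.our.samples)
sd(means.of.our.samples)

# the spread should be close to the population sd over sqrt(n)
sd(all.us.women) / sqrt(40)
```

The last two values should be close to each other: the standard deviation of the sample means is approximately the population standard deviation divided by the square root of the sample size.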

Interval estimation


Again, we care about the standard error (the standard deviation of the sampling distribution of sample means) because it expresses the degree of uncertainty in our estimate. For this reason, it's not uncommon for statisticians to report the standard error along with their estimate.

What's more common, though, is for statisticians to report a range of numbers to describe their estimates; this is called interval estimation. In contrast, when we were just providing the sample mean as our estimate of the population mean, we were engaging in point estimation.

One common approach to interval estimation is to use confidence intervals. A confidence interval gives us a range over which a significant proportion of the sample means would fall when samples are repeatedly drawn from a population and their means are calculated. Concretely, a 95 percent confidence interval is the range that would contain 95 percent of the sample means if multiple samples were taken from the same...
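A minimal sketch of one common way to build a 95 percent confidence interval from a single sample, assuming the normal approximation and using the sample's own standard deviation to estimate the standard error. The variable names are illustrative, and the population is rebuilt so the snippet runs on its own:

```r
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

our.sample <- sample(all.us.women, 40)

# estimated standard error of the sample mean
std.err <- sd(our.sample) / sqrt(length(our.sample))

# about 1.96 standard errors on either side of the sample mean
z <- qnorm(.975)
interval <- c(mean(our.sample) - z * std.err,
              mean(our.sample) + z * std.err)
interval
```

Here qnorm(.975) supplies the multiplier because 95 percent of a normal distribution lies within roughly 1.96 standard deviations of its mean.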

Smaller samples


Remember when I said that the sampling distribution of sample means is approximately normal for a large enough sample size? This caveat means that for smaller sample sizes (usually considered to be below 30), the sampling distribution of the sample means is not well approximated by a normal distribution. It is, however, well approximated by another distribution: the t-distribution.

Note

A bit of history... The t-distribution is also known as the Student's t-distribution. It gets its name from the 1908 paper that introduces it, by William Sealy Gosset writing under the pen name Student. Gosset worked as a statistician at the Guinness Brewery and used the t-distribution and the related t-test to study small samples of the quality of the beer's raw constituents. He is thought to have used a pen name at the request of Guinness so that competitors wouldn't know that they were using the t statistic to their advantage.

The t-distribution has two parameters, the mean and the degrees...
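To make the difference concrete, here is a hedged sketch for a sample of size 10. It assumes the usual choice of n - 1 degrees of freedom for a one-sample mean, and it uses the base R functions qt and t.test; the variable names are illustrative:

```r
# critical values for a 95 percent interval
qnorm(.975)        # normal distribution
qt(.975, df=9)     # t-distribution with 10 - 1 degrees of freedom

set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)
small.sample <- sample(all.us.women, 10)

# interval built by hand from the t critical value
std.err <- sd(small.sample) / sqrt(length(small.sample))
t.crit <- qt(.975, df=length(small.sample) - 1)
by.hand <- c(mean(small.sample) - t.crit * std.err,
             mean(small.sample) + t.crit * std.err)
by.hand

# base R's t.test reports the same interval
t.test(small.sample)$conf.int
```

The t critical value is larger than the normal one, so the resulting interval is wider, reflecting the extra uncertainty that comes with a small sample.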

Exercises


Practice the following exercises to revise the concepts learned in this chapter:

  • Write a function that takes a vector and returns the 95 percent confidence interval for that vector. You can return the interval as a vector of length two: the lower bound and the upper bound. Then, parameterize the confidence coefficient by letting the user of your function choose their own confidence level, but keep 95 percent as the default. Hint: the first line will look like this:
conf.int <- function(data.vector, conf.coeff=.95){ 
  • Back when we introduced the central limit theorem, I said that the sampling distribution from any distribution would be approximately normal. Don't take my word for it! Create a population that is uniformly distributed using the runif function and plot a histogram of the sampling distribution using the code from this chapter and the histogram-plotting code from Chapter 2, The Shape of Data. Repeat the process using the beta distribution with parameters (a=0.5, b=0.5...

Summary


The central idea of this chapter is that making the leap from sample to population carries a certain amount of uncertainty with it. In order to be good, honest analysts, we need to be able to express and quantify this uncertainty.

The example we chose to illustrate this principle was estimating the population mean from a sample mean. You learned that the uncertainty associated with inferring the population mean from sample means is modeled by the sampling distribution of the sample means. The central limit theorem tells us the parameters we can expect of this sampling distribution. You learned that we could use these parameters on their own, or in the construction of confidence intervals, to express our level of uncertainty about our estimate.

I want to congratulate you for getting this far. The topics introduced in this chapter are very often considered the most difficult to grasp in all of introductory data analysis.

Your tenacity will be greatly rewarded, though; we have laid enough...
