You're reading from R Statistics Cookbook

Product typeBook

Published inMar 2019

Reading LevelExpert

PublisherPackt

ISBN-139781789802566

Edition1st Edition

Languages

Tools

ggplot

Concepts

Statistics

Author (1)

Francisco Juretig

Simple random sampling

In many situations, we are interested in taking a sample of the data. There could be multiple reasons for doing this, but in most practical considerations this happens due to budget constraints. For example, what if we have a certain amount people in a neighborhood, and we want to estimate what proportion of them is supporting Candidates A or B in an election? Visiting each one of them would be prohibitively expensive, so we might decide to find a smaller subset that we can visit and ask them who they are going to vote for.

The sample function in R can be used to take a sample with or without the replacement of any arbitrary size. Once a sample is defined, we can sample those units. An interesting question is estimating the variability of that sample with respect to a given sample size. In order to do that, we will build a thousand replications of our sampling exercise, and we will get the upper and lower boundaries that enclose 95% of the cases. Nevertheless, we could also compute approximate confidence intervals for the proportion of people voting for Candidate A, using the well known Gaussian approximation to an estimated proportion.

When the sample size is not very small, the estimated proportion is distributed approximately as a Gaussian distribution:

This can be used to compute an approximate confidence interval, where we need to choose in order to achieve an interval of _,as implemented in the following formula:

Both methods will be in agreement for reasonably large samples (>100), but will differ when the sample sizes are very small. Still, the method that uses the sample function can be used for more complex situations, such as cases when we use sample weights.

Getting ready

In order to run this exercise, we need to install the dplyr and the ggplot2 packages via the install.packages() function.

How to do it...

In this recipe, we have the number of people in a town, and whether they vote for Candidate A or Candidate B – these are flagged by 1s and 0s. We want to study the variability of several sample sizes, and ultimately we want to define which sample size is appropriate. Intuition would suggest that the error decreases nonlinearly as we increase the sample size: when the sample size is small, the error should decrease slowly as we increase the sample size. On the other hand, it should decrease quickly when the sample is large.

Import the libraries:

library(dplyr)
library(ggplot2)

Load the dataset:

voters_data = read.csv("./voters_.csv")

Take 1,000 samples of size 10 and calculate the proportion of people voting for Candidate A:

proportions_10sample = c()
for (q in 2:1000){
 sample_data = mean(sample(voters_data$Vote,10,replace = FALSE))
 proportions_10sample = c(proportions_10sample,sample_data)
}

Take 1,000 samples of size 50 and calculate the proportion of people voting for Candidate A:

proportions_50sample = c()
for (q in 2:1000){
 sample_data = mean(sample(voters_data$Vote,50,replace = FALSE))
 proportions_50sample = c(proportions_50sample,sample_data)
}

Take 1,000 samples of size 100 and calculate the proportion of people voting for Candidate A:

proportions_100sample = c()
for (q in 2:1000){
 sample_data = mean(sample(voters_data$Vote,100,replace = FALSE))
 proportions_100sample = c(proportions_100sample,sample_data)
}

Take 1,000 samples of size 500 and calculate the proportion of people voting for Candidate A:

proportions_500sample = c()
for (q in 2:1000){
 sample_data = mean(sample(voters_data$Vote,500,replace = FALSE))
 proportions_500sample = c(proportions_500sample,sample_data)
}

We combine all the DataFrames, and calculate the 2.5% and 97.5% quantiles:

joined_data50 = data.frame("sample_size"=50,"mean"=mean(proportions_50sample), "q2.5"=quantile(proportions_50sample,0.025),"q97.5"=quantile(proportions_50sample,0.975))
joined_data10 = data.frame("sample_size"=10,"mean"=mean(proportions_10sample), "q2.5"=quantile(proportions_10sample,0.025),"q97.5"=quantile(proportions_10sample,0.975))
joined_data100 = data.frame("sample_size"=100,"mean"=mean(proportions_100sample), "q2.5"=quantile(proportions_100sample,0.025),"q97.5"=quantile(proportions_100sample,0.975))
joined_data500 = data.frame("sample_size"=500,"mean"=mean(proportions_500sample), "q2.5"=quantile(proportions_500sample,0.025),"q97.5"=quantile(proportions_500sample,0.975))

After combining them, we use the Gaussian approximation to get the 2.5% and 97.5% quantiles. Note that we use 1.96, which is the associated 97.5% quantile (and -1.96 for the associated 2.5% quantile, due to the symmetry of the Gaussian distribution):

data_sim = rbind(joined_data10,joined_data50,joined_data100,joined_data500)
data_sim = data_sim %>% mutate(Nq2.5 = mean - 1.96*sqrt(mean*(1-mean)/sample_size),N97.5 = mean + 1.96*sqrt(mean*(1-mean)/sample_size))
data_sim$sample_size = as.factor(data_sim$sample_size)

Plot the previous DataFrame using the ggplot function:

ggplot(data_sim, aes(x=sample_size, y=mean, group=1)) +
 geom_point(aes(size=2), alpha=0.52) + theme(legend.position="none") +
 geom_errorbar(width=.1, aes(ymin=q2.5, ymax=q97.5), colour="darkred") + labs(x="Sample Size",y= "Candidate A ratio", title="Candidate A ratio by sample size", subtitle="Proportion of people voting for candidate A, assuming 50-50 chance", caption="Circle is mean / Bands are 95% Confidence bands")

This provides the following result:

How it works...

We run several simulations, using different sample sizes. For each sample size, we take 1,000 samples and we get the mean, and 97.5% and 2.5% percentiles. We use them to construct a 95% confidence interval. It is evident that the greater the sample sizes are, the narrower these bands will be. We can see that when taking just 10 elements the bands are extremely wide, and they get narrower quite quickly when we jump to a larger sample size.

In the following table you can see why, when elections are tight (nearly 50% of voters choosing A), they are so difficult to predict. Since the election is won by whoever has the majority of the votes, and our confidence interval ranges from 0.45 to 0.55 (even when taking a sample size of 500 elements), we can't be certain of who the winner will be.

Take a look at the following results—95% intervals—left simulated ones; right Gaussian approximation:

sample_size	mean	2.5% quantile (sampling)	97.5% quantile (sampling)	Gaussian 2.5% quantile	Gaussian 97.5%
10	0.507	0.20	0.80	0.19	0.81
50	0.504	0.38	0.64	0.36	0.64
100	0.505	0.41	0.60	0.40	0.60
500	0.505	0.45	0.55	0.46	0.54

You have been reading a chapter from

R Statistics Cookbook

Published in: Mar 2019Publisher: PacktISBN-13: 9781789802566

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Author (1)

Francisco Juretig

Francisco Juretig has worked for over a decade in a variety of industries such as retail, gambling and finance deploying data-science solutions. He has written several R packages, and is a frequent contributor to the open source community.
Read more about Francisco Juretig

Other recommended products

Related to this chapter

Data Analysis with IBM SPSS Statistics

SPSS Statistics is a software package used for logical batched and non-batched statistical analysis. Analytical tools such as SPSS can readily provide even a novice user with an overwhelming amount of information and a broad range of options for analyzing patterns in the data. This book will have a comprehensive coverage of IBM’s premier statistics and data analysis tool – IBM SPSS Statistics. It is designed for business professionals who wish to analyze their data. By the end of this book, you will have a firm understanding of the various statistical analysis techniques offered by SPSS Statistics, and be able to master its use for data analysis with ease.

BookSep 2017446 pages

Associations and Correlations

Through this book, you’ll learn why most statistical techniques give incorrect results and what you can do to avoid the most common pitfalls. You’ll learn how to make sure you get the correct results the first time, every time.

BookJun 2019134 pages

Machine Learning with R Cookbook

The R language is a powerful open source functional programming language. At its core, R is a statistical language that provides impressive tools to analyze data and create high-level graphics. This book covers the basics of R by setting up a user-friendly programming environment and programming ETL in R. Data exploration examples are provided that demonstrate how powerful data visualisation and machine learning is in discovering hidden relationships. You will also explore air quality data, steps to fix the missing values and visualising the same. You will then dive into important machine learning topics, including data classification, regression, survival analysis, time series analysis, clustering association rule mining, and dimension reduction.This book will include the latest code and examples based on R 3.3 and above—updated for better computation, accuracy, and speed with R.

BookOct 2017572 pages

Bayesian Analysis with Python

Bayesian inference uses probability distributions and Bayes' theorem to build flexible models. The book uses PyMC3 to abstract all the mathematical and computational details from this process allowing readers to solve a wide range of problems in data science.

BookDec 2018356 pages4

Regression Analysis with R

Regression analysis is a statistical process which enables prediction of relationships between variables. This book will give you a rundown explaining what regression analysis is, explaining you the process from scratch. Each chapter starts with explaining the theoretical concepts and once the reader gets comfortable with the theory, we move to the practical examples to support the understanding. By the end of this book you will know all the concepts and pain-points related to regression analysis, and you will be able to implement your learning in your projects.

BookJan 2018422 pages

Practical Time Series Analysis

Practical Time Series Analysis will introduce you to the basic concepts of time series analysis and describe powerful yet simple techniques in Python which data scientists and data engineers would find useful in dealing with real life datasets in industrial settings. This book focuses on explaining important concepts and practical techniques to process, summarize and model time series data. Real life case studies with code snippets in Python are used to demonstrate the concepts and techniques.

BookSep 2017244 pages

Hands-On Time Series Analysis with R

This book introduces you to time series analysis and forecasting with R; this is one of the key fields in statistical programming and includes techniques for analyzing data to extract meaningful insights. You will explore methods, such as prediction with time series analysis, and identify the relationship between each data point in the series.

BookMay 2019448 pages

Data Analysis with R

R has spread deep into the private sector and can be found in the production pipelines at some of the most advanced and successful enterprises. Starting with the basics of R and statistical reasoning, this book dives into advanced predictive analytics, showing how to apply those techniques to real-world data though with real-world examples.

BookMar 2018570 pages

Statistical Application Development with R and Python

Statistical Analysis involves collecting and examining data to describe the nature of data that needs to be analyzed. It helps you explore the relation of data and build models to make better decisions. You will begin with a brief understanding of the nature of data and end with modern and advanced statistical models like CART. Every step is taken with DATA and R code, and further enhanced by Python. By the end of this book you will be able to apply your statistical learning in major domains at work or in your projects.

BookAug 2017432 pages

Learning Quantitative Finance with R

This book covers applications of quantitative finance in R. It starts with the basics of quantitative finance and goes to complexity at the end of the book along with a varying degree of R complexity. This will guide you to implement different trading strategies for various financial instruments using basic to complex techniques along with its optimization and keeping the risk of financial instruments in check.

BookMar 2017284 pages

Practical Machine Learning Cookbook

Machine learning is the new BLACK GOLD. In this book, we explore topics such as classification, clustering, model selection and regularization, nonlinearity, supervised, unsupervised, and reinforcement learning, structured prediction, neural networks, deep learning, and case studies. The algorithms are developed using R.The book is for students and professionals in the field of statistics, data analytics, and computer science.

BookApr 2017570 pages

Advanced Analytics with R and Tableau

R is the go-to tool for statistics and data mining while Tableau offers an interface to filter data, plug and play with rich visualizations to describe insights from your data. When combined these two tools makes it easier to harness interesting patterns and communicate stories. This book covers various analytical techniques like prediction, classification, clustering and best practices to visualize it using interactive dashboard with drop-downs, sliders, and other visual cues of Tableau. Get to know how R can be used in conjunction with Tableau and implement powerful machine learning techniques making big data analytics accessible and presentable through Tableau workbooks.

BookAug 2017178 pages

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages