R for Data Science Cookbook (n)


Product type: Book
Published: Jul 2016
ISBN-13: 9781784390815
Pages: 452
Edition: 1st Edition
Author: Yu-Wei, Chiu (David Chiu)


Chapter 8. Statistical Inference in R

This chapter covers the following topics:

  • Getting confidence intervals

  • Performing Z-tests

  • Performing Student's T-tests

  • Conducting exact binomial tests

  • Performing Kolmogorov-Smirnov tests

  • Working with Pearson's chi-squared tests

  • Understanding the Wilcoxon Rank Sum and Signed Rank tests

  • Conducting one-way ANOVA

  • Performing two-way ANOVA

Introduction


One of R's most prominent features is the wide variety of statistical packages it offers. Using these packages, it is easy to obtain descriptive statistics about a dataset or to infer the distribution of a population from sample data. Moreover, with R's plotting capabilities, we can easily display data in a variety of charts.

Statistical methods in R fall into two broad categories, descriptive statistics and inferential statistics, described as follows:

  • Descriptive statistics: These are used to summarize the characteristics of data. The user can use the mean and standard deviation to describe numerical data, and frequencies and percentages to describe categorical data.

  • Inferential statistics: These are used to infer the characteristics of a population from patterns within sample data. Methods relating to inferential statistics include hypothesis testing, data estimation, data correlation, and relationship modeling. Inference...
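The descriptive measures mentioned above can be sketched in a few lines. This is only an illustration, not code from the book: it assumes the built-in mtcars dataset, treating mpg as the numerical variable and cyl as the categorical one.

```r
# Illustrative sketch using the built-in mtcars dataset (an assumption, not the book's data).

# Numerical variable: summarize with the mean and standard deviation.
mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)

# Categorical variable: summarize with frequencies and percentages.
cyl_freq <- table(mtcars$cyl)
cyl_pct <- round(100 * prop.table(cyl_freq), 1)

print(c(mean = mpg_mean, sd = mpg_sd))
print(cyl_freq)
print(cyl_pct)
```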

Getting confidence intervals


Confidence intervals allow us to estimate the range within which an unknown population parameter is likely to lie. In this recipe, we will show you how to obtain confidence intervals in R.

Getting ready

Ensure that you have installed R on your operating system.

How to do it…

Perform the following steps to obtain confidence intervals:

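  1. Let's first draw 1,000 values from a normal distribution using the rnorm function and plot their density:

    > set.seed(123)
    > population <- rnorm(1000, mean = 10, sd = 3)
    > dens <- density(population)  # kernel density estimate of the generated values
    > plot(dens, col="red", main="A density plot of normal distribution")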
    

    Figure 1: A density plot of normal distribution

  2. Next, we draw a sample of 100 observations from the population:

    > samp <- sample(population, 100)
    > mean(samp)
    [1] 10.32479
    > sd(samp)
    [1] 3.167692
    
  3. At this point, we can obtain the Z-score at a 99% confidence level:

    > 1 - 0.01 / 2
    [1] 0.995

    > qnorm(0.995)
    [1] 2.575829
    
  4. We can now compute the standard error of the mean and estimate the upper and lower bounds of the population mean:

    >...
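Since the listing for this step is cut off, here is a minimal sketch of how such an interval could be computed. It is not the book's code: the variable names conf_level, z, se, and ci are assumptions, and the interval uses the usual normal-based formula, mean ± z × sd / sqrt(n).

```r
# Sketch of a normal-based 99% confidence interval for the population mean.
# Variable names (conf_level, z, se, ci) are illustrative assumptions.
set.seed(123)
population <- rnorm(1000, mean = 10, sd = 3)
samp <- sample(population, 100)

conf_level <- 0.99
z <- qnorm(1 - (1 - conf_level) / 2)     # two-sided critical value
se <- sd(samp) / sqrt(length(samp))      # standard error of the sample mean

ci <- c(lower = mean(samp) - z * se,
        upper = mean(samp) + z * se)
print(ci)
```

A wider confidence level (for example, 0.999) gives a larger critical value and therefore a wider interval for the same sample.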

Performing Z-tests


When making decisions, it is important to know whether decision error can be controlled or measured. In other words, we want to show that an observed result is unlikely to have occurred by chance, that is, that it is statistically significant. In hypothesis testing, there are two types of hypothesis: the null hypothesis and the alternative hypothesis (research hypothesis). The purpose of hypothesis testing is to determine whether the experiment results are significant. The alternative hypothesis is not tested directly; instead, it is accepted when the null hypothesis is rejected.

A Z-test is a parametric hypothesis test that determines whether the mean of an observed sample differs significantly from the mean of a population with a known standard deviation, based on the standard normal distribution.

Getting ready

Ensure that you have installed R on your operating system.

How to do it…

Perform the following steps to calculate the Z-score:

  1. First, collect the volume...
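As a rough illustration of the Z-test described in this recipe, the following sketch computes a two-sided Z-statistic and p-value by hand. Base R has no built-in Z-test function, so z_test is a hypothetical helper, and the data is simulated here rather than taken from the book.

```r
# Hypothetical helper: two-sided one-sample Z-test with known population sd.
z_test <- function(x, mu, sigma) {
  z <- (mean(x) - mu) / (sigma / sqrt(length(x)))  # standardized distance from mu
  p <- 2 * pnorm(-abs(z))                          # two-sided p-value
  list(statistic = z, p.value = p)
}

# Simulated sample (an assumption for illustration only).
set.seed(123)
x <- rnorm(40, mean = 10.5, sd = 3)

res <- z_test(x, mu = 10, sigma = 3)
print(res)
```

A p-value below the chosen significance level (commonly 0.05) would lead us to reject the null hypothesis that the sample comes from a population with mean mu.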

Performing Student's T-tests


In a Z-test, we can determine whether two mean values are significantly different if the standard deviation (or variance) is known. However, if the standard deviation is unknown and the sample size is fairly small (less than 30), we can perform a Student's T-test instead. A one sample T-test allows us to test whether the mean of a sample differs significantly from a hypothesized value; a two sample T-test allows us to test whether the means of two independent groups are different. In this recipe, we will discuss how to conduct a one sample T-test and a two sample T-test using R.

Getting ready

Ensure that you have installed R on your operating system.

How to do it…

Perform the following steps to calculate the t-value:

  1. First, we visualize a sample weight vector in a boxplot:

    > weight <- c(84.12,85.17,62.18,83.97,76.29,76.89,61.37,70.38,90.98,85.71,89.33,74.56,82.01,75.19,80.97,93.82,78.97,73.58,85.86,76.44)
    > boxplot(weight, main="A boxplot of weight")
    > abline(h=70, lwd=2, col="red")
    

    Figure...
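The remaining steps are cut off here, so the following is only a sketch of how a one sample T-test on the weight vector might look. It assumes a hypothesized mean of 70 (suggested by the red reference line drawn at 70, but not confirmed by the visible text) and checks the result of t.test against the t-value computed by hand.

```r
weight <- c(84.12, 85.17, 62.18, 83.97, 76.29, 76.89, 61.37, 70.38, 90.98, 85.71,
            89.33, 74.56, 82.01, 75.19, 80.97, 93.82, 78.97, 73.58, 85.86, 76.44)

mu0 <- 70  # assumed hypothesized mean, taken from the reference line in the boxplot

# t-value by hand: distance of the sample mean from mu0 in standard-error units.
t_manual <- (mean(weight) - mu0) / (sd(weight) / sqrt(length(weight)))

# The same test with the built-in function (two-sided by default).
res <- t.test(weight, mu = mu0)
print(res)
```

For a two sample T-test, t.test also accepts a second vector, as in t.test(x, y).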

Conducting exact binomial tests


To perform parametric testing, one must assume that the data follows a specific distribution. However, in most cases, we do not know how the data is distributed. Thus, we can perform a nonparametric (that is, distribution-free) test instead. In the following recipes, we will show you how to perform nonparametric tests in R. First, we will cover how to conduct an exact binomial test in R.

Getting ready

In this recipe, we will use the binom.test function from the stats package.

How to do it…

Perform the following steps to conduct an exact binomial test:

  1. Let's assume there is a game where a gambler can win by rolling a six on a die. As part of the rules, the gambler can bring their own die. If the gambler tried to cheat in the game, they would use a loaded die to increase their chances of winning. Therefore, if we observe that the gambler won 92 games out of 315, we could determine whether the die was likely fair by conducting an exact binomial test:

    ...
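The listing for this step is not shown, so the following is a hedged sketch of one way to run the test. It assumes that a fair die gives a winning probability of 1/6 and uses the default two-sided alternative; the book's exact call may differ.

```r
# Exact binomial test: 92 wins out of 315 games against a fair-die probability of 1/6.
# The choice of p = 1/6 and the two-sided alternative are assumptions.
res <- binom.test(x = 92, n = 315, p = 1/6)
print(res)

# Observed winning proportion, for comparison with 1/6.
observed <- 92 / 315
print(observed)
```

To test only whether the gambler wins more often than a fair die allows, alternative = "greater" can be passed to binom.test.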

Performing Kolmogorov-Smirnov tests


We use a one-sample Kolmogorov-Smirnov test to compare a sample with a reference probability distribution. A two-sample Kolmogorov-Smirnov test compares the cumulative distributions of two datasets. In this recipe, we will demonstrate how to perform a Kolmogorov-Smirnov test with R.

Getting ready

In this recipe, we will use the ks.test function from the stats package.

How to do it…

Perform the following steps to conduct a Kolmogorov-Smirnov test:

  1. Validate whether the x dataset (generated with the rnorm function) is distributed normally with a one-sample Kolmogorov-Smirnov test:

    > set.seed(123)
    > x <- rnorm(50)
    > ks.test(x, "pnorm")

      One-sample Kolmogorov-Smirnov test

    data:  x
    D = 0.073034, p-value = 0.9347
    alternative hypothesis: two-sided
    
  2. Next, we can generate uniformly distributed sample data:

    > set.seed(123)
    > x <- runif(n=20, min=0, max=20)

    > y <- runif(n=20, min=0, max=20)
    
  3. We first plot the ECDF of two generated data samples:

    >plot(ecdf...
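To make the two-sample comparison concrete, the following sketch (not the book's remaining steps, which are cut off) computes the Kolmogorov-Smirnov statistic by hand as the largest vertical gap between the two ECDFs and compares it with the value reported by ks.test. The names Fx, Fy, pooled, and d_manual are illustrative.

```r
set.seed(123)
x <- runif(n = 20, min = 0, max = 20)
y <- runif(n = 20, min = 0, max = 20)

# Empirical cumulative distribution functions of both samples.
Fx <- ecdf(x)
Fy <- ecdf(y)

# The two-sided statistic D is the largest absolute difference between the
# two ECDFs, evaluated at every observed point.
pooled <- c(x, y)
d_manual <- max(abs(Fx(pooled) - Fy(pooled)))

# Two-sample Kolmogorov-Smirnov test.
res <- ks.test(x, y)
print(res)
print(d_manual)
```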

Working with Pearson's chi-squared tests


In this recipe, we introduce Pearson's chi-squared test, which is used to examine whether the distribution of a categorical variable differs between two groups. We will discuss how to conduct Pearson's chi-squared test in R.

Getting ready

In this recipe, we will use the chisq.test function from the stats package.

How to do it…

Perform the following steps to conduct a Pearson's chi-squared test:

  1. First, build a matrix containing the number of male and female smokers and nonsmokers:

    > mat <- matrix(c(2047, 2522, 3512, 1919), nrow = 2, dimnames = list(c("smoke","non-smoke"), c("male","female")))
    > mat
              male female
    smoke     2047   3512
    non-smoke 2522   1919
    
  2. Then, plot the proportion of male and female smokers and nonsmokers in a mosaic plot:

    > mosaicplot(mat, main="Portion of male and female smokers/non-smokers", color = TRUE)
    

    Figure 9: The mosaic plot

  3. Next, perform a Pearson's chi-squared test on the contingency table to test whether the factor...
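The step above is truncated, so here is a minimal sketch of the test on the same contingency table. It relies on the defaults of chisq.test, which apply Yates' continuity correction to 2x2 tables; whether the book keeps that default is not visible here.

```r
mat <- matrix(c(2047, 2522, 3512, 1919), nrow = 2,
              dimnames = list(c("smoke", "non-smoke"), c("male", "female")))

# Pearson's chi-squared test of independence between smoking status and gender.
# For a 2x2 table, chisq.test applies Yates' continuity correction by default.
res <- chisq.test(mat)
print(res)

# Expected counts under the null hypothesis of independence.
print(res$expected)
```

A small p-value indicates that smoking status and gender are unlikely to be independent in this table.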

Understanding the Wilcoxon Rank Sum and Signed Rank tests


The Wilcoxon Rank Sum test (also known as the Mann-Whitney-Wilcoxon test) is a nonparametric test of the null hypothesis that two independent groups come from the same population, without assuming that the two groups are normally distributed. The Wilcoxon Signed Rank test is its counterpart for a single sample or for paired samples. This recipe will show you how to conduct Wilcoxon Rank Sum and Signed Rank tests in R.

Getting ready

In this recipe, we will use the wilcox.test function from the stats package.

How to do it…

Perform the following steps to conduct a Wilcoxon Rank Sum and Signed Rank test:

  1. First, prepare the Facebook likes of a fan page:

    > likes <- c(17,40,57,30,51,35,59,64,37,49,39,41,17,53,21,28,46,23,14,13,11,17,15,21,9,17,10,11,13,16,18,17,27,11,12,5,8,4,12,7,11,8,4,8,7,3,9,9,9,12,17,6,10)
    
  2. Then, plot the Facebook likes data as a histogram:

    > hist(likes)
    

    Figure 10: The histogram of Facebook likes of a fan page

  3. Now, perform a one-sample Wilcoxon signed rank test to determine whether the median of the input...
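Because the rest of this step is cut off, the sketch below only illustrates the general shape of a one-sample Wilcoxon signed rank test on the likes vector. The hypothesized median of 30 is a hypothetical value chosen for illustration; the value used in the book is not visible. The argument exact = FALSE requests the normal approximation, which is appropriate here because the data contains ties.

```r
likes <- c(17, 40, 57, 30, 51, 35, 59, 64, 37, 49, 39, 41, 17, 53, 21, 28, 46, 23,
           14, 13, 11, 17, 15, 21, 9, 17, 10, 11, 13, 16, 18, 17, 27, 11, 12, 5, 8,
           4, 12, 7, 11, 8, 4, 8, 7, 3, 9, 9, 9, 12, 17, 6, 10)

hypothesized_median <- 30  # hypothetical value, for illustration only

# One-sample Wilcoxon signed rank test against the hypothesized median.
res <- wilcox.test(likes, mu = hypothesized_median, exact = FALSE)
print(res)
```

Passing a second vector, as in wilcox.test(x, y), performs the Wilcoxon Rank Sum test on two independent groups instead.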

Conducting one-way ANOVA


Analysis of variance (ANOVA) investigates the relationship between categorical independent variables and a continuous dependent variable. You can use it to test whether the means of several groups are equal. If there is only one categorical variable as an independent variable, you can perform a one-way ANOVA. On the other hand, if there are two categorical independent variables, you should perform a two-way ANOVA. In this recipe, we discuss how to conduct one-way ANOVA with R.

Getting ready

In this recipe, we will use the oneway.test and TukeyHSD functions.

How to do it…

Follow these steps to perform a one-way ANOVA:

  1. We begin by visualizing data with a boxplot:

    > data_scientist <- c(95694,82465,85001,74721,73923,94552,96723,90795,103834,120751,82634,55362,105086,79361,79679,105383,85728,71689,92719,87916)
    > software_eng <- c(78069,82623,73552,85732,75354,81981,91162,83222,74088,91785,89922,84580,80864,70465,94327,70796,104247,96391,75171,65682)
    >bi_eng...
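The third group's data is cut off in the listing, so the following sketch uses only the two salary vectors that are fully shown. It illustrates the oneway.test and TukeyHSD calls named in this recipe; the names salary, profession, ow, fit, and tukey are assumptions, and further groups could be appended to salary and profession in the same way.

```r
data_scientist <- c(95694, 82465, 85001, 74721, 73923, 94552, 96723, 90795, 103834, 120751,
                    82634, 55362, 105086, 79361, 79679, 105383, 85728, 71689, 92719, 87916)
software_eng <- c(78069, 82623, 73552, 85732, 75354, 81981, 91162, 83222, 74088, 91785,
                  89922, 84580, 80864, 70465, 94327, 70796, 104247, 96391, 75171, 65682)

# Stack the groups into one response vector with a matching group factor.
salary <- c(data_scientist, software_eng)
profession <- factor(rep(c("Data Scientist", "Software Engineer"),
                         times = c(length(data_scientist), length(software_eng))))

# One-way ANOVA assuming equal variances across groups.
ow <- oneway.test(salary ~ profession, var.equal = TRUE)
print(ow)

# Tukey's honest significant differences for pairwise group comparisons.
fit <- aov(salary ~ profession)
tukey <- TukeyHSD(fit)
print(tukey)
```

With only two groups, the F-statistic equals the square of the pooled two sample t-statistic; the approach becomes more informative once three or more groups are compared.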

Performing two-way ANOVA


Two-way ANOVA can be viewed as an extension of one-way ANOVA because the analysis covers two categorical variables rather than just one. In this recipe, we will discuss how to conduct two-way ANOVA in R.

Getting ready

Download the engineer salary dataset from the following link and ensure that you have installed R on your operating system: https://github.com/ywchiu/rcookbook/raw/master/chapter5/engineer.csv.

How to do it…

Follow these steps to perform a two-way ANOVA:

  1. First, load the engineer's salary data from engineer.csv:

    > engineer <- read.csv("engineer.csv", header = TRUE)
    
  2. Draw two boxplots of salary, one by profession and one by region:

    > par(mfrow=c(1,2))
    > boxplot(Salary~Profession, data = engineer, xlab='Profession', ylab = "Salary", main='Salary v.s. Profession')
    > boxplot(Salary~Region, data = engineer, xlab='Region', ylab = "Salary", main='Salary v.s. Region')
    

    Figure 14: A boxplot of Salary versus Profession and Salary versus Region

  3. Also...
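The remaining steps are not shown, so the following is only a sketch of how a two-way ANOVA with an interaction term could be fitted. To keep it self-contained, it uses a small synthetic data frame, engineer_demo, in place of engineer.csv. The column names Salary, Profession, and Region follow the boxplot calls above, but the factor levels and salary values are made up for illustration.

```r
# Synthetic stand-in for engineer.csv (values and factor levels are hypothetical).
set.seed(123)
engineer_demo <- expand.grid(
  Profession = c("Data Scientist", "Software Engineer"),
  Region = c("East", "West"),
  replicate = 1:10
)
engineer_demo$Profession <- factor(engineer_demo$Profession)
engineer_demo$Region <- factor(engineer_demo$Region)
engineer_demo$Salary <- rnorm(nrow(engineer_demo), mean = 85000, sd = 10000)

# Two-way ANOVA: main effects of Profession and Region plus their interaction.
fit <- aov(Salary ~ Profession * Region, data = engineer_demo)
tab <- summary(fit)[[1]]
print(tab)
```

With the real dataset loaded into engineer, the same formula could be passed with data = engineer.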
