You're reading from R for Data Science Cookbook (n)

Product type Book

Published in Jul 2016

ISBN-13 9781784390815

Pages 452 pages

Edition 1st Edition

Concepts

Data Science

Author (1):

Yu-Wei, Chiu (David Chiu)

Table of Contents (19) Chapters

R for Data Science Cookbook

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

Functions in R

Data Extracting, Transforming, and Loading

Data Preprocessing and Preparation

Data Manipulation

Visualizing Data with ggplot2

Making Interactive Reports

Simulation from Probability Distributions

Statistical Inference in R

Rule and Pattern Mining with R

Time Series Mining with R

Supervised Machine Learning

Unsupervised Machine Learning

Index

Chapter 8. Statistical Inference in R

This chapter covers the following topics:

Getting confidence intervals
Performing Z-tests
Performing student's T-tests
Conducting exact binomial tests
Performing Kolmogorov-Smirnov tests
Working with the Pearson's chi-squared tests
Understanding the Wilcoxon Rank Sum and Signed Rank tests
Conducting one-way ANOVA
Performing two-way ANOVA

Introduction

The most prominent feature of R is its wide variety of statistical packages. Using these packages, it is easy to obtain descriptive statistics about a dataset or to infer the characteristics of a population from sample data. Moreover, with R's plotting capabilities, we can easily display data in a variety of charts.

Statistical methods in R can be divided into descriptive statistics and inferential statistics, described as follows:

Descriptive statistics: These summarize the characteristics of data. The user can use the mean and standard deviation to describe numerical data, and frequencies and percentages to describe categorical data.
Inferential statistics: These infer the characteristics of a population from patterns within sample data. Methods relating to inferential statistics include hypothesis testing, data estimation, data correlation, and relationship modeling. Inference...
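
As a minimal sketch of the descriptive statistics mentioned above, the following summarizes one numerical and one categorical variable. It uses R's built-in mtcars dataset purely for illustration; that dataset is an assumption made here and is not part of the original recipe.

```r
# Illustrative only: mtcars ships with base R and is not used in the book's recipe
mpg_mean <- mean(mtcars$mpg)   # central tendency of a numerical variable
mpg_sd   <- sd(mtcars$mpg)     # spread of a numerical variable

cyl_counts <- table(mtcars$cyl)              # frequencies of a categorical variable
cyl_pct    <- 100 * prop.table(cyl_counts)   # the same frequencies as percentages

print(c(mean = mpg_mean, sd = mpg_sd))
print(cyl_counts)
print(round(cyl_pct, 1))
```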

Getting confidence intervals

Confidence intervals allow us to estimate the range in which an unknown population parameter is likely to lie. In this recipe, we show how to obtain confidence intervals in R.

Getting ready

Ensure that you have installed R on your operating system.

How to do it…

Perform the following steps to obtain confidence intervals:

Let's first generate a normally distributed population using the rnorm function and plot its density:

> set.seed(123)
> population <- rnorm(1000, mean = 10, sd = 3)
> dens <- density(population)
> plot(dens, col="red", main="A density plot of normal distribution")

Figure 1: A density plot of normal distribution

Next, we will draw a sample of 100 observations from the population:

> samp <- sample(population, 100)
> mean(samp)
[1] 10.32479
> sd(samp)
[1] 3.167692

At this point, we can obtain the Z-score for a 99% confidence level:
```
> 1 - 0.01 / 2
[1] 0.995

>qnorm(0.995)
[1] 2.575829
```
We can now compute the standard error and estimate the upper and lower bounds of the population mean:
```
>...
```
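
The book's own code for this step is truncated above. As a hedged sketch of the standard calculation (sample mean plus or minus the Z-score times the standard error), the bounds could be computed as follows. The variable names conf_level, z, se, lower, and upper are assumptions introduced here, and the sample is regenerated so that the block runs on its own.

```r
# Regenerate the population and sample so this block is self-contained
set.seed(123)
population <- rnorm(1000, mean = 10, sd = 3)
samp <- sample(population, 100)

conf_level <- 0.99                        # assumed confidence level, as in the text
z  <- qnorm(1 - (1 - conf_level) / 2)     # two-sided critical value
se <- sd(samp) / sqrt(length(samp))       # standard error of the sample mean

lower <- mean(samp) - z * se
upper <- mean(samp) + z * se
print(c(lower = lower, upper = upper))
```

Because the sample standard deviation stands in for the unknown population value, a t-based interval from t.test(samp, conf.level = 0.99)$conf.int would be slightly wider.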

Performing Z-tests

When making decisions, it is important to know whether decision error can be controlled or measured. In other words, we want to show that an observed result is unlikely to have occurred by chance, that is, that it is statistically significant. In hypothesis testing, there are two types of hypothesis: the null hypothesis and the alternative hypothesis (research hypothesis). The purpose of hypothesis testing is to assess whether the experiment results are significant. The alternative hypothesis is not tested directly; instead, it is accepted if the null hypothesis is rejected.

A Z-test is a parametric hypothesis test that determines whether the mean of an observed sample differs significantly from the mean of a population with a known standard deviation, based on the standard normal distribution.
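
To make the idea concrete, here is a minimal sketch of a two-sided one-sample Z-test. Base R has no built-in Z-test function, so z_test is a hypothetical helper written for this illustration, and the data are simulated rather than taken from the recipe.

```r
# Hypothetical helper: base R does not provide a z.test function
z_test <- function(x, mu, sigma) {
  z <- (mean(x) - mu) / (sigma / sqrt(length(x)))  # standardized distance from mu
  p <- 2 * pnorm(-abs(z))                          # two-sided p-value
  list(z = z, p.value = p)
}

# Simulated data for illustration only
set.seed(123)
x <- rnorm(40, mean = 10.5, sd = 3)
res <- z_test(x, mu = 10, sigma = 3)
print(res)
```

A small p-value would lead us to reject the null hypothesis that the sample comes from a population with mean mu.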

Getting ready

Ensure that you have installed R on your operating system.

How to do it…

Perform the following steps to calculate the Z-score:

First, collect the volume...

Performing student's T-tests

In a Z-test, we can determine whether a sample mean differs significantly from a population mean if the population standard deviation (or variance) is known. However, if the standard deviation is unknown and the sample size is fairly small (less than 30), we can perform a student's T-test instead. A one sample T-test allows us to test whether the mean of a sample differs significantly from a hypothesized value; a two sample T-test allows us to test whether the means of two independent groups are different. In this recipe, we will discuss how to conduct a one sample T-test and a two sample T-test using R.

Getting ready

Ensure that you have installed R on your operating system.

How to do it…

Perform the following steps to calculate the t-value:

First, we visualize a sample weight vector in a boxplot:

> weight <- c(84.12,85.17,62.18,83.97,76.29,76.89,61.37,70.38,90.98,85.71,89.33,74.56,82.01,75.19,80.97,93.82,78.97,73.58,85.86,76.44)
> boxplot(weight, main="A boxplot of weight")
> abline(h=70, lwd=2, col="red")

Figure...
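
The remaining steps of this recipe are not shown here. As a sketch of what a one sample T-test on the weight vector might look like, the following tests against a hypothesized mean of 70. That value is an assumption suggested by the red reference line drawn at 70 in the boxplot. The manual calculation is included only to show where the t-value comes from.

```r
weight <- c(84.12, 85.17, 62.18, 83.97, 76.29, 76.89, 61.37, 70.38, 90.98, 85.71,
            89.33, 74.56, 82.01, 75.19, 80.97, 93.82, 78.97, 73.58, 85.86, 76.44)

# One sample T-test against an assumed reference mean of 70
res <- t.test(weight, mu = 70)
print(res)

# The same t-value computed by hand: (sample mean - mu) / standard error
t_manual <- (mean(weight) - 70) / (sd(weight) / sqrt(length(weight)))
print(t_manual)
```

For a two sample T-test, t.test accepts a second vector, as in t.test(x, y), and applies the Welch correction for unequal variances by default.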

Conducting exact binomial tests

To perform parametric testing, one must assume that the data follows a specific distribution. However, in most cases, we do not know how the data is distributed. Thus, we can perform a nonparametric (that is, distribution-free) test instead. In the following recipes, we will show you how to perform nonparametric tests in R. First, we will cover how to conduct an exact binomial test in R.

Getting ready

In this recipe, we will use the binom.test function from the stats package.

How to do it…

Perform the following steps to conduct an exact binomial test:

Let's assume there is a game where a gambler can win by rolling a six on a die. As part of the rules, the gambler can bring their own die. If the gambler wanted to cheat in the game, they could use a loaded die to increase their chances of winning. Therefore, if we observe that the gambler won 92 games out of 315, we can assess whether the die was likely fair by conducting an exact binomial test:
```
...
```
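
The book's code for this step is elided above. A minimal sketch of such a test, assuming the null hypothesis is that a fair die shows a six with probability 1/6, might look like this:

```r
# 92 wins in 315 games; under a fair die the chance of a six is 1/6
res <- binom.test(x = 92, n = 315, p = 1/6)
print(res)

# Observed proportion of wins, for comparison with 1/6
print(unname(res$estimate))
```

The default alternative is two-sided; passing alternative = "greater" would test specifically whether the gambler wins more often than a fair die allows.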

Performing Kolmogorov-Smirnov tests

We use a one-sample Kolmogorov-Smirnov test to compare a sample with a reference probability distribution. A two-sample Kolmogorov-Smirnov test compares the cumulative distributions of two datasets. In this recipe, we will demonstrate how to perform a Kolmogorov-Smirnov test with R.

Getting ready

In this recipe, we will use the ks.test function from the stats package.

How to do it…

Perform the following steps to conduct a Kolmogorov-Smirnov test:

Validate whether the x dataset (generated with the rnorm function) is distributed normally with a one-sample Kolmogorov-Smirnov test:

> set.seed(123)
> x <- rnorm(50)
> ks.test(x, "pnorm")

  One-sample Kolmogorov-Smirnov test

data:  x
D = 0.073034, p-value = 0.9347
alternative hypothesis: two-sided

Next, we can generate uniformly distributed sample data:

> set.seed(123)
> x <- runif(n=20, min=0, max=20)
> y <- runif(n=20, min=0, max=20)

We first plot the empirical cumulative distribution functions (ECDFs) of the two generated samples:
```
>plot(ecdf...
```
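
The plotting code above is truncated, and the two-sample test itself is not shown. As a hedged sketch, a two-sample Kolmogorov-Smirnov test on the two uniform samples could be run as follows; the samples are regenerated so that the block is self-contained.

```r
# Regenerate the two uniform samples from the recipe
set.seed(123)
x <- runif(n = 20, min = 0, max = 20)
y <- runif(n = 20, min = 0, max = 20)

# Two-sample test: the null hypothesis is that x and y share one distribution
res <- ks.test(x, y)
print(res)

# D is the largest vertical gap between the two ECDFs
print(unname(res$statistic))
```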

Working with the Pearson's chi-squared tests

In this recipe, we introduce Pearson's chi-squared test, which is used to examine whether the distribution of a categorical variable differs between two groups. We will discuss how to conduct Pearson's chi-squared test in R.

Getting ready

In this recipe, we will use the chisq.test function from the stats package.

How to do it…

Perform the following steps to conduct a Pearson's chi-squared test:

First, build a matrix containing the number of male and female smokers and nonsmokers:

> mat <- matrix(c(2047, 2522, 3512, 1919), nrow = 2, dimnames = list(c("smoke","non-smoke"), c("male","female")))
> mat
          male female
smoke     2047   3512
non-smoke 2522   1919

Then, plot the proportion of male and female smokers and nonsmokers in a mosaic plot:
```
>mosaicplot(mat, main="Portion of male and female smokers/non-smokers", color = TRUE)
```
Figure 9: The mosaic plot

Next, perform a Pearson's chi-squared test on the contingency table to test whether the factor...
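
The sentence above is cut off in this excerpt. A minimal sketch of the test on the contingency table built earlier might look as follows; the matrix is redefined so that the block runs on its own.

```r
# Contingency table of smoking status by sex, as built in the recipe
mat <- matrix(c(2047, 2522, 3512, 1919), nrow = 2,
              dimnames = list(c("smoke", "non-smoke"), c("male", "female")))

# Pearson's chi-squared test of independence between the two factors
res <- chisq.test(mat)
print(res)

# Expected counts under independence, for comparison with the observed table
print(res$expected)
```

For a 2 x 2 table, chisq.test applies Yates' continuity correction by default; setting correct = FALSE gives the uncorrected statistic.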

Understanding the Wilcoxon Rank Sum and Signed Rank tests

The Wilcoxon Rank Sum test (also known as the Mann-Whitney-Wilcoxon test) is a nonparametric test of the null hypothesis that two independent groups come from the same population, without assuming that the two groups are normally distributed. The Wilcoxon Signed Rank test is its counterpart for a single sample or for paired observations. This recipe will show you how to conduct a Wilcoxon Rank Sum and Signed Rank test in R.

Getting ready

In this recipe, we will use the wilcox.test function from the stats package.

How to do it…

Perform the following steps to conduct a Wilcoxon Rank Sum and Signed Rank test:

First, prepare the Facebook likes of a fan page:

> likes <- c(17,40,57,30,51,35,59,64,37,49,39,41,17,53,21,28,46,23,14,13,11,17,15,21,9,17,10,11,13,16,18,17,27,11,12,5,8,4,12,7,11,8,4,8,7,3,9,9,9,12,17,6,10)

Then, plot the Facebook likes data as a histogram:
```
>hist(likes)
```
Figure 10: The histogram of Facebook likes of a fan page

Now, perform a one-sample Wilcoxon signed rank test to determine whether the median of the input...
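
The reference median used by the book is not visible in this excerpt. As a sketch under that caveat, the following runs a one-sample Wilcoxon signed rank test against a hypothetical median of 30; the value 30 is chosen here only for illustration.

```r
likes <- c(17, 40, 57, 30, 51, 35, 59, 64, 37, 49, 39, 41, 17, 53, 21, 28, 46, 23,
           14, 13, 11, 17, 15, 21, 9, 17, 10, 11, 13, 16, 18, 17, 27, 11, 12, 5,
           8, 4, 12, 7, 11, 8, 4, 8, 7, 3, 9, 9, 9, 12, 17, 6, 10)

# mu = 30 is a hypothetical reference median, not taken from the book.
# exact = FALSE requests the normal approximation, since the data contain ties.
res <- wilcox.test(likes, mu = 30, exact = FALSE)
print(res)
```

For two independent samples, wilcox.test(x, y) performs the rank sum test instead; adding paired = TRUE gives the signed rank test on the paired differences.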

Conducting one-way ANOVA

Analysis of variance (ANOVA) investigates the relationship between categorical independent variables and a continuous dependent variable. You can use it to test whether the means of several groups are equal. If there is only one categorical variable as an independent variable, you can perform a one-way ANOVA. On the other hand, if there are two categorical independent variables, you should perform a two-way ANOVA. In this recipe, we discuss how to conduct one-way ANOVA with R.

Getting ready

In this recipe, we will use the oneway.test and TukeyHSD functions.

How to do it…

Use the following steps to perform a one-way ANOVA:

We begin by visualizing data with a boxplot:

> data_scientist <- c(95694,82465,85001,74721,73923,94552,96723,90795,103834,120751,82634,55362,105086,79361,79679,105383,85728,71689,92719,87916)
> software_eng <- c(78069,82623,73552,85732,75354,81981,91162,83222,74088,91785,89922,84580,80864,70465,94327,70796,104247,96391,75171,65682)
>bi_eng...
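
The third salary vector is truncated in this excerpt, so the sketch below uses only the two complete groups. It shows how oneway.test and TukeyHSD fit together, not the book's exact analysis. The names salary_df, welch_res, and tukey_res are assumptions introduced here.

```r
data_scientist <- c(95694, 82465, 85001, 74721, 73923, 94552, 96723, 90795, 103834,
                    120751, 82634, 55362, 105086, 79361, 79679, 105383, 85728,
                    71689, 92719, 87916)
software_eng <- c(78069, 82623, 73552, 85732, 75354, 81981, 91162, 83222, 74088,
                  91785, 89922, 84580, 80864, 70465, 94327, 70796, 104247, 96391,
                  75171, 65682)

# Long format: one row per salary, with a factor naming the group
salary_df <- data.frame(
  salary = c(data_scientist, software_eng),
  profession = factor(rep(c("data_scientist", "software_eng"),
                          times = c(length(data_scientist), length(software_eng))))
)

# One-way test of equal group means (Welch version, the oneway.test default)
welch_res <- oneway.test(salary ~ profession, data = salary_df)
print(welch_res)

# Post hoc pairwise comparison on a classical ANOVA fit
tukey_res <- TukeyHSD(aov(salary ~ profession, data = salary_df))
print(tukey_res)
```

With three or more groups, TukeyHSD reports one row per pair of professions, which is where the post hoc comparison becomes informative.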

Performing two-way ANOVA

Two-way ANOVA can be viewed as an extension of one-way ANOVA because the analysis covers two categorical variables rather than just one. In this recipe, we will discuss how to conduct two-way ANOVA in R.

Getting ready

Download the engineer salary dataset from the following link and ensure that you have installed R on your operating system: https://github.com/ywchiu/rcookbook/raw/master/chapter5/engineer.csv.

How to do it…

Use the following steps to perform two-way ANOVA:

First, load the engineer's salary data from engineer.csv:
```
>engineer<-read.csv("engineer.csv", header = TRUE)
```

Plot boxplots of salary by profession and by region:

> par(mfrow=c(1,2))
> boxplot(Salary~Profession, data = engineer, xlab='Profession', ylab = "Salary", main='Salary v.s. Profession')
> boxplot(Salary~Region, data = engineer, xlab='Region', ylab = "Salary", main='Salary v.s. Region')

Figure 14: A boxplot of Salary versus Profession and Salary versus Region

Also...
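
The rest of this recipe is not available in this excerpt. To illustrate the general shape of a two-way ANOVA, the sketch below fits a model with both factors and their interaction. Since engineer.csv is not bundled here, it uses simulated data in a data frame called engineer_sim; the column names Salary, Profession, and Region mirror the ones used above, while the level names and values are invented for illustration.

```r
# Simulated stand-in for engineer.csv: 3 professions x 2 regions x 10 replicates
set.seed(123)
engineer_sim <- expand.grid(
  Profession = c("A", "B", "C"),     # hypothetical level names
  Region     = c("East", "West"),    # hypothetical level names
  replicate  = 1:10
)
engineer_sim$Salary <- rnorm(nrow(engineer_sim), mean = 80000, sd = 10000)

# Two-way ANOVA with main effects and their interaction
fit <- aov(Salary ~ Profession * Region, data = engineer_sim)
anova_tbl <- summary(fit)[[1]]
print(anova_tbl)
```

With the real data, replacing engineer_sim with the engineer data frame loaded above should work unchanged, provided its columns carry the same names.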
