Nonparametric Methods

We will cover the following recipes in this chapter:

  • The Mann-Whitney test
  • Estimating nonparametric ANOVA
  • The Spearman's rank correlation test
  • LOESS regression
  • Finding the best transformations via the acepack package
  • Nonparametric multivariate tests using the npmv package
  • Semiparametric regression with the SemiPar package

Introduction

Unfortunately, parametric methods such as the t-test or ordinary least squares (OLS) make very strong assumptions about the distribution of the data. To some extent, they still work if the distributional assumptions are relaxed, but how well they work depends on the extent to which these assumptions are violated.

Nonparametric methods do not rely on the usual parametrized distributions and are instead designed to work with almost any distribution. This gives them a distinct flexibility, and we are no longer required to check any distributional assumptions on the data. If the data does follow the distribution that the corresponding parametric method requires, nonparametric methods usually perform almost as well.

The Mann-Whitney test

We have already discussed how to compare the means of two groups when both groups follow a Gaussian distribution with the same variance. The nonparametric alternative, by contrast, requires no distributional assumption and works well in almost every situation. Of course, if both distributions really are Gaussian with the same variance, then the regular t-test is better; this follows from the fact that, under those assumptions, the t-test is the uniformly most powerful (unbiased) test.

The Mann-Whitney-Wilcoxon test is a nonparametric test of the null hypothesis that an element chosen at random from group A is equally likely to be greater or smaller than an element chosen at random from group B. Another way of posing this is as a test of whether the distributions of groups A and B are the same. The only strong assumption that this test requires is that the observations...
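
As a minimal sketch (using simulated data rather than the data from this recipe), the test is available in base R through wilcox.test():

    # Two simulated samples: one Gaussian, one skewed
    set.seed(10)
    group_a <- rnorm(30, mean = 5, sd = 2)
    group_b <- rexp(30, rate = 1/5)
    # wilcox.test() performs the Mann-Whitney-Wilcoxon rank-sum test
    wilcox.test(group_a, group_b)

A small p-value suggests that values from one group tend to be larger than values from the other.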

Estimating nonparametric ANOVA

The Mann-Whitney-Wilcoxon test that we presented in the previous recipe can be extended to multiple groups (not just two, as before). For one-way Analysis of Variance (ANOVA), the corresponding test is the Kruskal-Wallis test; it is available through the kruskal.test() function in base R.

For nonparametric two-way ANOVA, the Scheirer-Ray-Hare test can be used; however, the documentation is scarce, and it is not frequently used.
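
As a rough illustration (on the built-in PlantGrowth data, not necessarily the dataset used in this recipe), the one-way Kruskal-Wallis test can be run directly in base R:

    # Kruskal-Wallis test: does plant weight differ across treatment groups?
    data(PlantGrowth)
    kruskal.test(weight ~ group, data = PlantGrowth)

The formula interface mirrors the one used by aov() and lm(), with the response on the left and the grouping factor on the right.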

Getting ready

In order to run this script, you need to install the FSA and dplyr packages.

How to do it...

We will work with nonparametric...

The Spearman's rank correlation test

The correlation coefficient between X and Y that we usually use is obtained by dividing the covariance of X and Y by the product of the standard deviations of X and Y. It is therefore restricted to lie between -1 and 1. When the correlation is -1, there is a perfect negative linear relationship between the variables; when it is 1, there is a perfect positive linear relationship; and when it is 0, there is no linear relationship between the variables. But there is an implicit assumption that we usually overlook: the correlation coefficient captures only linear relationships. It is easy to imagine many cases where there is a relationship, but not a linear one.

The Spearman rank statistic does not test correlation in the traditional sense (whether a greater than average value of X is associated linearly with a...
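
A minimal sketch of the contrast between the two statistics, on simulated data where Y is a monotone but nonlinear function of X:

    set.seed(42)
    x <- runif(100)
    y <- exp(5 * x) + rnorm(100, sd = 0.1)
    cor(x, y)                            # Pearson: understates the association
    cor.test(x, y, method = "spearman")  # Spearman: computed on the ranks

Because Spearman's statistic is computed on the ranks of X and Y, it captures any monotone relationship, linear or not.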

LOESS regression

When we have a scatterplot between two variables Y and X, we usually want to present a curve that relates them: firstly, because it allows us to see whether the relationship is linear (or almost linear); secondly, because interpreting scatterplots is sometimes hard; and finally, because we might want a simple model that can be used to predict Y in terms of X while capturing any nonlinear patterns.

Locally Estimated Scatterplot Smoothing (LOESS) regression works by fitting many local models, one around each point. In particular, each local model (fitted around a point (x0, y0)) is estimated using weighted least squares, where each observation is weighted by how close its regressors are to x0. The fitted values from these local models are then combined into a smooth curve. There is a user-specified parameter, called the bandwidth, which controls how much data is used in each one of these local regressions...
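
A minimal sketch using the built-in cars dataset (the data in this recipe may differ); in R's loess() function, the span argument plays the role of the bandwidth:

    # Fit a LOESS curve of stopping distance versus speed
    fit <- loess(dist ~ speed, data = cars, span = 0.75)
    plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
    lines(sort(cars$speed), predict(fit)[order(cars$speed)], col = "blue")

Smaller values of span use less data in each local fit and produce a wigglier curve; larger values produce a smoother one.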

Finding the best transformations via the acepack package

When fitting linear regression models, we always want them to fit the data as well as possible. Sometimes, we transform our variables in order to improve the model fit as much as possible. For example, we could apply several transformations (taking logarithms, squaring values, and so on) in order to improve the fit.

The acepack package implements the alternating conditional expectation (ACE) algorithm, which finds the optimal transformations that we need to apply to our data in order to maximize the R2. Another way of looking at this is: given the data that we have, what is the best R2 we could get if we found the best possible transformations? In this fashion, we obtain an upper bound on the fit of the best model that we would be able to get, assuming we can only transform the variables to capture...
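
A minimal sketch on simulated data (not this recipe's dataset), in which Y depends on X through a nonlinear link:

    library(acepack)
    set.seed(1)
    x <- runif(200, 0, 5)
    y <- exp(sin(x)) + rnorm(200, sd = 0.1)
    fit <- ace(x, y)
    fit$rsq                 # R2 attainable after the optimal transformations
    plot(x, fit$tx, main = "Estimated transformation of x")

The components tx and ty hold the estimated transformations of the predictors and the response, and rsq is the R2 achieved after applying them.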

Nonparametric multivariate tests using the npmv package

In our parametric scenario, we used the t-test to compare means across two populations, and Hotelling's T2 to compare a vector of means across two populations. We then extended these cases to ANOVA and MANOVA, respectively, when dealing with multiple populations. The underlying assumption is that the data comes from a Gaussian population in the first case and a multivariate Gaussian in the second one. In this recipe, we will use the npmv package to perform nonparametric MANOVA.

Traditional Multivariate Analysis of Variance (MANOVA) has two main problems: firstly, it depends on a multivariate Gaussian assumption that is hard to satisfy in practice; secondly, it is hard to identify which groups or variables produce the differences.

The npmv package offers a solution to both problems: it does not rely on any...
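
A minimal sketch, assuming the npmv formula interface in which several response variables are separated by | on the left-hand side; the iris dataset stands in for the data used in this recipe:

    library(npmv)
    # Compare three response variables across the three iris species
    nonpartest(Sepal.Length | Sepal.Width | Petal.Length ~ Species,
               data = iris, permreps = 1000)

The output reports several nonparametric test statistics and, by default, relative effects, which help identify the variables driving the differences.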

Semiparametric regression with the SemiPar package

Semiparametric models encompass a large family of models that combine a fully parametric part (a finite number of parameters) with a nonparametric part. In general, the parametric part is linear and the nonparametric part is treated as a nuisance, but this is not always the case. One example where a semiparametric model would be relevant is modeling ice-cream sales in terms of the weather and the price. The sales-weather relationship is likely highly nonlinear (sales are really high when the temperature is high, but low when the temperature is moderate), whereas the price-sales relationship could be quite linear. In that case, we would want to treat the price effect as linear and the weather effect as a nonparametric nuisance term.
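
A minimal sketch, assuming the SemiPar interface in which f() marks the term to be estimated nonparametrically (via penalized splines) while plain terms enter linearly; the data below are simulated stand-ins for the sales, temperature, and price variables described above:

    library(SemiPar)
    set.seed(7)
    temperature <- runif(200, 0, 35)
    price       <- runif(200, 1, 5)
    sales <- 50 / (1 + exp(-(temperature - 25))) - 4 * price + rnorm(200)
    # Nonparametric effect of temperature, linear effect of price
    fit <- spm(sales ~ f(temperature) + price)
    summary(fit)

summary(fit) reports the linear coefficient for price alongside information about the smooth component for temperature.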

Getting...
