You're reading from Building Statistical Models in Python
Product type: Book · Published in: Aug 2023 · Publisher: Packt · Edition: 1st · ISBN-13: 9781804614280 · Reading level: Intermediate
Authors (3):
Huy Hoang Nguyen

Huy Hoang Nguyen is a Mathematician and a Data Scientist with far-ranging experience, championing advanced mathematics and strategic leadership, and applied machine learning research. He holds a Master's in Data Science and a PhD in Mathematics. His previous work was related to Partial Differential Equations, Functional Analysis and their applications in Fluid Mechanics. He transitioned from academia to the healthcare industry and has performed different Data Science projects from traditional Machine Learning to Deep Learning.

Paul N Adams

Paul Adams is a Data Scientist with a background primarily in the healthcare industry. Paul applies statistics and machine learning in multiple areas of industry, focusing on projects in process engineering, process improvement, metrics and business rules development, anomaly detection, forecasting, clustering and classification. Paul holds a Master of Science in Data Science from Southern Methodist University.

Stuart J Miller

Stuart Miller is a Machine Learning Engineer with degrees in Data Science, Electrical Engineering, and Engineering Physics. Stuart has worked at several Fortune 500 companies, including Texas Instruments and StateFarm, where he built software that utilized statistical and machine learning techniques. Stuart is currently an engineer at Toyota Connected helping to build a more modern cockpit experience for drivers using machine learning.


Non-Parametric Tests

In the previous chapter, we discussed parametric tests. Parametric tests are useful when test assumptions are met. However, there are cases where those assumptions are not met. In this chapter, we will discuss several non-parametric alternatives to the parametric tests presented in the previous chapter. We start by introducing the concept of a non-parametric test. Then, we will discuss several non-parametric tests that can be used when t-test or z-test assumptions are not met.

In this chapter, we’re going to cover the following main topics:

  • When parametric test assumptions are violated
  • The rank-sum test
  • The signed-rank test
  • The Kruskal-Wallis test
  • The chi-square test
  • Spearman’s correlation analysis
  • Chi-square power analysis

When parametric test assumptions are violated

In the previous chapter, we discussed parametric tests. Parametric tests have strong statistical power, but they also require adherence to strong assumptions. When those assumptions are not satisfied, the test results are not valid. Fortunately, there are alternative tests that can be used when the assumptions of a parametric test are not satisfied. These are called non-parametric tests because they make minimal assumptions about the underlying distribution of the data. While non-parametric tests relax the distributional assumptions, they still require the samples to be independent.

Permutation tests

For the first non-parametric test, let’s look more deeply at the definition of a p-value. A p-value is the probability of obtaining a test statistic at least as extreme as the observed value under the assumption of the null hypothesis. Then, to calculate a p-value, we need the null distribution and an observed statistic...
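This idea can be illustrated with a small simulation. The following sketch is not from the book; the samples, sizes, and variable names are hypothetical. It builds the null distribution by repeatedly shuffling the pooled data, which enforces the null hypothesis that group labels carry no information:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two illustrative samples (hypothetical data)
sample_a = rng.normal(loc=10.0, scale=2.0, size=30)
sample_b = rng.normal(loc=11.5, scale=2.0, size=30)

# Observed test statistic: difference in sample means
observed = sample_a.mean() - sample_b.mean()

# Build the null distribution by repeatedly shuffling the pooled data
pooled = np.concatenate([sample_a, sample_b])
n_a = len(sample_a)
n_permutations = 10_000
null_stats = np.empty(n_permutations)
for i in range(n_permutations):
    permuted = rng.permutation(pooled)
    null_stats[i] = permuted[:n_a].mean() - permuted[n_a:].mean()

# Two-sided p-value: fraction of permuted statistics at least as extreme
# as the observed statistic
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"observed difference = {observed:.3f}, p-value = {p_value:.4f}")
```

Because the p-value is computed directly from the empirical null distribution, no distributional assumptions about the samples are needed.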

The Rank-Sum test

When the assumptions of the t-test are not met, the Rank-Sum test (also known as the Mann-Whitney U test) is often a good non-parametric alternative. While the t-test is used to test for a difference between the means of two distributions, the Rank-Sum test is used to test for a difference between the locations of two distributions; this difference in utility is due to the lack of parametric assumptions in the Rank-Sum test. The null hypothesis of the Rank-Sum test is that the distribution underlying the first sample is the same as the distribution underlying the second sample. When the two sample distributions have a similar shape, the Rank-Sum test can be interpreted as a test for a difference in the locations of the two samples. Because it makes no assumptions about the form of the sample distributions, the Rank-Sum test cannot specifically be used to test for a difference between means.

The test statistic procedure

The test procedure is straightforward. The process is outlined here and an example...
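In practice, the ranking and statistic calculation are handled by scipy.stats.ranksums. A minimal sketch, using hypothetical skewed samples where a t-test's normality assumption would be questionable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative right-skewed samples (hypothetical data)
sample_a = rng.exponential(scale=1.0, size=40)
sample_b = rng.exponential(scale=1.8, size=40)

# Wilcoxon Rank-Sum test
# H0: both samples come from the same distribution
statistic, p_value = stats.ranksums(sample_a, sample_b)
print(f"statistic = {statistic:.3f}, p-value = {p_value:.4f}")
```

scipy.stats.mannwhitneyu provides the closely related Mann-Whitney U form of the test, including an exact method for small samples.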

The Signed-Rank test

The Wilcoxon Signed-Rank test is a non-parametric alternative to the paired t-test, used when the assumption of normality is violated. The test is robust to outliers because it uses ranks and medians instead of means in the null and alternative hypotheses. As the name indicates, it uses both the magnitudes of the differences between two stages (for example, before and after a treatment) and their signs.

The null hypothesis is that the median difference between stage 1 and stage 2 is zero. As with the paired t-test, the alternative hypothesis for a two-tailed test is that the median difference between stage 1 and stage 2 is not zero; for a one-tailed test, it is that the median difference is greater than (or less than) zero.

Though the normality requirement is relaxed, the test requires independence between paired observations and these observations to be from the same population. In addition, the dependent variable is...
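A minimal sketch of the test with scipy.stats.wilcoxon, using hypothetical paired measurements (the subjects, shift, and variable names are illustrative, not from the book):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical paired measurements: the same 25 subjects at two stages
stage1 = rng.normal(loc=100.0, scale=10.0, size=25)
stage2 = stage1 + rng.normal(loc=3.0, scale=5.0, size=25)  # shifted at stage 2

# Wilcoxon Signed-Rank test on the paired differences
# H0: the median difference between stages is zero (two-sided by default)
statistic, p_value = stats.wilcoxon(stage1, stage2)
print(f"statistic = {statistic:.1f}, p-value = {p_value:.4f}")
```

Passing both arrays makes scipy compute the differences internally; passing a single array of precomputed differences is equivalent.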

The Kruskal-Wallis test

Another non-parametric test we will now discuss is the Kruskal-Wallis test. It is an alternative to the one-way ANOVA test when the normality assumption is not satisfied. It uses the medians instead of the means to test whether there are statistically significant differences between two or more independent groups. Let us consider a generic example of three independent groups:

group1 = [8, 13, 13, 15, 12, 10, 6, 15, 13, 9]
group2 = [16, 17, 14, 14, 15, 12, 9, 12, 11, 9]
group3 = [7, 8, 9, 9, 4, 15, 13, 9, 11, 9]

The null and alternative hypotheses are stated as follows:

H0: The medians are equal among these three groups

Ha: The medians are not equal among these three groups

In Python, it is easy to implement by using the scipy.stats.kruskal function. The documentation can be found at the following link:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html

from scipy import stats

group1 = [8, 13, 13, 15, 12, 10, 6, 15, 13, 9]
group2 = [16, 17, 14, 14, 15, 12, 9, 12, 11, 9]
group3 = [7, 8, 9, 9, 4, 15, 13, 9, 11, 9]

statistic, p_value = stats.kruskal(group1, group2, group3)
print(f"statistic = {statistic:.4f}, p-value = {p_value:.4f}")

Chi-square distribution

Researchers are often faced with the need to test hypotheses on categorical data. The parametric tests covered in Chapter 4, Parametric Tests, are often not very helpful for this type of analysis. In the last chapter, we discussed using an F-test to compare sample variances. Extending that concept, we can consider the non-parametric, non-symmetric chi-square probability distribution, which is useful for comparing sample variances to a population variance: under the null hypothesis, the mean of the sampling distribution of sample variances is expected to equal the population variance. Because variance cannot be negative, the distribution starts at an origin of 0. Here, we can see the chi-square distribution:

Figure 5.5 – Chi-square distribution with seven degrees of freedom

The shape of the chi-square distribution does not represent an assumption that percentiles...
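As a quick illustration (not from the book), scipy.stats.chi2 exposes the distribution shown in Figure 5.5 and confirms its basic properties:

```python
from scipy import stats

# Chi-square distribution with seven degrees of freedom (as in Figure 5.5)
dist = stats.chi2(df=7)

# The mean equals the degrees of freedom and the variance equals
# twice the degrees of freedom
print(dist.mean(), dist.var())  # 7.0 14.0

# Support starts at 0: there is no density below the origin
print(dist.pdf(-1.0))  # 0.0
```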

Chi-square goodness-of-fit

The chi-square goodness-of-fit test compares the observed counts of occurrences across the levels of a single categorical variable (factor) to the counts expected under a hypothesized distribution. For example, a vendor offers three models of phones – three levels (models) of the single factor (phone) – to customers, who purchase an average of 90 phones per week in total. Under an assumption of equal popularity, the expected frequency for each model is 1/3 – so, 30 phones of each model are sold per week, on average. Pearson's chi-square test statistic, which is calculated by measuring the observed frequencies against the expected frequencies, is the test statistic used for the chi-square goodness-of-fit test. The equation for this test statistic is as follows:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ,  degrees of freedom = k − 1

Where Oᵢ is the observed frequency, Eᵢ is the expected frequency, and k is the number of factor...
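Continuing the phone-vendor example, the test is available as scipy.stats.chisquare. A sketch with hypothetical observed counts for one week (the counts are illustrative, not from the book):

```python
from scipy import stats

# Hypothetical weekly sales of the three phone models (90 phones in total)
observed = [35, 28, 27]

# Under H0, each model sells equally often: 30 per model
expected = [30, 30, 30]

# Pearson's chi-square goodness-of-fit test with k - 1 = 2 degrees of freedom
statistic, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"statistic = {statistic:.4f}, p-value = {p_value:.4f}")
```

Here the statistic is (5² + 2² + 3²) / 30 ≈ 1.267, which is small relative to a chi-square distribution with 2 degrees of freedom, so these counts alone would not be evidence against equal popularity.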

Chi-square test of independence

Suppose we have a dataset of observed vehicle crashes in the state of Texas in 2021, Restraint Use by Injury Severity and Seat Position (https://www.txdot.gov/data-maps/crash-reports-records/motor-vehicle-crash-statistics.html), and want to know whether using a seat belt resulted in a statistically significant difference in fatalities. We have the table of observed values as follows:

            Restrained    Unrestrained        Total
Fatal            1,429           1,235        2,664
Not Fatal    1,216,934          22,663    1,239,597
Total        1,218,363          23,898    1,242,261

Figure 5.6 – Chi-square...
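The observed counts in the table above can be tested for independence with scipy.stats.chi2_contingency, which computes the expected counts from the row and column totals. A minimal sketch:

```python
import numpy as np
from scipy import stats

# Observed counts from the Texas 2021 crash table
#                  restrained  unrestrained
observed = np.array([
    [1_429,          1_235],   # fatal
    [1_216_934,     22_663],   # not fatal
])

# H0: injury severity is independent of restraint use
# A 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p-value = {p_value:.3g}, dof = {dof}")
```

Note that, for 2x2 tables, scipy applies Yates' continuity correction by default; pass correction=False to disable it.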

Chi-square goodness-of-fit test power analysis

Let’s use an example where a phone vendor sells four popular models of phones: models A, B, C, and D. We want to determine how many samples are required to achieve a power of 0.8, so that we can detect a statistically significant difference in popularity between the models and the vendor can invest in phone acquisitions accordingly. In this case, the null hypothesis asserts that each model accounts for 25% of phones sold. In reality, 20% of phones sold were model A, 30% were model B, 19% were model C, and 31% were model D.

Testing different values for the nobs argument (the number of observations), we find that a minimum of 224 samples produces a power just greater than 0.801. Adding more samples only increases the power. If the true distribution diverged further from the hypothesized even split of 25%, fewer samples would be required. However, since the true proportions are relatively close to 25%, a high volume...
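A sketch of this calculation, assuming statsmodels' chisquare_effectsize and GofChisquarePower utilities (the proportions come from the example above):

```python
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower

# Hypothesized even split across the four models vs. the true proportions
p_null = [0.25, 0.25, 0.25, 0.25]
p_true = [0.20, 0.30, 0.19, 0.31]

# Cohen's w effect size between the two sets of proportions
effect_size = chisquare_effectsize(p_null, p_true)

# Power at nobs = 224 with alpha = 0.05; 4 bins give k - 1 = 3 df
analysis = GofChisquarePower()
power = analysis.power(effect_size=effect_size, nobs=224, alpha=0.05, n_bins=4)
print(f"effect size = {effect_size:.4f}, power at n = 224: {power:.4f}")
```

The same object's solve_power method can instead solve for nobs directly given a target power.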

Spearman’s rank correlation coefficient

In Chapter 4, Parametric Tests, we looked at the parametric correlation coefficient, Pearson’s correlation, where the coefficient is calculated from independently sampled, continuous data. However, when we have ranked, ordinal data, such as that from a satisfaction survey, Pearson’s correlation is not appropriate, because it measures linear association rather than the monotonic (order-preserving) association that is meaningful for ranks. As with Pearson’s correlation coefficient, Spearman’s correlation coefficient, r_s, ranges from -1 to 1, with -1 indicating a strong inverse correlation and 1 indicating a strong direct correlation. Spearman’s coefficient is derived by dividing the covariance of the two variables’ ranks by the product of the standard deviations of the ranks. The equation for the correlation coefficient, r_s, is as follows:

r_s = S_xy / √(S_xx S_yy)

Where

S_xy =...
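In practice, the ranking and the coefficient calculation are handled by scipy.stats.spearmanr. A minimal sketch with hypothetical ordinal survey responses (the data and variable names are illustrative, not from the book):

```python
from scipy import stats

# Hypothetical ordinal survey data: ratings from 10 respondents on two
# related questions (1 = very unsatisfied, 5 = very satisfied)
question_1 = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5]
question_2 = [1, 1, 2, 2, 4, 3, 4, 5, 4, 5]

# Spearman's rank correlation: Pearson's correlation of the ranks,
# with ties assigned their average rank
r_s, p_value = stats.spearmanr(question_1, question_2)
print(f"Spearman's r = {r_s:.3f}, p-value = {p_value:.4f}")
```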

Summary

In this chapter, we discussed some of the most commonly used non-parametric hypothesis tests, which are performed when the assumptions required for parametric hypothesis testing cannot be prudently guaranteed. We discussed the two-sample Wilcoxon Rank-Sum test – also called the Mann-Whitney U test – for drawing inferences from medians when two-sample t-testing cannot be performed. Next, we walked through the Wilcoxon Signed-Rank test’s paired comparison of medians, used when a paired t-test comparison of means cannot be performed. After that, we looked at the non-parametric chi-square goodness-of-fit test and the chi-square test of independence, which compare observed frequencies against expected frequencies; both are useful for identifying statistically significant differences in counts of categorical data. Additionally, we discussed the Kruskal-Wallis test, a non-parametric alternative to the analysis of variance (ANOVA). Finally, we discussed Spearman’s correlation coefficient...

