You're reading from Learning Predictive Analytics with Python

Product type: Book
Published in: Feb 2016
Reading level: Intermediate
ISBN-13: 9781783983261
Edition: 1st Edition

Authors (2): Ashish Kumar

Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around seven years of experience implementing and deploying data science and machine learning solutions for challenging industry problems, in both hands-on and leadership roles. His core areas of expertise include Natural Language Processing, IoT analytics, R Shiny product development, and ensemble ML methods. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the nearest hip beach and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.
Chapter 4. Statistical Concepts for Predictive Modelling

A few statistical concepts, such as hypothesis testing, p-values, the normal distribution, and correlation, are essential for grasping the ideas behind predictive models and for interpreting their results. It is therefore critical to understand these concepts before we delve into the realm of predictive modelling.

In this chapter, we will go through and learn these statistical concepts so that we can use them in the upcoming chapters. This chapter will cover the following topics:

  • Random sampling and the central limit theorem: Understanding the concept of random sampling and illustrating the application of the central limit theorem, both through examples. These two concepts form the backbone of hypothesis testing.

  • Hypothesis testing: Understanding the meaning of terms such as null hypothesis, alternate hypothesis, confidence intervals, p-value, significance level, and so on. A step-by-step...

Random sampling and the central limit theorem


Let's try to understand these two important statistical concepts using an example. Suppose one wants to find the average age of the population of an Indian state, let's say Tamil Nadu. The safest, brute-force way of doing this would be to gather the age of every citizen of Tamil Nadu and calculate the average of all these ages. But going to each citizen and asking their age, or collecting it by some other method, would take a lot of infrastructure and time. It is such a humongous task that the census, which attempts to do just that, happens only once a decade. What would you do if you needed this figure in a non-census year?

Statisticians face such issues all the time. The answer lies in random sampling. Random sampling means that you take a randomly chosen group of, say, 1,000 individuals (or 10,000, depending on your capacity; obviously, the more the merrier) and calculate the average age for this group. You call this A1. Getting to this is easier, as 1,000 or 10,000 is...
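To make this concrete, here is a minimal sketch in Python of random sampling and the central limit theorem. The "population" of ages is simulated here (a hypothetical stand-in for the real ages of every citizen), and the sample size and number of repeated samples are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical, skewed "population" of one million ages, standing in
# for the real ages of every citizen of the state
population = rng.gamma(shape=2.0, scale=15.0, size=1_000_000)
print("True population mean age:", population.mean())

# Draw many random samples of 1,000 people and record each sample mean
sample_means = [rng.choice(population, size=1000, replace=False).mean()
                for _ in range(1000)]

# The sample means cluster tightly around the true mean and are roughly
# normally distributed, which is what the central limit theorem predicts
print("Mean of the sample means:", np.mean(sample_means))
print("Standard deviation of the sample means:", np.std(sample_means))
```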

Hypothesis testing


The concept we just discussed in the preceding section is used for a very important technique in statistics, called hypothesis testing. In hypothesis testing, we assume a hypothesis (generally related to the value of the estimator), called the null hypothesis, and try to see whether it holds true or not by applying the rules of a normal distribution. We have another hypothesis, called the alternate hypothesis.

Null versus alternate hypothesis

There is a catch in deciding what will be the null hypothesis and what will be the alternate hypothesis. The null hypothesis is the initial premise, or something that we assume to be true as yet. The alternate hypothesis is something we aren't sure about and are proposing as an alternate premise (usually contradictory to the null hypothesis), which might or might not be true.

So, when someone is doing quantitative research to calibrate the value of an estimator, the known value of the parameter is taken as the null hypothesis while the new...
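As an illustration, here is a minimal sketch of such a test in Python using scipy. The sample of ages is made up, and the null-hypothesis value of 35 years and the 0.05 significance level are arbitrary assumptions for this example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical random sample of 1,000 ages
sample = rng.normal(loc=36.5, scale=10.0, size=1000)

# Null hypothesis H0: the population mean age is 35
# Alternate hypothesis H1: the population mean age is not 35
t_stat, p_value = stats.ttest_1samp(sample, popmean=35)
print("Test statistic:", t_stat)
print("p-value:", p_value)

# Compare the p-value with the chosen significance level (0.05 here)
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```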

Chi-square tests


The chi-square test is a statistical test commonly used to compare observed data with the data expected under a certain hypothesis. In a sense, this is also a hypothesis test: you assume a hypothesis that your data should follow and calculate the expected data according to that hypothesis. You already have the observed data. You then calculate the deviation between the observed and expected data using the statistic defined in the following formula:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed value and E is the expected value, and the summation runs over all the data points.
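For instance, here is a quick sketch of computing this statistic for made-up counts of 120 rolls of a die that we hypothesize to be fair; the counts are invented purely for illustration:

```python
import numpy as np
from scipy import stats

observed = np.array([18, 22, 16, 25, 20, 19])   # hypothetical roll counts
expected = np.full(6, observed.sum() / 6)        # fair-die expectation: 20 each

# Direct application of chi-square = sum((O - E)**2 / E)
chi_square = ((observed - expected) ** 2 / expected).sum()
print("Chi-square statistic:", chi_square)

# scipy computes the same statistic together with the associated p-value
stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print("scipy statistic:", stat, "p-value:", p_value)
```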

The chi-square test can be used to do the following things:

  • Test for dependence or independence between an input variable and an output variable. We assume that they are independent (the null hypothesis) and calculate the expected values under that assumption. Then we calculate the chi-square value. If the null hypothesis is rejected, it suggests a relationship between the two variables. The relationship is not just by chance but statistically proven...
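As a sketch of the independence use case described in the bullet above, here is how the test could be run in Python with scipy; the contingency table of an input variable against an output variable is entirely made up:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are two groups of an input
# variable, columns are the two outcomes of an output variable
observed_table = np.array([[120,  80],
                           [ 90, 110]])

chi2, p_value, dof, expected = chi2_contingency(observed_table)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
print("Expected counts under independence:\n", expected)

# A p-value below the significance level (say, 0.05) would lead us to
# reject the null hypothesis of independence
```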

Correlation


Another very basic and important statistical idea for finding a relationship between two variables is correlation. In a way, one can say that the concept of correlation is the premise of predictive modelling: correlation is the factor we rely on when we say that we can predict outcomes.

A good correlation between two variables suggests that there is some sort of dependence between them: if one is changed, the change will be reflected in the other as well. One can say that a good correlation certifies a mathematical relationship between the two variables, and because of this mathematical relationship we might be able to predict outcomes. This mathematical relationship can be anything. If x and y are two variables that are correlated, then one can write:

$$y = f(x)$$

If f is a linear function, then x and y are linearly correlated; if f is an exponential function, then x and y are exponentially correlated.

The degree of correlation between the two variables x and y is quantified...
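As a small illustration, here is a sketch of quantifying correlation in Python; the data is synthetic, with y generated as a noisy linear function of x:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 3.0 * x + rng.normal(scale=0.5, size=500)   # y = f(x) + noise, f linear

# Pearson correlation coefficient; it lies between -1 and 1
r = np.corrcoef(x, y)[0, 1]
print("Correlation between x and y:", r)
```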

Summary


In this chapter, we skimmed through the basic concepts of statistics. Here is a brief summary of the concepts we learned:

  • Hypothesis testing is used to test the statistical significance of a hypothesis. The hypothesis that already exists or is assumed to be true is the null hypothesis; the one that we are not sure about and that is being proposed as an alternate premise is the alternate hypothesis.

  • One needs to calculate a statistic and the associated p-value to conduct the test.

  • Hypothesis testing (p-values) is used to test the significance of the estimates of the coefficients calculated by the model.

  • The chi-square test is used to test for a relationship (dependence) between a predictor and an output variable. It can also be used to check whether observed data fits an assumed distribution, for example whether the data is fair or has been fabricated.

  • The correlation coefficient can range from -1 to 1. The closer it is to the extremes, the stronger the relationship between the two variables.

Linear regression is part of the family of algorithms called supervised algorithms as the...
