Mastering Python for Data Science: Explore the world of data science through Python and learn how to make sense of data

Samir Madhavan
eBook, Aug 2015, 294 pages, 1st Edition

Chapter 2. Inferential Statistics

Before getting into inferential statistics, let's look at what descriptive statistics is about.

Descriptive statistics is a term given to data analysis that summarizes data in a meaningful way such that patterns emerge from it. It is a simple way to describe data, but it does not help us reach a conclusion about a hypothesis that we have made. Let's say you have collected the heights of 1,000 people living in Hong Kong. The mean of their heights is a descriptive statistic, but it does not by itself tell us the average height of the whole of Hong Kong. Here, inferential statistics helps us estimate what the average height of the whole of Hong Kong would be, which is described in depth in this chapter.

Inferential statistics is all about describing the larger picture of the analysis with a limited set of data and deriving conclusions from it.

In this chapter, we will cover the following topics:

  • The...

Various forms of distribution


There are various kinds of probability distributions, and each distribution shows the probability of different outcomes for a random experiment. In this section, we'll explore the various kinds of probability distributions.

A normal distribution

A normal distribution is the most common and widely used distribution in statistics. It is also called a "bell curve" or a "Gaussian curve", after the mathematician Carl Friedrich Gauss. A normal distribution occurs commonly in nature. Let's take the height example we saw previously. If you have data for the height of all the people of a particular gender in Hong Kong, and you plot a bar chart where each bar represents the number of people at a particular height, then the curve that is obtained will look very similar to the following graph. The numbers in the plot are the number of standard deviations from the mean, which is zero. The concept will become clearer as we proceed through the chapter.
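
As a rough illustration of what "standard deviations from the mean" means for a normal curve, the sketch below computes how much of a standard normal distribution lies within 1, 2, and 3 standard deviations of the mean. It uses `statistics.NormalDist` from the Python standard library, which is a choice made here for self-containment and is not taken from the book; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of the distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

Roughly 68%, 95%, and 99.7% of the values fall inside those three bands, which is why the tails of the bell curve are so thin.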

Also, if you take an...

A z-score


A z-score, in simple terms, expresses how far a value in a distribution lies from the mean, measured in standard deviations. Let's take a look at the following formula that calculates the z-score:

z = (X - µ) / σ

Here, X is the value in the distribution, µ is the mean of the distribution, and σ is the standard deviation of the distribution.
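
The z-score formula can be sketched in a few lines of standard-library Python. The scores below are illustrative values, the function name `z_scores` is hypothetical, and the population standard deviation is used on the assumption that it should match the default behaviour of NumPy's `std()`.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for every x in values."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(x - mu) / sigma for x in values]


# Illustrative exam scores, not the full simulated class.
scores = [56, 52, 60, 65, 39, 49, 41, 51]
standardized = z_scores(scores)

for score, z in zip(scores, standardized):
    print(f"score {score}: z = {z:+.2f}")
```

After standardizing, the values have a mean of 0 and a standard deviation of 1, so a z-score of +2 always means "two standard deviations above the mean" regardless of the original scale.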

Let's try to understand this concept from the perspective of a school classroom.

A classroom has 60 students in it and they have just received their mathematics examination scores. We simulate the scores of these 60 students with a normal distribution using the following commands:

>>> import numpy as np
>>> classscore = np.random.normal(50, 10, 60).round()
>>> classscore

[ 56.  52.  60.  65.  39.  49.  41.  51.  48.  52.  47.  41.  60.  54.  41.
  46.  37.  50.  50.  55.  47.  53.  38.  42.  42.  57.  40.  45.  35.  39.
  67.  56.  35.  45.  47.  52.  48.  53.  53.  50.  61.  60.  57.  53.  56.
  68.  43.  35.  45.  42.  33.  43.  49.  54.  45.  54.  48.  55.  56.  30...

A p-value


A p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed. The null hypothesis is a statement that says that there is no difference between two measures. Suppose the hypothesis is that people who clock in 4 hours of study every day score more than 90 marks out of 100. The null hypothesis here would be that there is no relation between the number of hours clocked in and the marks scored.

If the p-value is equal to or less than the significance level (α), then the observed data is inconsistent with the null hypothesis, and the null hypothesis is rejected.

Let's understand this concept with an example where the null hypothesis is that it is common for students to score 68 marks in mathematics.

Let's define the significance level at 5%. If the p-value is less than 5%, then the null hypothesis is rejected and we conclude that it is not common to score 68 marks in mathematics.

Let's get the z-score of 68 marks:

>>> zscore = ( 68 - classscore.mean() ) / classscore.std()
>>> zscore
2.283

Now...
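
Since the rest of this example is cut off, here is a minimal sketch of how a z-score can be turned into a one-tailed p-value. It takes the z-score of 2.283 shown above as given and uses `statistics.NormalDist` from the standard library instead of SciPy; the function name `upper_tail_p` is hypothetical.

```python
from statistics import NormalDist


def upper_tail_p(z):
    """Probability that a standard normal value is at least as large as z."""
    return 1.0 - NormalDist().cdf(z)


zscore = 2.283   # z-score of 68 marks, as printed in the text
alpha = 0.05     # significance level of 5%

p_value = upper_tail_p(zscore)
reject_null = p_value <= alpha

print(f"p-value: {p_value:.4f}, reject null hypothesis: {reject_null}")
```

The upper tail is used because the question is whether a score as high as 68 is unusual; a question about unusually low scores would use the lower tail instead.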

One-tailed and two-tailed tests


The example in the previous section was an instance of a one-tailed test, where the null hypothesis is rejected or not rejected based on one direction of the normal distribution.

In a two-tailed test, both tails of the distribution are used to test the hypothesis.

In a two-tailed test, when a significance level of 5% is used, it is distributed equally in both directions, that is, 2.5% of it in one direction and 2.5% in the other direction.

Let's understand this with an example. The mean score of the mathematics exam at a national level is 60 marks and the standard deviation is 3 marks.

The mean mark of a class is 53. The null hypothesis is that the mean mark of the class is similar to the national average. Let's test this hypothesis by first getting the z-score of the class mean:

>>> zscore = ( 53 - 60 ) / 3.0
>>> zscore
-2.3333333333333335

The p-value for the lower tail would be:

>>> from scipy import stats
>>> prob = stats.norm.cdf(zscore)
>>> prob

0.0098153286286453336...
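
The two-tailed decision can be sketched as follows, using the numbers from the example and `statistics.NormalDist` from the standard library rather than `stats.norm`. The variable names are hypothetical, and doubling the single-tail probability is one common way to express the two-tailed p-value; comparing the single-tail probability against 2.5% gives the same decision.

```python
from statistics import NormalDist

national_mean = 60.0   # national mean score, from the example
national_std = 3.0     # national standard deviation, from the example
class_mean = 53.0      # mean score of the class, from the example
alpha = 0.05           # total significance level, split across both tails

z = (class_mean - national_mean) / national_std
lower_tail = NormalDist().cdf(z)               # P(Z <= z)
two_tailed_p = 2 * NormalDist().cdf(-abs(z))   # probability in both tails

reject_null = two_tailed_p <= alpha
print(f"z = {z:.4f}, lower tail = {lower_tail:.6f}, "
      f"two-tailed p = {two_tailed_p:.6f}, reject: {reject_null}")
```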

Type 1 and Type 2 errors


A Type 1 error occurs when the null hypothesis is rejected even though it is actually true. This kind of error is also called an error of the first kind and is equivalent to a false positive.

Let's understand this concept using an example. There is a new drug that is being developed and it needs to be tested on whether it is effective in combating diseases. The null hypothesis is that it is not effective in combating diseases.

The significance level is kept at 5%, so when the null hypothesis is true, it is correctly retained 95% of the time. However, 5% of the time, we'll reject the null hypothesis although it should have been retained, which means that even though the drug is ineffective, it is assumed to be effective.

The Type 1 error is controlled by controlling the significance level, which is alpha. Alpha is the maximum probability of making a Type 1 error that we are willing to accept. The lower the alpha, the lower the probability of a Type 1 error.
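
The link between alpha and the Type 1 error rate can be checked with a small simulation. This is only a sketch under stated assumptions: every simulated experiment draws data for which the null hypothesis is true by construction (mean 0, known standard deviation 1), a two-tailed z-test is applied, and the function name, sample size, and trial count are all hypothetical choices.

```python
import random
from statistics import NormalDist, fmean


def type1_error_rate(alpha, trials=2000, n=30, seed=0):
    """Fraction of simulated experiments that wrongly reject a true null hypothesis."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff
    rejections = 0
    for _ in range(trials):
        # The null hypothesis holds: the true mean is 0 and the true std is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / n ** 0.5)  # z-score of the sample mean
        if abs(z) > critical_z:
            rejections += 1
    return rejections / trials


rate = type1_error_rate(alpha=0.05)
stricter_rate = type1_error_rate(alpha=0.01)
print(f"alpha 0.05 -> {rate:.3f}, alpha 0.01 -> {stricter_rate:.3f}")
```

The observed rejection rate should land close to the chosen alpha, and lowering alpha lowers the share of false positives.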

The Type 2 error is the kind...

A confidence interval


A confidence interval is an interval estimate of a population parameter. It gives a range of values within which the population mean is expected to lie with a stated level of confidence.

Let's try to understand this concept by using an example. Suppose we want to estimate the average height of Kenyan men at a national level with a 95% confidence interval.

Let's take a sample of 50 men and their heights in centimeters:

>>> height_data = np.array([ 186.0, 180.0, 195.0, 189.0, 191.0, 177.0, 161.0, 177.0, 192.0, 182.0, 185.0, 192.0,
  173.0, 172.0, 191.0, 184.0, 193.0, 182.0, 190.0, 185.0, 181.0, 188.0, 179.0, 188.0,
  170.0, 179.0, 180.0, 189.0, 188.0, 185.0, 170.0, 197.0, 187.0, 182.0, 173.0, 179.0,
  184.0, 177.0, 190.0, 174.0, 203.0, 206.0, 173.0, 169.0, 178.0, 201.0, 198.0, 166.0,
  171.0, 180.0])

Plotting the data shows that it is approximately normally distributed:

>>> import matplotlib.pyplot as plt
>>> plt.hist(height_data, 30, density=True)
>>> plt.show()

The mean of the...
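
Because the remainder of this example is truncated, the following is only a sketch of one common way to build a 95% confidence interval for the mean from the 50 heights above. It assumes a normal approximation (critical value of about 1.96) and uses the standard library instead of SciPy, so it may differ from the approach the book takes; all variable names other than `height_data` are hypothetical.

```python
from statistics import NormalDist, mean, stdev

height_data = [
    186.0, 180.0, 195.0, 189.0, 191.0, 177.0, 161.0, 177.0, 192.0, 182.0,
    185.0, 192.0, 173.0, 172.0, 191.0, 184.0, 193.0, 182.0, 190.0, 185.0,
    181.0, 188.0, 179.0, 188.0, 170.0, 179.0, 180.0, 189.0, 188.0, 185.0,
    170.0, 197.0, 187.0, 182.0, 173.0, 179.0, 184.0, 177.0, 190.0, 174.0,
    203.0, 206.0, 173.0, 169.0, 178.0, 201.0, 198.0, 166.0, 171.0, 180.0,
]

n = len(height_data)
sample_mean = mean(height_data)
# Standard error of the mean, using the sample standard deviation (n - 1).
standard_error = stdev(height_data) / n ** 0.5
# Critical value that leaves 2.5% in each tail of the standard normal.
z_critical = NormalDist().inv_cdf(0.975)

lower = sample_mean - z_critical * standard_error
upper = sample_mean + z_critical * standard_error
print(f"mean = {sample_mean:.2f} cm, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With a sample this small, a t-based critical value would give a slightly wider interval than the normal approximation used here.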

Correlation


In statistics, correlation measures the strength of the relationship between two random variables. The most commonly used correlation is the Pearson correlation and it is defined by the following:

ρ(X, Y) = cov(X, Y) / (σX σY) = E[(X - µX)(Y - µY)] / (σX σY)

The preceding formula defines the Pearson correlation as the covariance between X and Y divided by the product of the standard deviations of X and Y. Equivalently, it is the expected value of the product of the deviations of X and Y from their respective means, divided by the product of the standard deviations of X and Y. Let's understand this with an example. Let's take the mileage and horsepower of various cars and see if there is a relation between the two. This can be achieved using the pearsonr function in the SciPy package:

>>> mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4,
       33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]
>>> hp = [110, 110, 93, 110, 175, 105, 245, 62, 95, 123...
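
To make the definition concrete, here is a hand-written Pearson correlation in standard-library Python. It is a sketch of the formula rather than a replacement for SciPy's `pearsonr`, the function name `pearson_r` is hypothetical, and the two short lists are made-up illustrative values, not the car data from the text (whose horsepower list is truncated).

```python
from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x, mean_y = fmean(x), fmean(y)
    # The 1/n factors of the covariance and the variances cancel out.
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return co_deviation / sqrt(spread_x * spread_y)


# Made-up illustrative values: more engine power, lower mileage.
engine_power = [60, 90, 110, 150, 200]
mileage = [33.0, 27.5, 22.0, 17.5, 14.0]

r = pearson_r(engine_power, mileage)
print(f"Pearson correlation: {r:.3f}")
```

A value near -1 indicates a strong negative linear relationship, a value near +1 a strong positive one, and a value near 0 little or no linear relationship.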

Z-test vs T-test


We have already done a few Z-tests before where we validated our null hypothesis.

A T-distribution is similar to a Z-distribution: it is centered at zero and has a basic bell shape, but it's shorter and flatter around the center than the Z-distribution.

The T-distribution's standard deviation is usually proportionally larger than that of the Z-distribution, which is why you see fatter tails on each side.

The T-distribution is usually used to analyze the population when the sample is small.

The Z-test is used to compare the population mean against a sample or compare the population mean of two distributions with a sample size greater than 30. An example of a Z-test would be comparing the heights of men from different ethnicity groups.

The T-test is used to compare the population mean against a sample, or compare the population mean of two distributions with a sample size less than 30, and when you don't know the population's standard deviation.

Let's do a T-test on two classes that are given...
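
The book's own T-test example is cut off here, so the following is only a sketch of the underlying calculation: a two-sample t statistic with pooled variance, which assumes both groups have equal variance. The class scores and the function name `two_sample_t` are hypothetical. The standard library has no t-distribution, so the p-value step is left out; SciPy's `scipy.stats.ttest_ind` performs the full test.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(a, b):
    """Return (t statistic, degrees of freedom) for two independent samples."""
    n_a, n_b = len(a), len(b)
    degrees_of_freedom = n_a + n_b - 2
    # Pooled estimate of the common variance (equal-variance assumption).
    pooled_variance = (
        (n_a - 1) * variance(a) + (n_b - 1) * variance(b)
    ) / degrees_of_freedom
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, degrees_of_freedom


# Hypothetical scores for two small classes (fewer than 30 students each).
class_a = [45, 38, 52, 48, 25, 39, 51, 46, 55, 46]
class_b = [34, 22, 15, 27, 37, 41, 24, 19, 26, 36]

t_stat, dof = two_sample_t(class_a, class_b)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The larger the absolute value of t relative to the critical value for the given degrees of freedom, the stronger the evidence that the two class means differ.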


Description

Data science is a relatively new knowledge domain which is used by various organizations to make data-driven decisions. Data scientists have to wear various hats to work with data and to derive value from it. The Python programming language, beyond having conquered the scientific community in the last decade, is now an indispensable tool for the data science practitioner and a must-know tool for every aspiring data scientist. Using Python will offer you a fast, reliable, cross-platform, and mature environment for data analysis, machine learning, and algorithmic problem solving. This comprehensive guide helps you move beyond the hype and transcend the theory by providing you with a hands-on, advanced study of data science. Beginning with the essentials of Python in data science, you will learn to manage data and perform linear algebra in Python. You will move on to deriving inferences from the analysis by performing inferential statistics, and mining data to reveal hidden patterns and trends. You will use the Matplotlib library to create high-end visualizations in Python and uncover the fundamentals of machine learning. Next, you will apply the linear regression technique and also learn to apply the logistic regression technique to your applications, before creating recommendation engines with various collaborative filtering algorithms and improving your predictions by applying the ensemble methods. Finally, you will perform K-means clustering, along with an analysis of unstructured data with different text mining techniques, and leverage the power of Python in big data analytics.

Who is this book for?

If you are a Python developer who wants to master the world of data science then this book is for you. Some knowledge of data science is assumed.

What you will learn

  • Manage data and perform linear algebra in Python
  • Derive inferences from the analysis by performing inferential statistics
  • Solve data science problems in Python
  • Create high-end visualizations using Python
  • Evaluate and apply the linear regression technique to estimate the relationships among variables
  • Build recommendation engines with the various collaborative filtering algorithms
  • Apply the ensemble methods to improve your predictions
  • Work with big data technologies to handle data at scale

Product Details

Publication date : Aug 31, 2015
Length: 294 pages
Edition : 1st
Language : English
ISBN-13 : 9781784392628

Table of Contents

13 Chapters
1. Getting Started with Raw Data
2. Inferential Statistics
3. Finding a Needle in a Haystack
4. Making Sense of Data through Advanced Visualization
5. Uncovering Machine Learning
6. Performing Predictions with a Linear Regression
7. Estimating the Likelihood of Events
8. Generating Recommendations with Collaborative Filtering
9. Pushing Boundaries with Ensemble Models
10. Applying Segmentation with k-means Clustering
11. Analyzing Unstructured Data with Text Mining
12. Leveraging Python in the World of Big Data
Index

Customer reviews

Rating: 3.6 out of 5 (10 Ratings)
Rating distribution:
5 star 40%
4 star 30%
3 star 0%
2 star 10%
1 star 20%

ruben, Oct 13, 2015 (5 out of 5 stars)
Hello I would like to recommend this book I like this book because its content is about Python with appliations in Science and has very interesting programs that we can develop using this language. It explains since the beginning to the most interesting projects. in order to apply them.Beginning with the essentials of Python in data science, you will learn to manage data and perform linear algebra in Python.
Amazon verified review

Arunkumar S, Jan 25, 2019 (5 out of 5 stars)
Good book with right content
Amazon verified review

Natester, Oct 12, 2015 (5 out of 5 stars)
Madhavan's book has proven useful for some of the projects I'm working on.The first chapter includes a brief primer on Numpy and Pandas--useful for someone that is new to the Python ecosystem, but assuming you are already familiar with those packages, it should be okay to skip to the second chapter. The second chapter includes some Python statistical examples that I have not seen in other texts, but are important when looking at different types of distributions. These distribution examples and explanations are a must-have in my collection of Python recipes. There are also data visualization tweaks that I've not seen in other Data Science + Python texts.The book also provides an intro to some of the canonical machine learning algorithms (Chapter 5). These examples are great for becoming familiarized with some of the ML algorithms out there without being overwhelmed by all the other algorithms out there.If you are looking for a good primer on Data Science with Python, this is a good book. I'm using the book as a reference more than a primer and the book is also useful.
Amazon verified review

jamie, May 28, 2016 (5 out of 5 stars)
Good
Amazon verified review

Jonathan Brett Crawley, Oct 09, 2015 (4 out of 5 stars)
The pace of the book is quite quick, so you will be up to speed in no time. The book gives a nice introduction to the algorithms used in data science, explained well and backed up with source code examples of how to implement them in the Python language. My only criticism would be that there are a number of grammatical errors in the text but they do not obstruct the reader from understanding the material. Overall a good beginners book for getting to know the world of data science
Amazon verified review