Applying Math with Python - Second Edition


Product type Book
Published in Dec 2022
Publisher Packt
ISBN-13 9781804618370
Pages 376 pages
Edition 2nd Edition
Author: Sam Morley

Table of Contents (13 chapters)

  • Preface
  • Chapter 1: An Introduction to Basic Packages, Functions, and Concepts
  • Chapter 2: Mathematical Plotting with Matplotlib
  • Chapter 3: Calculus and Differential Equations
  • Chapter 4: Working with Randomness and Probability
  • Chapter 5: Working with Trees and Networks
  • Chapter 6: Working with Data and Statistics
  • Chapter 7: Using Regression and Forecasting
  • Chapter 8: Geometric Problems
  • Chapter 9: Finding Optimal Solutions
  • Chapter 10: Improving Your Productivity
  • Index
  • Other Books You May Enjoy

Working with Data and Statistics

One of the most attractive features of Python for people who need to analyze data is its huge ecosystem of data manipulation and analysis packages, along with an active community of data scientists working in Python. Python is easy to use while also offering powerful, fast libraries, which enables even relatively novice programmers to process vast sets of data quickly and easily. At the heart of many data science packages and tools is the pandas library. pandas provides two data container types that build on top of NumPy arrays and have good support for labels other than simple integers. These containers make working with large sets of data extremely easy.

Data and statistics are part of everything in the modern world. Python is leading the charge in trying to make sense of the vast quantity of data produced every day, and usually, this all starts with pandas – Python’s basic library for working with data. First, we...

What is statistics?

Statistics is the systematic study of data using mathematical – specifically, probability – theory. There are two major aspects to statistics. The first aspect of statistics is summarizing data. This is where we find numerical values that describe a set of data, including characteristics such as the center (mean or median) and spread (standard deviation or variance) of the data. These values are called descriptive statistics. What we’re doing here is fitting a probability distribution that describes the likelihood of a particular characteristic appearing in a population. Here, a population simply means a complete set of measurements of a particular characteristic – for example, the height of every person currently alive on Earth.

The second – and arguably more important – aspect of statistics is inference. Here, we try to estimate the distribution of data describing a population by computing numerical values on a relatively...

Technical requirements

For this chapter, we will mostly make use of the pandas library for data manipulation, which provides R-like data structures, such as Series and DataFrame objects, for storing, organizing, and manipulating data. We will also use the Bokeh data visualization library in the final recipe of this chapter. These libraries can be installed using your favorite package manager, such as pip:

python3.10 -m pip install pandas bokeh

We will also make use of the NumPy and SciPy packages.

The code for this chapter can be found in the Chapter 06 folder of this book’s GitHub repository at https://github.com/PacktPublishing/Applying-Math-with-Python-2nd-Edition/tree/main/Chapter%2006.

Creating Series and DataFrame objects

Most data handling in Python is done using the pandas library, which builds on NumPy to provide R-like structures for holding data. These structures allow the easy indexing of rows and columns, using strings or other Python objects besides just integers. Once data is loaded into a pandas DataFrame or Series, it can be easily manipulated, just as if it were in a spreadsheet. This makes Python, when combined with pandas, a powerful tool for processing and analyzing data.

In this recipe, we will see how to create new pandas Series and DataFrame objects and access items from them.

Getting ready

For this recipe, we will import the pandas library as pd using the following command:

import pandas as pd

We will also import the NumPy package as np. In addition, we must create a (seeded) random number generator from NumPy, as follows:

from numpy.random import default_rng
rng = default_rng(12345)

How to do it...

The following steps outline how to create Series...
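The steps themselves are truncated here, but a minimal sketch of creating and accessing these objects (with illustrative column names) might look like the following:

```python
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)

# A Series is a one-dimensional container of labeled values
diffs = pd.Series(rng.normal(0.0, 1.0, size=100))

# A DataFrame is a two-dimensional table with labeled columns
df = pd.DataFrame({
    "values": rng.normal(0.0, 1.0, size=100),
    "counts": rng.integers(0, 10, size=100),
})

print(df.head())           # first five rows
print(df["counts"].sum())  # access a column by its label
```

Columns are selected by label rather than by integer position, which is what makes these containers behave like spreadsheet tables.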

Loading and storing data from a DataFrame

It is fairly unusual to create a DataFrame object from raw data typed into a Python session. In practice, the data will usually come from an external source, such as an existing spreadsheet or CSV file, a database, or an API endpoint. For this reason, pandas provides numerous utilities for loading data from and storing data to files. Out of the box, pandas supports the CSV, Excel (.xls or .xlsx), JSON, SQL, Parquet, and Google BigQuery formats. This makes it very easy to import your data into pandas and then manipulate and analyze it using Python.

In this recipe, we will learn how to load and store data in a CSV file. The instructions will be similar for loading and storing data in other file formats.

Getting ready

For this recipe, we will need to import the pandas package under the pd alias and the NumPy library as np. We must also create a default random number generator from NumPy using the following commands:

from numpy.random...
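The recipe's steps are truncated here, but the round trip to and from a CSV file can be sketched as follows (the filename is illustrative):

```python
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)

df = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.integers(0, 5, size=50),
})

# Store the DataFrame in a CSV file; index=False omits the row labels
df.to_csv("sample.csv", index=False)

# Load the file back into a new DataFrame
loaded = pd.read_csv("sample.csv")
print(loaded.head())
```

The analogous pairs for other formats (to_excel/read_excel, to_json/read_json, and so on) follow the same pattern.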

Manipulating data in DataFrames

Once we have data in a DataFrame, we often need to apply some simple transformations or filters to the data before we can perform any analysis. This could include, for example, filtering the rows that are missing data or applying a function to individual columns.

In this recipe, we will learn how to perform some basic manipulation of DataFrame objects to prepare the data for analysis.

Getting ready

For this recipe, we will need the pandas package imported under the pd alias, the NumPy package imported under the np alias, and a default random number generator object from NumPy to be created using the following commands:

from numpy.random import default_rng
rng = default_rng(12345)

Let’s learn how to perform some simple manipulations on data in a DataFrame.

How to do it...

The following steps illustrate how to perform some basic filtering and manipulations on a pandas DataFrame:

  1. First, we will create a sample DataFrame...
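Since the remaining steps are locked, here is a minimal sketch of the kinds of manipulations the recipe describes, using an illustrative sample DataFrame:

```python
import numpy as np
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)

df = pd.DataFrame({
    "one": rng.normal(size=6),
    "two": rng.normal(size=6),
})
df.loc[2, "two"] = np.nan  # introduce a missing value

# Filter rows using a condition on a column
positive = df[df["one"] > 0.0]

# Fill missing values, then apply a function to a single column
df["two"] = df["two"].fillna(0.0)
df["squared"] = df["one"].apply(lambda x: x**2)
```

Boolean indexing, fillna, and apply cover a large share of routine preprocessing before any analysis begins.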

Plotting data from a DataFrame

As with many mathematical problems, one of the first steps in formulating a strategy is to find some way to visualize the problem and all the available information. For data-based problems, this usually means producing a plot of the data and visually inspecting it for trends, patterns, and underlying structure. Since this is such a common operation, pandas provides a quick and simple interface for plotting data in various forms, using Matplotlib under the hood by default, directly from a Series or DataFrame.

In this recipe, we will learn how to plot data directly from a DataFrame or Series to understand the underlying trends and structure.

Getting ready

For this recipe, we will need the pandas library imported as pd, the NumPy library imported as np, the Matplotlib pyplot module imported as plt, and a default random number generator instance created using the following commands:

from numpy.random import default_rng
rng = default_rng(12345)

How...
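The steps are truncated here, but plotting directly from a Series or DataFrame can be sketched like this (the non-interactive Agg backend is used so the script also runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # render to file rather than a window
import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)

# A random walk makes the trend structure easy to see
df = pd.DataFrame({"values": rng.standard_normal(100).cumsum()})

# pandas plots directly onto a Matplotlib Axes object
ax = df["values"].plot(title="Random walk")
ax.set_xlabel("step")
plt.savefig("walk.png")
```

The returned Axes object can be customized with the full Matplotlib API before saving or showing the figure.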

Getting descriptive statistics from a DataFrame

Descriptive statistics, or summary statistics, are simple values associated with a set of data, such as the mean, median, standard deviation, minimum, maximum, and quartile values. These values describe the location and spread of a dataset in various ways. The mean and median are measures of the center (location) of the data, and the other values measure the spread of the data from the mean and median. These statistics are vital for understanding a dataset and form the basis for many techniques for analysis.

In this recipe, we will learn how to generate descriptive statistics for each column in a DataFrame.

Getting ready

For this recipe, we need the pandas package imported as pd, the NumPy package imported as np, the Matplotlib pyplot module imported as plt, and a default random number generator created using the following commands:

from numpy.random import default_rng
rng = default_rng(12345)

How to do it...

The following...
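Although the steps are truncated here, the core of the recipe is the describe method, sketched below on an illustrative DataFrame:

```python
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)

df = pd.DataFrame({
    "uniform": rng.uniform(0.0, 1.0, size=1000),
    "normal": rng.normal(0.0, 1.0, size=1000),
})

# describe() computes count, mean, std, min, quartiles, and max per column
summary = df.describe()
print(summary)

# Individual statistics are also available directly
print(df["normal"].mean(), df["normal"].std())
```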

Understanding a population using sampling

One of the central problems in statistics is to make estimations – and quantify how good these estimations are – of the distribution of an entire population given only a small (random) sample. A classic example is to estimate the average height of all the people in a country by measuring the height of a randomly selected sample of people. These kinds of problems are particularly interesting when the true population distribution (by which we usually mean the mean and standard deviation of the whole population) cannot feasibly be measured. In this case, we must rely on our knowledge of statistics and a (usually much smaller) randomly selected sample to estimate the true population mean and standard deviation, and also to quantify how good our estimations are. It is the latter that is the source of confusion, misunderstanding, and misrepresentation of statistics in the wider world.

In this recipe, we will learn how to estimate the population mean and...
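The recipe itself is truncated here, but the idea can be sketched with a simulated population (the numbers below are illustrative, not from the book):

```python
import numpy as np
from numpy.random import default_rng
from scipy import stats

rng = default_rng(12345)

# Simulated heights (cm); in practice the full population is unobservable
population = rng.normal(170.0, 8.0, size=100_000)

# Draw a small random sample and estimate the population parameters
sample = rng.choice(population, size=50, replace=False)
mean = np.mean(sample)
sem = stats.sem(sample)  # standard error of the sample mean

# 95% confidence interval for the population mean, via the t distribution
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"mean estimate {mean:.1f}, 95% CI ({low:.1f}, {high:.1f})")
```

The confidence interval is exactly the "how good is the estimate" quantification the chapter text refers to.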

Performing operations on grouped data in a DataFrame

One of the great features of pandas DataFrames is the ability to group the data by the values in particular columns. For example, we might group assembly line data by the line ID and the shift ID. The ability to operate on this grouped data ergonomically is very important since data is often aggregated for analysis but needs to be grouped for preprocessing.

In this recipe, we will learn how to perform operations on grouped data in a DataFrame. We’ll also take the opportunity to show how to operate on rolling windows of (grouped) data.

Getting ready

For this recipe, we will need the NumPy library imported as np, the Matplotlib pyplot interface imported as plt, and the pandas library imported as pd. We’ll also need an instance of the default random number generator created as follows:

rng = np.random.default_rng(12345)

Before we start, we also need to set up the Matplotlib plotting settings to change the...
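The plotting setup and steps are truncated here, but the grouped and rolling operations can be sketched on illustrative assembly line data:

```python
import pandas as pd
from numpy.random import default_rng

rng = default_rng(12345)

df = pd.DataFrame({
    "line": rng.choice(["A", "B"], size=100),
    "shift": rng.choice([1, 2], size=100),
    "output": rng.normal(50.0, 5.0, size=100),
})

# Mean output for each (line, shift) combination
means = df.groupby(["line", "shift"])["output"].mean()
print(means)

# Rolling mean over a window of 5 observations, computed within each group
rolling = df.groupby("line")["output"].rolling(5).mean()
```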

Testing hypotheses using t-tests

One of the most common tasks in statistics is to test the validity of a hypothesis about the mean of a normally distributed population, given that you have collected sample data from that population. For example, in quality control, we might wish to test that the thickness of a sheet of paper produced at a mill is 2 mm. To test this, we can randomly select sample sheets and measure the thickness to obtain our sample data. Then, we can use a t-test to test our null hypothesis, H0, that the mean paper thickness is 2 mm, against the alternative hypothesis, H1, that the mean paper thickness is not 2 mm. We can use the SciPy stats module to compute a t statistic and a p-value. If the p-value is below 0.05, then we reject the null hypothesis in favor of the alternative hypothesis at the 5% significance level (95% confidence). If the p-value is larger than 0.05, then we fail to reject the null hypothesis.

In this recipe, we will learn how to use a t-test to test whether the assumed...
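The recipe is truncated here, but the paper-thickness test described above can be sketched with simulated measurements:

```python
from numpy.random import default_rng
from scipy import stats

rng = default_rng(12345)

# Simulated thickness measurements (mm) of 30 randomly chosen sheets
sample = rng.normal(2.0, 0.05, size=30)

# Two-sided one-sample t-test against the hypothesized mean of 2 mm
result = stats.ttest_1samp(sample, popmean=2.0)
print(result.statistic, result.pvalue)

if result.pvalue < 0.05:
    print("Reject the null hypothesis: mean thickness differs from 2 mm")
else:
    print("Fail to reject the null hypothesis")
```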

Testing hypotheses using ANOVA

Suppose we have designed an experiment that tests two new processes against the current process and we want to test whether the results of these new processes are different from the current process. In this case, we can use Analysis of Variance (ANOVA) to help us determine whether there are any differences between the mean values of the three sets of results (for this, we need to assume that each sample is drawn from a normal distribution with a common variance).

In this recipe, we will learn how to use ANOVA to compare multiple samples with one another.

Getting ready

For this recipe, we need the SciPy stats module. We will also need to create a default random number generator instance using the following commands:

from numpy.random import default_rng
rng = default_rng(12345)

How to do it...

Follow these steps to perform a (one-way) ANOVA test to test for differences between three different processes:

  1. First, we will create some...
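The remaining steps are locked, but the one-way ANOVA described above can be sketched with simulated results from the three processes (sample sizes and means below are illustrative):

```python
from numpy.random import default_rng
from scipy import stats

rng = default_rng(12345)

# Results from the current process and the two new processes,
# each drawn from a normal distribution with a common variance
current = rng.normal(4.0, 2.0, size=40)
process_a = rng.normal(6.2, 2.0, size=25)
process_b = rng.normal(4.5, 2.0, size=64)

# One-way ANOVA: the null hypothesis is that all three means are equal
f_stat, p_value = stats.f_oneway(current, process_a, process_b)
print(f_stat, p_value)
```

A small p-value indicates that at least one process mean differs from the others, though ANOVA alone does not say which one.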

Testing hypotheses for non-parametric data

Both t-tests and ANOVA have a major drawback: the population that is being sampled must follow a normal distribution. In many applications, this is not too restrictive, because many real-world population values follow a normal distribution, or results such as the central limit theorem allow us to analyze related data. However, it is simply not true that all possible population values follow a normal distribution in any reasonable way. For these (thankfully, rare) cases, we need some alternative test statistics to use as replacements for t-tests and ANOVA.

In this recipe, we will use a Wilcoxon rank-sum test and the Kruskal-Wallis test to test for differences between two (or more, in the latter case) populations.

Getting ready

For this recipe, we will need the pandas package imported as pd, the SciPy stats module, and a default random number generator instance created using the following commands:

from numpy.random import...
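The recipe is truncated here, but both tests can be sketched on simulated non-normal (exponential) samples:

```python
from numpy.random import default_rng
from scipy import stats

rng = default_rng(12345)

# Samples drawn from skewed, non-normal populations
sample_a = rng.exponential(1.0, size=50)
sample_b = rng.exponential(1.5, size=50)
sample_c = rng.exponential(1.0, size=50)

# Wilcoxon rank-sum test: compares two populations
stat_rs, p_rs = stats.ranksums(sample_a, sample_b)

# Kruskal-Wallis test: compares two or more populations
stat_kw, p_kw = stats.kruskal(sample_a, sample_b, sample_c)
print(p_rs, p_kw)
```

Both tests work on the ranks of the observations rather than their values, which is what frees them from the normality assumption.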

Creating interactive plots with Bokeh

Test statistics and numerical reasoning are good for systematically analyzing sets of data. However, they don't give us a good picture of the whole dataset in the way that a plot does. Numerical values are definitive but can be difficult to interpret, especially in statistics, whereas a plot instantly illustrates differences and trends between sets of data. For this reason, there are a large number of libraries for plotting data in ever more creative ways. One particularly interesting package for producing plots of data is Bokeh, which allows us to create interactive plots in the browser by leveraging JavaScript libraries.

In this recipe, we will learn how to use Bokeh to create an interactive plot that can be displayed in the browser.

Getting ready

For this recipe, we will need the pandas package imported as pd, the NumPy package imported as np, an instance of the default random number generator constructed with the following code, and...
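The recipe is truncated here, but a minimal Bokeh sketch (with an illustrative output filename) might look like the following:

```python
import numpy as np
from numpy.random import default_rng
from bokeh import plotting

rng = default_rng(12345)

x = np.linspace(0.0, 5.0, 100)
y = rng.normal(size=100).cumsum()

# Direct output to an HTML file, then build a figure with a line glyph
plotting.output_file("plot.html")
fig = plotting.figure(title="Random walk", width=600, height=400)
fig.line(x, y)
plotting.save(fig)  # use plotting.show(fig) to open it in a browser
```

The generated HTML page carries its own JavaScript, so the pan, zoom, and hover tools work without a running Python process.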

Further reading

There are a large number of textbooks on statistics and statistical theory. The following books are good references for the statistics covered in this chapter:

  • Mendenhall, W., Beaver, R., and Beaver, B. (2006), Introduction to Probability and Statistics. 12th ed., (Belmont, Calif.: Thomson Brooks/Cole).
  • Freedman, D., Pisani, R., and Purves, R. (2007), Statistics. New York: W.W. Norton.

The pandas documentation (https://pandas.pydata.org/docs/index.html) and the following pandas book serve as good references for working with pandas:

  • McKinney, W. (2017), Python for Data Analysis. 2nd ed., (Sebastopol: O'Reilly Media).

The SciPy documentation (https://docs.scipy.org/doc/scipy/tutorial/stats.html) also contains detailed information about the statistics module that was used several times in this chapter.
