You're reading from Pandas 1.x Cookbook - Second Edition

Product type Book

Published in Feb 2020

Publisher Packt

ISBN-13 9781839213106

Pages 626 pages

Edition 2nd Edition

Languages

Python

Concepts

Programming Language

Authors (2):

Matt Harrison

Theodore Petrou

View More author details

Table of Contents (17) Chapters

Preface

1. Pandas Foundations

2. Essential DataFrame Operations

3. Creating and Persisting DataFrames

4. Beginning Data Analysis

5. Exploratory Data Analysis

6. Selecting Subsets of Data

7. Filtering Rows

8. Index Alignment

9. Grouping for Aggregation, Filtration, and Transformation

10. Restructuring Data into a Tidy Form

11. Combining Pandas Objects

12. Time Series Analysis

13. Visualization with Matplotlib, Pandas, and Seaborn

14. Debugging and Testing Pandas

15. Other Books You May Enjoy

16. Index

DataFrame operations

A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.

When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5 to each value of the DataFrame raises a TypeError as integers cannot be added to strings:

>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "int") to str

To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with 'UGDS_'. These columns represent the fraction of undergraduate students by race. To get started, we import the data and use the institution name as the label for our index, and then select the columns we desire with the .filter method:

>>> colleges = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

This recipe uses multiple operators with a DataFrame to round the undergraduate columns to the nearest hundredth. We will then see how this result is equivalent to the .round method.

How to do it...

pandas does bankers rounding, numbers that are exactly halfway between either side to the even side. Look at what happens to the UGDS_BLACK row of this series when we round it to two decimal places:

>>> name = "Northwest-Shoals Community College"
>>> college_ugds.loc[name]
UGDS_WHITE    0.7912
UGDS_BLACK    0.1250
UGDS_HISP     0.0339
UGDS_ASIAN    0.0036
UGDS_AIAN     0.0088
UGDS_NHPI     0.0006
UGDS_2MOR     0.0012
UGDS_NRA      0.0033
UGDS_UNKN     0.0324
Name: Northwest-Shoals Community College, dtype: float64
>>> college_ugds.loc[name].round(2)
UGDS_WHITE    0.79
UGDS_BLACK    0.12
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

If we add .0001 before rounding, it changes to rounding up:

>>> (college_ugds.loc[name] + 0.0001).round(2)
UGDS_WHITE    0.79
UGDS_BLACK    0.13
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

Let's do this to the DataFrame. To begin our rounding adventure with operators, we will first add .00501 to each value of college_ugds:

>>> college_ugds + 0.00501
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...     0.03831     0.94031  ...   0.01091    0.01881
Universit...     0.59721     0.26501  ...   0.02291    0.01501
Amridge U...     0.30401     0.42421  ...   0.00501    0.27651
Universit...     0.70381     0.13051  ...   0.03821    0.04001
Alabama S...     0.02081     0.92581  ...   0.02931    0.01871
...                  ...         ...  ...       ...        ...
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN
National ...         NaN         NaN  ...       NaN        NaN
Bay Area ...         NaN         NaN  ...       NaN        NaN
Excel Lea...         NaN         NaN  ...       NaN        NaN

Use the floor division operator, //, to round down to the nearest whole number percentage:

>>> (college_ugds + 0.00501) // 0.01
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...         3.0        94.0  ...       1.0        1.0
Universit...        59.0        26.0  ...       2.0        1.0
Amridge U...        30.0        42.0  ...       0.0       27.0
Universit...        70.0        13.0  ...       3.0        4.0
Alabama S...         2.0        92.0  ...       2.0        1.0
...                  ...         ...  ...       ...        ...
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN
National ...         NaN         NaN  ...       NaN        NaN
Bay Area ...         NaN         NaN  ...       NaN        NaN
Excel Lea...         NaN         NaN  ...       NaN        NaN

To complete the rounding exercise, divide by 100:

>>> college_ugds_op_round =(
...     (college_ugds + 0.00501) // 0.01 / 100
... )
>>> college_ugds_op_round.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...        0.03        0.94  ...      0.01       0.01
Universit...        0.59        0.26  ...      0.02       0.01
Amridge U...        0.30        0.42  ...      0.00       0.27
Universit...        0.70        0.13  ...      0.03       0.04
Alabama S...        0.02        0.92  ...      0.02       0.01

Now use the round DataFrame method to do the rounding automatically for us. Due to bankers rounding, we add a small fraction before rounding:

>>> college_ugds_round = (college_ugds + 0.00001).round(2)
>>> college_ugds_round
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...        0.03        0.94  ...      0.01       0.01
Universit...        0.59        0.26  ...      0.02       0.01
Amridge U...        0.30        0.42  ...      0.00       0.27
Universit...        0.70        0.13  ...      0.03       0.04
Alabama S...        0.02        0.92  ...      0.02       0.01
...                  ...         ...  ...       ...        ....
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN
National ...         NaN         NaN  ...       NaN        NaN
Bay Area ...         NaN         NaN  ...       NaN        NaN
Excel Lea...         NaN         NaN  ...       NaN        NaN

Use the equals DataFrame method to test the equality of two DataFrames:
```
>>> college_ugds_op_round.equals(college_ugds_round)
True
```

How it works...

Steps 1 and 2 use the plus operator, which attempts to add a scalar value to each value of each column of the DataFrame. As the columns are all numeric, this operation works as expected. There are some missing values in each of the columns but they stay missing after the operation.

Mathematically, adding .005 should be enough so that the floor division in the next step correctly rounds to the nearest whole percentage. The trouble appears because of the inexactness of floating-point numbers:

>>> 0.045 + 0.005
0.049999999999999996

There is an extra .00001 added to each number to ensure that the floating-point representation has the first four digits the same as the actual value. This works because the maximum precision of all the points in the dataset is four decimal places.

Step 3 applies the floor division operator, //, to all the values in the DataFrame. As we are dividing by a fraction, in essence, it is multiplying each value by 100 and truncating any decimals. Parentheses are needed around the first part of the expression, as floor division has higher precedence than addition. Step 4 uses the division operator to return the decimal to the correct position.

In step 5, we reproduce the previous steps with the round method. Before we can do this, we must again add an extra .00001 to each DataFrame value for a different reason from step 2. NumPy and Python 3 round numbers that are exactly halfway between either side to the even number. The bankers rounding (or ties to even http://bit.ly/2x3V5TU) technique is not usually what is formally taught in schools. It does not consistently bias numbers to the higher side (http://bit.ly/2zhsPy8).

It is necessary here to round up so that both DataFrame values are equal. The .equals method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.

There's more...

Just as with Series, DataFrames have method equivalents of the operators. You may replace the operators with their method equivalents:

>>> college2 = (
...     college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True

You're reading from Pandas 1.x Cookbook - Second Edition

Table of Contents (17) Chapters

DataFrame operations

How to do it...

How it works...

There's more...

Authors (2)

Other recommended products

Personalised recommendations for you