Reader small image

You're reading from  Pandas 1.x Cookbook - Second Edition

Product typeBook
Published inFeb 2020
Reading LevelBeginner
PublisherPackt
ISBN-139781839213106
Edition2nd Edition
Languages
Tools
Right arrow
Authors (2):
Matt Harrison
Matt Harrison
author image
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

Theodore Petrou
Theodore Petrou
author image
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou

View More author details
Right arrow

Introduction

When Series or DataFrames are combined, each dimension of the data automatically aligns on each axis first before any computation happens. This silent and automatic alignment of axes can confuse the uninitiated, but it gives flexibility to the power user. This chapter explores the Index object in-depth before showcasing a variety of recipes that take advantage of its automatic alignment.

Examining the Index object

As was discussed previously, each axis of a Series and a DataFrame has an Index object that labels the values. There are many different types of Index objects, but they all share common behavior. All Index objects, except for the MultiIndex, are single-dimensional data structures that combine the functionality of Python sets and NumPy ndarrays.

In this recipe, we will examine the column index of the college dataset and explore much of its functionality.

How to do it…

  1. Read in the college dataset, and create a variable columns that holds the column index:
    >>> import pandas as pd
    >>> import numpy as np
    >>> college = pd.read_csv("data/college.csv")
    >>> columns = college.columns
    >>> columns
    Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
           'SATVRMID&apos...

Producing Cartesian products

Whenever a Series or DataFrame operates with another Series or DataFrame, the indexes (both the row index and column index) of each object align first before any operation begins. This index alignment happens behind the scenes and can be very surprising for those new to pandas. This alignment always creates a Cartesian product between the indexes unless the indexes are identical.

A Cartesian product is a mathematical term that usually appears in set theory. A Cartesian product between two sets is all the combinations of pairs of both sets. For example, the 52 cards in a standard playing card deck represent a Cartesian product between the 13 ranks (A, 2, 3,…, Q, K) and the four suits.

Producing a Cartesian product isn't always the intended outcome, but it's essential to be aware of how and when it occurs so as to avoid unintended consequences. In this recipe, two Series with overlapping but non-identical indexes are added together...

Exploding indexes

The previous recipe walked through a trivial example of two small Series being added together with unequal indexes. This recipe is more of an "anti-recipe" of what not to do. The Cartesian product of index alignment can produce comically incorrect results when dealing with larger amounts of data.

In this recipe, we add two larger Series that have indexes with only a few unique values but in different orders. The result will explode the number of values in the indexes.

How to do it…

  1. Read in the employee data and set the index to the RACE column:
    >>> employee = pd.read_csv(
    ...     "data/employee.csv", index_col="RACE"
    ... )
    >>> employee.head()
                  UNIQUE_ID POSITION_TITLE  ...   HIRE_DATE    JOB_DATE
    RACE                                    ...                        
    Hispanic/...          0  ASSISTAN...    ...  2006-06-12  2012-10-13
    Hispanic/...          1  LIBRARY ...   ...

Filling values with unequal indexes

When two Series are added together using the plus operator and one of the index labels does not appear in the other, the resulting value is always missing. pandas has the .add method, which provides an option to fill the missing value. Note that these Series do not include duplicate entries, hence there is no need to worry about a Cartesian product exploding the number of entries.

In this recipe, we add together multiple Series from the baseball dataset with unequal (but unique) indexes using the .add method with the fill_value parameter to ensure that there are no missing values in the result.

How to do it…

  1. Read in the three baseball datasets and set playerID as the index:
    >>> baseball_14 = pd.read_csv(
    ...     "data/baseball14.csv", index_col="playerID"
    ... )
    >>> baseball_15 = pd.read_csv(
    ...     "data/baseball15.csv", index_col="playerID"
    ... )
    >>...

Adding columns from different DataFrames

All DataFrames can add new columns to themselves. However, as usual, whenever a DataFrame is adding a new column from another DataFrame or Series, the indexes align first, and then the new column is created.

This recipe uses the employee dataset to append a new column containing the maximum salary of that employee's department.

How to do it…

  1. Import the employee data and select the DEPARTMENT and BASE_SALARY columns in a new DataFrame:
    >>> employee = pd.read_csv("data/employee.csv")
    >>> dept_sal = employee[["DEPARTMENT", "BASE_SALARY"]]
    
  2. Sort this smaller DataFrame by salary within each department:
    >>> dept_sal = dept_sal.sort_values(
    ...     ["DEPARTMENT", "BASE_SALARY"],
    ...     ascending=[True, False],
    ... )
    
  3. Use the .drop_duplicates method to keep the first row of each DEPARTMENT: ...

Highlighting the maximum value from each column

The college dataset has many numeric columns describing different metrics about each school. Many people are interested in schools that perform the best for specific metrics.

This recipe discovers the school that has the maximum value for each numeric column and styles the DataFrame to highlight the information.

How to do it…

  1. Read the college dataset with the institution name as the index:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college.dtypes
    CITY                   object
    STABBR                 object
    HBCU                  float64
    MENONLY               float64
    WOMENONLY             float64
                           ...
    PCTPELL               float64
    PCTFLOAN              float64
    UG25ABV               float64
    MD_EARN_WNE_P10        object
    GRAD_DEBT_MDN_SUPP     object
    Length: 26, dtype: object
    
  2. All the other columns...

Replicating idxmax with method chaining

A good exercise is to attempt an implementation of a built-in DataFrame method on your own. This type of replication can give you a deeper understanding of other pandas methods that you normally wouldn't have come across. .idxmax is a challenging method to replicate using only the methods covered thus far in the book.

This recipe slowly chains together basic methods to eventually find all the row index values that contain a maximum column value.

How to do it…

  1. Load in the college dataset and execute the same operations as the previous recipe to get only the numeric columns that are of interest:
    >>> def remove_binary_cols(df):
    ...     binary_only = df.nunique() == 2
    ...     cols = binary_only[binary_only].index.tolist()
    ...     return df.drop(columns=cols)
    >>> college_n = (
    ...     college
    ...     .assign(
    ...         MD_EARN_WNE_P10=pd.to_numeric(
    ...             college.MD_EARN_WNE_P10...

Finding the most common maximum of columns

The college dataset contains the undergraduate population percentage of eight different races for over 7,500 colleges. It would be interesting to find the race with the highest undergrad population for each school and then find the distribution of this result for the entire dataset. We would be able to answer a question like, "What percentage of institutions have more White students than any other race?"

In this recipe, we find the race with the highest percentage of the undergraduate population for each school with the .idxmax method and then find the distribution of these maximums.

How to do it…

  1. Read in the college dataset and select just those columns with undergraduate race percentage information:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    >>> college_ugds...
lock icon
The rest of the chapter is locked
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020Publisher: PacktISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

author image
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou