Home

Programming

Pandas 1.x Cookbook - Second Edition

By Matt Harrison , Theodore Petrou

Book + AI Assistant

eBook + AI Assistant $63.99 $43.99

Print $79.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

Gain access to our AI Assistant (beta) for an exclusive selection of 500 books, available during your subscription period. Enjoy a personalized, interactive, and narrative experience to engage with the book content on a deeper level.

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Along with your eBook purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Along with your Print book purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook + AI Assistant $63.99 $43.99

Print $79.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Along with your eBook purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Along with your Print book purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.

Publication date:: February 2020
Publisher: Packt
Pages: 626
ISBN: 9781839213106
Download code from GitHub

Introduction

This chapter covers many fundamental operations of the DataFrame. Many of the recipes will be similar to those in Chapter 1, Pandas Foundations, which primarily covered operations on a Series.

Selecting multiple DataFrame columns

We can select a single column by passing the column name to the index operator of a DataFrame. This was covered in the Selecting a column recipe in Chapter 1, Pandas Foundations. It is often necessary to focus on a subset of the current working dataset, which is accomplished by selecting multiple columns.

In this recipe, all the actor and director columns will be selected from the movie dataset.

How to do it...

Read in the movie dataset, and pass in a list of the desired columns to the indexing operator:

>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
...     [
...         "actor_1_name",
...         "actor_2_name",
...         "actor_3_name",
...         "director_name",
...     ]
... ]
>>> movie_actor_director.head()
  actor_1_name actor_2_name actor_3_name director_name
0  CCH Pounder  Joel Dav...    Wes Studi  James Ca...
1  Johnny Depp  Orlando ...  Jack Dav...  Gore Ver...
2  Christop...  Rory Kin...  Stephani...   Sam Mendes
3    Tom Hardy  Christia...  Joseph G...  Christop...
4  Doug Walker   Rob Walker          NaN  Doug Walker

There are instances when one column of a DataFrame needs to be selected. Using the index operation can return either a Series or a DataFrame. If we pass in a list with a single item, we will get back a DataFrame. If we pass in just a string with the column name, we will get a Series back:
```
>>> type(movies[["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies["director_name"])
<class 'pandas.core.series.Series'>
```
We can also use .loc to pull out a column by name. Because this index operation requires that we pass in a row selector first, we will use a colon (:) to indicate a slice that selects all of the rows. This can also return either a DataFrame or a Series:
```
>>> type(movies.loc[:, ["director_name"]])
<class 'pandas.core.frame.DataFrame'>
>>> type(movies.loc[:, "director_name"])
<class 'pandas.core.series.Series'>
```

How it works...

The DataFrame index operator is very flexible and capable of accepting a number of different objects. If a string is passed, it will return a single-dimensional Series. If a list is passed to the indexing operator, it returns a DataFrame of all the columns in the list in the specified order.

Step 2 shows how to select a single column as a DataFrame and as a Series. Usually, a single column is selected with a string, resulting in a Series. When a DataFrame is desired, put the column name in a single-element list.

Step 3 shows how to use the loc attribute to pull out a Series or a DataFrame.

There's more...

Passing a long list inside the indexing operator might cause readability issues. To help with this, you may save all your column names to a list variable first. The following code achieves the same result as step 1:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movie_actor_director = movies[cols]

One of the most common exceptions raised when working with pandas is KeyError. This error is mainly due to mistyping of a column or index name. This same error is raised whenever a multiple column selection is attempted without the use of a list:

>>> movies[
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
Traceback (most recent call last):
  ...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')

Selecting columns with methods

Although column selection is usually done with the indexing operator, there are some DataFrame methods that facilitate their selection in an alternative manner. The .select_dtypes and .filter methods are two useful methods to do this.

If you want to select by type, you need to be familiar with pandas data types. The Understanding data types recipe in Chapter 1, Pandas Foundations, explains the types.

How to do it...

Read in the movie dataset. Shorten the column names for display. Use the .get_dtype_counts method to output the number of columns with each specific data type:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return (
...         str(col)
...         .replace("facebook_likes", "fb")
...         .replace("_for_reviews", "")
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts()
float64    13
int64       3
object     12
dtype: int64

Use the .select_dtypes method to select only the integer columns:

>>> movies.select_dtypes(include="int").head()
   num_voted_users  cast_total_fb  movie_fb
0           886204           4834     33000
1           471220          48350         0
2           275868          11700     85000
3          1144337         106759    164000
4                8            143         0

If you would like to select all the numeric columns, you may pass the string number to the include parameter:

>>> movies.select_dtypes(include="number").head()
   num_critics  duration  ...  aspect_ratio  movie_fb
0        723.0     178.0  ...         1.78      33000
1        302.0     169.0  ...         2.35          0
2        602.0     148.0  ...         2.35      85000
3        813.0     164.0  ...         2.35     164000
4          NaN       NaN  ...          NaN          0

If we wanted integer and string columns we could do the following:

>>> movies.select_dtypes(include=["int", "object"]).head()
   color        direc/_name  ... conte/ating movie_fb
0  Color      James Cameron  ...       PG-13    33000
1  Color     Gore Verbinski  ...       PG-13        0
2  Color         Sam Mendes  ...       PG-13    85000
3  Color  Christopher Nolan  ...       PG-13   164000
4    NaN        Doug Walker  ...         NaN        0

To exclude only floating-point columns, do the following:

>>> movies.select_dtypes(exclude="float").head()
   color director_name  ... content_rating movie_fb
0  Color  James Ca...   ...        PG-13      33000
1  Color  Gore Ver...   ...        PG-13          0
2  Color   Sam Mendes   ...        PG-13      85000
3  Color  Christop...   ...        PG-13     164000
4    NaN  Doug Walker   ...          NaN          0

An alternative method to select columns is with the .filter method. This method is flexible and searches column names (or index labels) based on which parameter is used. Here, we use the like parameter to search for all the Facebook columns or the names that contain the exact string, fb. The like parameter is checking for substrings in column names:

>>> movies.filter(like="fb").head()
   director_fb  actor_3_fb  ...  actor_2_fb  movie_fb
0          0.0       855.0  ...       936.0     33000
1        563.0      1000.0  ...      5000.0         0
2          0.0       161.0  ...       393.0     85000
3      22000.0     23000.0  ...     23000.0    164000
4        131.0         NaN  ...        12.0         0

The .filter method has more tricks (or parameters) up its sleeve. If you use the items parameters, you can pass in a list of column names:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movies.filter(items=cols).head()
      actor_1_name  ...      director_name
0      CCH Pounder  ...      James Cameron
1      Johnny Depp  ...     Gore Verbinski
2  Christoph Waltz  ...         Sam Mendes
3        Tom Hardy  ...  Christopher Nolan
4      Doug Walker  ...        Doug Walker

The .filter method allows columns to be searched with regular expressions using the regex parameter. Here, we search for all columns that have a digit somewhere in their name:

>>> movies.filter(regex=r"\d").head()
   actor_3_fb actor_2_name  ...  actor_3_name actor_2_fb
0       855.0  Joel Dav...  ...    Wes Studi       936.0
1      1000.0  Orlando ...  ...  Jack Dav...      5000.0
2       161.0  Rory Kin...  ...  Stephani...       393.0
3     23000.0  Christia...  ...  Joseph G...     23000.0
4         NaN   Rob Walker  ...          NaN        12.0

How it works...

Step 1 lists the frequencies of all the different data types. Alternatively, you may use the .dtypes attribute to get the exact data type for each column. The .select_dtypes method accepts either a list or single data type in its include or exclude parameters and returns a DataFrame with columns of just those given data types (or not those types if excluding columns). The list values may be either the string name of the data type or the actual Python object.

The .filter method selects columns by only inspecting the column names and not the actual data values. It has three mutually exclusive parameters: items, like, and regex, only one of which can be used at a time.

The like parameter takes a string and attempts to find all the column names that contain that exact string somewhere in the name. To gain more flexibility, you may use the regex parameter instead to select column names through a regular expression. This particular regular expression, r'\d', represents all digits from zero to nine and matches any string with at least a single digit in it.

The filter method comes with another parameter, items, which takes a list of exact column names. This is nearly an exact duplication of the index operation, except that a KeyError will not be raised if one of the strings does not match a column name. For instance, movies.filter(items=['actor_1_name', 'asdf']) runs without error and returns a single column DataFrame.

There's more...

One confusing aspect of .select_dtypes is its flexibility to take both strings and Python objects. The following list should clarify all the possible ways to select the many different column data types. There is no standard or preferred method of referring to data types in pandas, so it's good to be aware of both ways:

np.number, 'number' – Selects both integers and floats regardless of size
np.float64, np.float_, float, 'float64', 'float_', 'float' – Selects only 64-bit floats
np.float16, np.float32, np.float128, 'float16', 'float32', 'float128' – Respectively selects exactly 16, 32, and 128-bit floats
np.floating, 'floating' – Selects all floats regardless of size
np.int0, np.int64, np.int_, int, 'int0', 'int64', 'int_', 'int' – Selects only 64-bit integers
np.int8, np.int16, np.int32, 'int8', 'int16', 'int32' – Respectively selects exactly 8, 16, and 32-bit integers
np.integer, 'integer' – Selects all integers regardless of size
'Int64' – Selects nullable integer; no NumPy equivalent
np.object, 'object', 'O' – Select all object data types
np.datetime64, 'datetime64', 'datetime' – All datetimes are 64 bits
np.timedelta64, 'timedelta64', 'timedelta' – All timedeltas are 64 bits
pd.Categorical, 'category' – Unique to pandas; no NumPy equivalent

Because all integers and floats default to 64 bits, you may select them by using the string 'int' or 'float' as you can see from the preceding bullet list. If you want to select all integers and floats regardless of their specific size, use the string 'number'.

Ordering column names

One of the first tasks to consider after initially importing a dataset as a DataFrame is to analyze the order of the columns. As humans we are used to reading languages from left to right, which impacts our interpretations of the data. It's far easier to find and interpret information when column order is given consideration.

There are no standardized set of rules that dictate how columns should be organized within a dataset. However, it is good practice to develop a set of guidelines that you consistently follow. This is especially true if you work with a group of analysts who share lots of datasets.

The following is a guideline to order columns:

Classify each column as either categorical or continuous
Group common columns within the categorical and continuous columns
Place the most important groups of columns first with categorical columns before continuous ones

This recipe shows you how to order the columns with this guideline. There are many possible orderings that are sensible.

How to do it...

Read in the movie dataset, and scan the data:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)

Output all the column names and scan for similar categorical and continuous columns:

>>> movies.columns
Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')

The columns don't appear to have any logical ordering to them. Organize the names sensibly into lists so that the guideline from the previous section is followed:

>>> cat_core = [
...     "movie_title",
...     "title_year",
...     "content_rating",
...     "genres",
... ]
>>> cat_people = [
...     "director_name",
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
... ]
>>> cat_other = [
...     "color",
...     "country",
...     "language",
...     "plot_keywords",
...     "movie_imdb_link",
... ]
>>> cont_fb = [
...     "director_fb",
...     "actor_1_fb",
...     "actor_2_fb",
...     "actor_3_fb",
...     "cast_total_fb",
...     "movie_fb",
... ]
>>> cont_finance = ["budget", "gross"]
>>> cont_num_reviews = [
...     "num_voted_users",
...     "num_user",
...     "num_critic",
... ]
>>> cont_other = [
...     "imdb_score",
...     "duration",
...     "aspect_ratio",
...     "facenumber_in_poster",
... ]

Concatenate all the lists together to get the final column order. Also, ensure that this list contains all the columns from the original:

>>> new_col_order = (
...     cat_core
...     + cat_people
...     + cat_other
...     + cont_fb
...     + cont_finance
...     + cont_num_reviews
...     + cont_other
... )
>>> set(movies.columns) == set(new_col_order)
True

Pass the list with the new column order to the indexing operator of the DataFrame to reorder the columns:

>>> movies[new_col_order].head()
   movie_title  title_year  ... aspect_ratio facenumber_in_poster
0       Avatar      2009.0  ...         1.78          0.0
1  Pirates ...      2007.0  ...         2.35          0.0
2      Spectre      2015.0  ...         2.35          1.0
3  The Dark...      2012.0  ...         2.35          0.0
4  Star War...         NaN  ...          NaN          0.0

How it works...

You can select a subset of columns from a DataFrame, with a list of specific column names. For instance, movies[['movie_title', 'director_name']] creates a new DataFrame with only the movie_title and director_name columns. Selecting columns by name is the default behavior of the index operator for a pandas DataFrame.

Step 3 neatly organizes all of the column names into separate lists based on their type (categorical or continuous) and by how similar their data is. The most important columns, such as the title of the movie, are placed first.

Step 4 concatenates all of the lists of column names and validates that this new list contains the same exact values as the original column names. Python sets are unordered and the equality statement checks whether each member of one set is a member of the other. Manually ordering columns in this recipe is susceptible to human error as it's easy to mistakenly forget a column in the new column list.

Step 5 completes the reordering by passing the new column order as a list to the indexing operator. This new order is now much more sensible than the original.

There's more...

There are alternative guidelines for ordering columns besides the suggestion mentioned earlier. Hadley Wickham's seminal paper on Tidy Data suggests placing the fixed variables first, followed by measured variables. As this data does not come from a controlled experiment, there is some flexibility in determining which variables are fixed and which ones are measured. Good candidates for measured variables are those that we would like to predict, such as gross, the budget, or the imdb_score. For instance, in this ordering, we can mix categorical and continuous variables. It might make more sense to place the column for the number of Facebook likes directly after the name of that actor. You can, of course, come up with your own guidelines for column order as the computational parts are unaffected by it.

Summarizing a DataFrame

In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.

In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.

How to do it...

Read in the movie dataset, and examine the basic descriptive properties, .shape, .size, and .ndim, along with running the len function:

>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
>>> len(movies)
4916

The .count method shows the number of non-missing values for each column. It is an aggregation method as it summarizes every column in a single value. The output is a Series that has the original column names as its index:

>>> movies.count()
color                      4897
director_name              4814
num_critic_for_reviews     4867
duration                   4901
director_facebook_likes    4814
                           ... 
title_year                 4810
actor_2_facebook_likes     4903
imdb_score                 4916
aspect_ratio               4590
movie_facebook_likes       4916
Length: 28, dtype: int64

The other methods that compute summary statistics, .min, .max, .mean, .median, and .std, return Series that have the column names of the numeric columns in the index and their aggregations as the values:

>>> movies.min()
num_critic_for_reviews        1.00
duration                      7.00
director_facebook_likes       0.00
actor_3_facebook_likes        0.00
actor_1_facebook_likes        0.00
                            ...   
title_year                 1916.00
actor_2_facebook_likes        0.00
imdb_score                    1.60
aspect_ratio                  1.18
movie_facebook_likes          0.00
Length: 16, dtype: float64

The .describe method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistics names as its index. I like to transpose the results using .T as I can usually fit more information on the screen that way:

>>> movies.describe().T
               count         mean  ...       75%       max
num_criti...  4867.0   137.988905  ...    191.00     813.0
duration      4901.0   107.090798  ...    118.00     511.0
director_...  4814.0   691.014541  ...    189.75   23000.0
actor_3_f...  4893.0   631.276313  ...    633.00   23000.0
actor_1_f...  4909.0  6494.488491  ...  11000.00  640000.0
...              ...          ...  ...       ...       ...
title_year    4810.0  2002.447609  ...   2011.00    2016.0
actor_2_f...  4903.0  1621.923516  ...    912.00  137000.0
imdb_score    4916.0     6.437429  ...      7.20       9.5
aspect_ratio  4590.0     2.222349  ...      2.35      16.0
movie_fac...  4916.0  7348.294142  ...   2000.00  349000.0

It is possible to specify exact quantiles in the .describe method using the percentiles parameter:

>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
               count         mean  ...       99%       max
num_criti...  4867.0   137.988905  ...    546.68     813.0
duration      4901.0   107.090798  ...    189.00     511.0
director_...  4814.0   691.014541  ...  16000.00   23000.0
actor_3_f...  4893.0   631.276313  ...  11000.00   23000.0
actor_1_f...  4909.0  6494.488491  ...  44920.00  640000.0
...              ...          ...  ...       ...       ...
title_year    4810.0  2002.447609  ...   2016.00    2016.0
actor_2_f...  4903.0  1621.923516  ...  17000.00  137000.0
imdb_score    4916.0     6.437429  ...      8.50       9.5
aspect_ratio  4590.0     2.222349  ...      4.00      16.0
movie_fac...  4916.0  7348.294142  ...  93850.00  349000.0

How it works...

Step 1 gives basic information on the size of the dataset. The .shape attribute returns a tuple with the number of rows and columns. The .size attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len function, it returns the number of rows.

The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.

If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. This method ignores string columns by default.

Note that numeric columns have missing values but have a result returned by .describe. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna parameter to False. This will cause pandas to return NaN for all these aggregation methods if there exists at least a single missing value.

The .describe method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles parameter. See the Developing a data analysis routine recipe for more on the .describe method.

There's more...

To see how the .skipna parameter affects the outcome, we can set its value to False and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:

>>> movies.min(skipna=False)
num_critic_for_reviews     NaN
duration                   NaN
director_facebook_likes    NaN
actor_3_facebook_likes     NaN
actor_1_facebook_likes     NaN
                          ... 
title_year                 NaN
actor_2_facebook_likes     NaN
imdb_score                 1.6
aspect_ratio               NaN
movie_facebook_likes       0.0
Length: 16, dtype: float64

Chaining DataFrame methods

The Chaining Series methods recipe in Chapter 1, Pandas Foundations, showcased several examples of chaining Series methods together. All the method chains in this chapter will begin from a DataFrame. One of the keys to method chaining is to know the exact object being returned during each step of the chain. In pandas, this will nearly always be a DataFrame, Series, or scalar value.

In this recipe, we count all the missing values in each column of the movie dataset.

How to do it...

We will use the .isnull method to get a count of the missing values. This method will change every value to a Boolean, indicating whether it is missing:

>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.isnull().head()
   color  director_name  ...  aspect_ratio  movie_fb
0  False        False    ...        False      False
1  False        False    ...        False      False
2  False        False    ...        False      False
3  False        False    ...        False      False
4   True        False    ...         True      False

We will chain the .sum method that interprets True and False as 1 and 0, respectively. Because this is a reduction method, it aggregates the results into a Series:

>>> (movies.isnull().sum().head())
color             19
director_name    102
num_critic        49
duration          15
director_fb      102
dtype: int64

We can go one step further and take the sum of this Series and return the count of the total number of missing values in the entire DataFrame as a scalar value:
```
>>> movies.isnull().sum().sum()
2654
```
A way to determine whether there are any missing values in the DataFrame is to use the .any method twice in succession:
```
>>> movies.isnull().any().any()
True
```

How it works...

The .isnull method returns a DataFrame the same size as the calling DataFrame but with all values transformed to Booleans. See the counts of the following data types to verify this:

>>> movies.isnull().dtypes.value_counts()
bool    28
dtype: int64

In Python, Booleans evaluate to 0 and 1, and this makes it possible to sum them by column, as done in step 2. The resulting Series itself also has a .sum method, which gets us the grand total of missing values in the DataFrame.

In step 4, the .any method on a DataFrame returns a Series of Booleans indicating if there exists at least one True for each column. The .any method is chained again on this resulting Series of Booleans to determine if any of the columns have missing values. If step 4 evaluates as True, then there is at least one missing value in the entire DataFrame.

There's more...

Most of the columns in the movie dataset with the object data type contain missing values. By default, aggregation methods (.min, .max, and .sum), do not return anything for object columns. as seen in the following code snippet, which selects three object columns and attempts to find the maximum value of each one:

>>> movies[["color", "movie_title", "color"]].max()
Series([], dtype: float64)

To force pandas to return something for each column, we must fill in the missing values. Here, we choose an empty string:

>>> movies.select_dtypes(["object"]).fillna("").max()
color                            Color
director_name            Étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             Óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

For purposes of readability, method chains are often written as one method call per line surrounded by parentheses. This makes it easier to read and insert comments on what is returned at each step of the chain, or comment out lines to debug what is happening:

>>> (movies.select_dtypes(["object"]).fillna("").max())
color                            Color
director_name            Étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             Óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

DataFrame operations

A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.

When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5 to each value of the DataFrame raises a TypeError as integers cannot be added to strings:

>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "int") to str

To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with 'UGDS_'. These columns represent the fraction of undergraduate students by race. To get started, we import the data and use the institution name as the label for our index, and then select the columns we desire with the .filter method:

>>> colleges = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

This recipe uses multiple operators with a DataFrame to round the undergraduate columns to the nearest hundredth. We will then see how this result is equivalent to the .round method.

How to do it...

pandas does bankers rounding, numbers that are exactly halfway between either side to the even side. Look at what happens to the UGDS_BLACK row of this series when we round it to two decimal places:

>>> name = "Northwest-Shoals Community College"
>>> college_ugds.loc[name]
UGDS_WHITE    0.7912
UGDS_BLACK    0.1250
UGDS_HISP     0.0339
UGDS_ASIAN    0.0036
UGDS_AIAN     0.0088
UGDS_NHPI     0.0006
UGDS_2MOR     0.0012
UGDS_NRA      0.0033
UGDS_UNKN     0.0324
Name: Northwest-Shoals Community College, dtype: float64
>>> college_ugds.loc[name].round(2)
UGDS_WHITE    0.79
UGDS_BLACK    0.12
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

If we add .0001 before rounding, it changes to rounding up:

>>> (college_ugds.loc[name] + 0.0001).round(2)
UGDS_WHITE    0.79
UGDS_BLACK    0.13
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

Let's do this to the DataFrame. To begin our rounding adventure with operators, we will first add .00501 to each value of college_ugds:

>>> college_ugds + 0.00501
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...     0.03831     0.94031  ...   0.01091    0.01881
Universit...     0.59721     0.26501  ...   0.02291    0.01501
Amridge U...     0.30401     0.42421  ...   0.00501    0.27651
Universit...     0.70381     0.13051  ...   0.03821    0.04001
Alabama S...     0.02081     0.92581  ...   0.02931    0.01871
...                  ...         ...  ...       ...        ...
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN
National ...         NaN         NaN  ...       NaN        NaN
Bay Area ...         NaN         NaN  ...       NaN        NaN
Excel Lea...         NaN         NaN  ...       NaN        NaN

Use the floor division operator, //, to round down to the nearest whole number percentage:

>>> (college_ugds + 0.00501) // 0.01
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...         3.0        94.0  ...       1.0        1.0
Universit...        59.0        26.0  ...       2.0        1.0
Amridge U...        30.0        42.0  ...       0.0       27.0
Universit...        70.0        13.0  ...       3.0        4.0
Alabama S...         2.0        92.0  ...       2.0        1.0
...                  ...         ...  ...       ...        ...
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN
National ...         NaN         NaN  ...       NaN        NaN
Bay Area ...         NaN         NaN  ...       NaN        NaN
Excel Lea...         NaN         NaN  ...       NaN        NaN

To complete the rounding exercise, divide by 100:

>>> college_ugds_op_round =(
...     (college_ugds + 0.00501) // 0.01 / 100
... )
>>> college_ugds_op_round.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...        0.03        0.94  ...      0.01       0.01
Universit...        0.59        0.26  ...      0.02       0.01
Amridge U...        0.30        0.42  ...      0.00       0.27
Universit...        0.70        0.13  ...      0.03       0.04
Alabama S...        0.02        0.92  ...      0.02       0.01

Now use the round DataFrame method to do the rounding automatically for us. Due to bankers rounding, we add a small fraction before rounding:

>>> college_ugds_round = (college_ugds + 0.00001).round(2)
>>> college_ugds_round
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...        0.03        0.94  ...      0.01       0.01
Universit...        0.59        0.26  ...      0.02       0.01
Amridge U...        0.30        0.42  ...      0.00       0.27
Universit...        0.70        0.13  ...      0.03       0.04
Alabama S...        0.02        0.92  ...      0.02       0.01
...                  ...         ...  ...       ...        ....
SAE Insti...         NaN         NaN  ...       NaN        NaN
Rasmussen...         NaN         NaN  ...       NaN        NaN
National ...         NaN         NaN  ...       NaN        NaN
Bay Area ...         NaN         NaN  ...       NaN        NaN
Excel Lea...         NaN         NaN  ...       NaN        NaN

Use the equals DataFrame method to test the equality of two DataFrames:
```
>>> college_ugds_op_round.equals(college_ugds_round)
True
```

How it works...

Steps 1 and 2 use the plus operator, which attempts to add a scalar value to each value of each column of the DataFrame. As the columns are all numeric, this operation works as expected. There are some missing values in each of the columns but they stay missing after the operation.

Mathematically, adding .005 should be enough so that the floor division in the next step correctly rounds to the nearest whole percentage. The trouble appears because of the inexactness of floating-point numbers:

>>> 0.045 + 0.005
0.049999999999999996

There is an extra .00001 added to each number to ensure that the floating-point representation has the first four digits the same as the actual value. This works because the maximum precision of all the points in the dataset is four decimal places.

Step 3 applies the floor division operator, //, to all the values in the DataFrame. As we are dividing by a fraction, in essence, it is multiplying each value by 100 and truncating any decimals. Parentheses are needed around the first part of the expression, as floor division has higher precedence than addition. Step 4 uses the division operator to return the decimal to the correct position.

In step 5, we reproduce the previous steps with the round method. Before we can do this, we must again add an extra .00001 to each DataFrame value for a different reason from step 2. NumPy and Python 3 round numbers that are exactly halfway between either side to the even number. The bankers rounding (or ties to even http://bit.ly/2x3V5TU) technique is not usually what is formally taught in schools. It does not consistently bias numbers to the higher side (http://bit.ly/2zhsPy8).

It is necessary here to round up so that both DataFrame values are equal. The .equals method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.

There's more...

Just as with Series, DataFrames have method equivalents of the operators. You may replace the operators with their method equivalents:

>>> college2 = (
...     college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True

Comparing missing values

pandas uses the NumPy NaN (np.nan) object to represent a missing value. This is an unusual object and has interesting mathematical properties. For instance, it is not equal to itself. Even Python's None object evaluates as True when compared to itself:

>>> np.nan == np.nan
False
>>> None == None
True

All other comparisons against np.nan also return False, except not equal to (!=):

>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True

Getting ready

Series and DataFrames use the equals operator, ==, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals method.

As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

How to do it...

To get an idea of how the equals operator works, let's compare each element to a scalar value:

>>> college_ugds == 0.0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False

This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values. You may be tempted to use the equals operator to compare two DataFrames with one another on an element-by-element basis. Take, for instance, college_ugds compared against itself, as follows:

>>> college_self_compare = college_ugds == college_ugds
>>> college_self_compare.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Amridge U...        True        True  ...      True       True
Universit...        True        True  ...      True       True
Alabama S...        True        True  ...      True       True

At first glance, all the values appear to be equal, as you would expect. However, using the .all method to determine if each column contains only True values yields an unexpected result:

>>> college_self_compare.all()
UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

This happens because missing values do not compare equally with one another. If you tried to count missing values using the equal operator and summing up the Boolean columns, you would get zero for each one:

>>> (college_ugds == np.nan).sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

Instead of using == to find missing numbers, use the .isna method:

>>> college_ugds.isna().sum()
UGDS_WHITE    661
UGDS_BLACK    661
UGDS_HISP     661
UGDS_ASIAN    661
UGDS_AIAN     661
UGDS_NHPI     661
UGDS_2MOR     661
UGDS_NRA      661
UGDS_UNKN     661
dtype: int64

The correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the .equals method. This method treats NaNs that are in the same location as equal (note that the .eq method is the equivalent of ==):
```
>>> college_ugds.equals(college_ugds)
True
```

How it works...

Step 1 compares a DataFrame to a scalar value while step 2 compares a DataFrame with another DataFrame. Both operations appear to be quite simple and intuitive at first glance. The second operation is checking whether the DataFrames have identically labeled indexes and thus the same number of elements. The operation will fail if this isn't the case.

Step 3 verifies that none of the columns in the DataFrames are equivalent to each other. Step 4 further shows the non-equivalence of np.nan and itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals method, which always returns a Boolean scalar value.

There's more...

All the comparison operators have method counterparts that allow for more functionality. Somewhat confusingly, the .eq DataFrame method does element-by-element comparison, just like the equals (==) operator. The .eq method is not at all the same as the .equals method. The following code duplicates step 1:

>>> college_ugds.eq(0.0019)  # same as college_ugds == .0019
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Amridge U...       False       False  ...     False      False
Universit...       False       False  ...     False      False
Alabama S...       False       False  ...     False      False
...                  ...         ...  ...       ...        ...
SAE Insti...       False       False  ...     False      False
Rasmussen...       False       False  ...     False      False
National ...       False       False  ...     False      False
Bay Area ...       False       False  ...     False      False
Excel Lea...       False       False  ...     False      False

Inside the pandas.testing sub-package, a function exists that developers should use when creating unit tests. The assert_frame_equal function raises an AssertionError if two DataFrames are not equal. It returns None if the two DataFrames are equal:

>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True

Unit tests are a very important part of software development and ensure that the code is running correctly. pandas contains many thousands of unit tests that help ensure that it is running properly. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).

Transposing the direction of a DataFrame operation

Many DataFrame methods have an axis parameter. This parameter controls the direction in which the operation takes place. Axis parameters can be 'index' (or 0) or 'columns' (or 1). I prefer the string versions are they are more explicit and tend to make the code easier to read.

Nearly all DataFrame methods default the axis parameter to 0, which applies to operations along the index. This recipe shows you how to invoke the same method along both axes.

How to do it...

Read in the college dataset; the columns that begin with UGDS represent the percentage of the undergraduate students of a particular race. Use the filter method to select these columns:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

Now that the DataFrame contains homogenous column data, operations can be sensibly done both vertically and horizontally. The .count method returns the number of non-missing values. By default, its axis parameter is set to 0:
```
>>> college_ugds.count()
UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64
```
The axis parameter is almost always set to 0. So, step 2 is equivalent to both college_ugds.count(axis=0) and college_ugds.count(axis='index').

Changing the axis parameter to 'columns' changes the direction of the operation so that we get back a count of non-missing items in each row:

>>> college_ugds.count(axis="columns").head()
INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
dtype: int64

Instead of counting non-missing values, we can sum all the values in each row. Each row of percentages should add up to 1. The .sum method may be used to verify this:

>>> college_ugds.sum(axis="columns").head()
INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

To get an idea of the distribution of each column, the .median method can be used:

>>> college_ugds.median(axis="index")
UGDS_WHITE    0.55570
UGDS_BLACK    0.10005
UGDS_HISP     0.07140
UGDS_ASIAN    0.01290
UGDS_AIAN     0.00260
UGDS_NHPI     0.00000
UGDS_2MOR     0.01750
UGDS_NRA      0.00000
UGDS_UNKN     0.01430
dtype: float64

How it works...

The direction of operation on the axis is one of the more confusing aspects of pandas. Many pandas users have difficulty remembering the meaning of the axis parameter. I remember them by reminding myself that a Series only has one axis, the index (or 0). A DataFrame also has an index (axis 0) and columns (axis 1).

There's more...

The .cumsum method with axis=1 accumulates the race percentages across each row. It gives a slightly different view of the data. For example, it is very easy to see the exact percentage of white and black students for each school:

>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9686  ...    0.9862     1.0000
Universit...      0.5922      0.8522  ...    0.9899     0.9999
Amridge U...      0.2990      0.7182  ...    0.7285     1.0000
Universit...      0.6988      0.8243  ...    0.9650     1.0000
Alabama S...      0.0158      0.9366  ...    0.9863     1.0000

Determining college campus diversity

Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their top 10 diverse colleges with Diversity Index are given as follows:

>>> pd.read_csv(
...     "data/college_diversity.csv", index_col="School"
... )
                                                   Diversity Index
School
Rutgers University--Newark  Newark, NJ                        0.76
Andrews University  Berrien Springs, MI                       0.74
Stanford University  Stanford, CA                             0.74
University of Houston  Houston, TX                            0.74
University of Nevada--Las Vegas  Las Vegas, NV                0.74
University of San Francisco  San Francisco, CA                0.74
San Francisco State University  San Francisco, CA             0.73
University of Illinois--Chicago  Chicago, IL                  0.73
New Jersey Institute of Technology  Newark, NJ                0.72
Texas Woman's University  Denton, TX                          0.72

Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the count of the number of races having greater than 15% of the student population.

How to do it...

Read in the college dataset, and filter for just the undergraduate race columns:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

Many of these colleges have missing values for all their race columns. We can count all the missing values for each row and sort the resulting Series from the highest to lowest. This will reveal the colleges that have missing values:

>>> (
...     college_ugds.isnull()
...     .sum(axis="columns")
...     .sort_values(ascending=False)
...     .head()
... )
INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64

Now that we have seen the colleges that are missing all their race columns, we can use the .dropna method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:

>>> college_ugds = college_ugds.dropna(how="all")
>>>; college_ugds.isnull().sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

There are no missing values left in the dataset. We can now calculate our diversity metric. To get started, we will use the greater than or equal DataFrame method, .ge, to return a DataFrame with a Boolean value for each cell:

>>> college_ugds.ge(0.15)
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False        True  ...     False      False
Universit...        True        True  ...     False      False
Amridge U...        True        True  ...     False       True
Universit...        True       False  ...     False      False
Alabama S...       False        True  ...     False      False
...                  ...         ...  ...       ...        ...
Hollywood...        True        True  ...     False      False
Hollywood...       False        True  ...     False      False
Coachella...        True       False  ...     False      False
Dewey Uni...       False       False  ...     False      False
Coastal P...        True        True  ...     False      False

From here, we can use the .sum method to count the True values for each college. Notice that a Series is returned:

>>> diversity_metric = college_ugds.ge(0.15).sum(
...     axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64

To get an idea of the distribution, we will use the .value_counts method on this Series:

>>> diversity_metric.value_counts()
1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

Amazingly, two schools have more than 15% in five different race categories. Let's sort the diversity_metric Series to find out which ones they are:

>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64

It seems a little suspicious that schools can be that diverse. Let's look at the raw percentages from these top two schools. We will use .loc to select rows based on the index label:

>>> college_ugds.loc[
...     [
...         "Regency Beauty Institute-Austin",
...         "Central Texas Beauty College-Temple",
...     ]
... ]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Regency B...      0.1867      0.2133  ...       0.0     0.2667
Central T...      0.1616      0.2323  ...       0.0     0.1515

It appears that several categories were aggregated into the unknown and two or more races column. Regardless of this, they both appear to be quite diverse. We can see how the top five US News schools fared with this basic diversity metric:

>>> us_news_top = [
...     "Rutgers University-Newark",
...     "Andrews University",
...     "Stanford University",
...     "University of Houston",
...     "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64

How it works...

Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna method in step 3 has the how parameter, which defaults to the string 'any', but may also be changed to 'all'. When set to 'any', it drops rows that contain one or more missing values. When set to 'all', it only drops rows where all values are missing.

In this case, we conservatively drop rows that are missing all values. This is because it's possible that some missing values represent 0 percent. This did not happen to be the case here, as there were no missing values after the dropna method was performed. If there were still missing values, we could have run the .fillna(0) method to fill all the remaining values with 0.

Step 5 begins our diversity metric calculation using the greater than or equal to method, .ge. This results in a DataFrame of all Booleans, which is summed horizontally by setting axis='columns'.

The .value_counts method is used in step 6 to produce a distribution of our diversity metric. It is quite rare for schools to have three races with 15% or more of the undergraduate student population. Step 7 and step 8 find two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two or more categories.

Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.

There's more...

Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:

>>> (
...     college_ugds.max(axis=1)
...     .sort_values(ascending=False)
...     .head(10)
... )
INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64

We can also determine if any school has all nine race categories exceeding 1%:

>>> (college_ugds > 0.01).all(axis=1).any()
True

About the Authors

Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Browse publications by this author
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Browse publications by this author

Machine Learning for Algorithmic Trading - Second Edition

Leverage machine learning to design and back-test automated trading strategies for real-world markets using pandas, TA-Lib, scikit-learn, LightGBM, SpaCy, Gensim, TensorFlow 2, Zipline, backtrader, Alphalens, and pyfolio.

By Stefan Jansen

The Python Workshop

Learn the fundamentals of clean, effective Python coding and build the practical skills to tackle your own software development or data science projects

By Andrew Bird and 4 more

Data Engineering with Python

Build, monitor, and manage real-time data pipelines to create data engineering infrastructure efficiently using open-source Apache projects

By Paul Crickard

Kubernetes and Docker - An Enterprise Guide

Apply Kubernetes beyond the basics of Kubernetes clusters by implementing IAM using OIDC and Active Directory, Layer 4 load balancing using MetalLB, advanced service integration, security, auditing, and CI/CD

By Scott Surovich and 1 more