Introduction
This chapter covers many fundamental operations of the DataFrame. Many of the recipes will be similar to those in Chapter 1, Pandas Foundations, which primarily covered operations on a Series.
Selecting multiple DataFrame columns
We can select a single column by passing the column name to the index operator of a DataFrame. This was covered in the Selecting a column recipe in Chapter 1, Pandas Foundations. It is often necessary to focus on a subset of the current working dataset, which is accomplished by selecting multiple columns.
In this recipe, all the actor and director columns will be selected from the movie dataset.
How to do it...
- Read in the movie dataset, and pass in a list of the desired columns to the indexing operator:
>>> import pandas as pd >>> import numpy as np >>> movies = pd.read_csv("data/movie.csv") >>> movie_actor_director = movies[ ... [ ... "actor_1_name", ... "actor_2_name", ... "actor_3_name", ... "director_name", ... ] ... ] >>> movie_actor_director.head() actor_1_name actor_2_name actor_3_name director_name 0 CCH Pounder Joel Dav... Wes Studi James Ca... 1 Johnny Depp Orlando ... Jack Dav... Gore Ver... 2 Christop... Rory Kin... Stephani... Sam Mendes 3 Tom Hardy Christia... Joseph G... Christop... 4 Doug Walker Rob Walker NaN Doug Walker
- There are instances when one column of a DataFrame needs to be selected. Using the index operation can return either a Series or a DataFrame. If we pass in a list with a single item, we will get back a DataFrame. If we pass in just a string with the column name, we will get a Series back:
>>> type(movies[["director_name"]]) <class 'pandas.core.frame.DataFrame'> >>> type(movies["director_name"]) <class 'pandas.core.series.Series'>
- We can also use
.loc
to pull out a column by name. Because this index operation requires that we pass in a row selector first, we will use a colon (:
) to indicate a slice that selects all of the rows. This can also return either a DataFrame or a Series:>>> type(movies.loc[:, ["director_name"]]) <class 'pandas.core.frame.DataFrame'> >>> type(movies.loc[:, "director_name"]) <class 'pandas.core.series.Series'>
How it works...
The DataFrame index operator is very flexible and capable of accepting a number of different objects. If a string is passed, it will return a single-dimensional Series. If a list is passed to the indexing operator, it returns a DataFrame of all the columns in the list in the specified order.
Step 2 shows how to select a single column as a DataFrame and as a Series. Usually, a single column is selected with a string, resulting in a Series. When a DataFrame is desired, put the column name in a single-element list.
Step 3 shows how to use the loc
attribute to pull out a Series or a DataFrame.
There's more...
Passing a long list inside the indexing operator might cause readability issues. To help with this, you may save all your column names to a list variable first. The following code achieves the same result as step 1:
>>> cols = [
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
>>> movie_actor_director = movies[cols]
One of the most common exceptions raised when working with pandas is KeyError
. This error is mainly due to mistyping of a column or index name. This same error is raised whenever a multiple column selection is attempted without the use of a list:
>>> movies[
... "actor_1_name",
... "actor_2_name",
... "actor_3_name",
... "director_name",
... ]
Traceback (most recent call last):
...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
Selecting columns with methods
Although column selection is usually done with the indexing operator, there are some DataFrame methods that facilitate their selection in an alternative manner. The .select_dtypes
and .filter
methods are two useful methods to do this.
If you want to select by type, you need to be familiar with pandas data types. The Understanding data types recipe in Chapter 1, Pandas Foundations, explains the types.
How to do it...
- Read in the movie dataset. Shorten the column names for display. Use the
.get_dtype_counts
method to output the number of columns with each specific data type:>>> movies = pd.read_csv("data/movie.csv") >>> def shorten(col): ... return ( ... str(col) ... .replace("facebook_likes", "fb") ... .replace("_for_reviews", "") ... ) >>> movies = movies.rename(columns=shorten) >>> movies.dtypes.value_counts() float64 13 int64 3 object 12 dtype: int64
- Use the
.select_dtypes
method to select only the integer columns:>>> movies.select_dtypes(include="int").head() num_voted_users cast_total_fb movie_fb 0 886204 4834 33000 1 471220 48350 0 2 275868 11700 85000 3 1144337 106759 164000 4 8 143 0
- If you would like to select all the numeric columns, you may pass the string
number
to theinclude
parameter:>>> movies.select_dtypes(include="number").head() num_critics duration ... aspect_ratio movie_fb 0 723.0 178.0 ... 1.78 33000 1 302.0 169.0 ... 2.35 0 2 602.0 148.0 ... 2.35 85000 3 813.0 164.0 ... 2.35 164000 4 NaN NaN ... NaN 0
- If we wanted integer and string columns we could do the following:
>>> movies.select_dtypes(include=["int", "object"]).head() color direc/_name ... conte/ating movie_fb 0 Color James Cameron ... PG-13 33000 1 Color Gore Verbinski ... PG-13 0 2 Color Sam Mendes ... PG-13 85000 3 Color Christopher Nolan ... PG-13 164000 4 NaN Doug Walker ... NaN 0
- To exclude only floating-point columns, do the following:
>>> movies.select_dtypes(exclude="float").head() color director_name ... content_rating movie_fb 0 Color James Ca... ... PG-13 33000 1 Color Gore Ver... ... PG-13 0 2 Color Sam Mendes ... PG-13 85000 3 Color Christop... ... PG-13 164000 4 NaN Doug Walker ... NaN 0
- An alternative method to select columns is with the
.filter
method. This method is flexible and searches column names (or index labels) based on which parameter is used. Here, we use thelike
parameter to search for all the Facebook columns or the names that contain the exact string,fb
. Thelike
parameter is checking for substrings in column names:>>> movies.filter(like="fb").head() director_fb actor_3_fb ... actor_2_fb movie_fb 0 0.0 855.0 ... 936.0 33000 1 563.0 1000.0 ... 5000.0 0 2 0.0 161.0 ... 393.0 85000 3 22000.0 23000.0 ... 23000.0 164000 4 131.0 NaN ... 12.0 0
- The
.filter
method has more tricks (or parameters) up its sleeve. If you use theitems
parameters, you can pass in a list of column names:>>> cols = [ ... "actor_1_name", ... "actor_2_name", ... "actor_3_name", ... "director_name", ... ] >>> movies.filter(items=cols).head() actor_1_name ... director_name 0 CCH Pounder ... James Cameron 1 Johnny Depp ... Gore Verbinski 2 Christoph Waltz ... Sam Mendes 3 Tom Hardy ... Christopher Nolan 4 Doug Walker ... Doug Walker
- The
.filter
method allows columns to be searched with regular expressions using theregex
parameter. Here, we search for all columns that have a digit somewhere in their name:>>> movies.filter(regex=r"\d").head() actor_3_fb actor_2_name ... actor_3_name actor_2_fb 0 855.0 Joel Dav... ... Wes Studi 936.0 1 1000.0 Orlando ... ... Jack Dav... 5000.0 2 161.0 Rory Kin... ... Stephani... 393.0 3 23000.0 Christia... ... Joseph G... 23000.0 4 NaN Rob Walker ... NaN 12.0
How it works...
Step 1 lists the frequencies of all the different data types. Alternatively, you may use the .dtypes
attribute to get the exact data type for each column. The .select_dtypes
method accepts either a list or single data type in its include
or exclude
parameters and returns a DataFrame with columns of just those given data types (or not those types if excluding columns). The list values may be either the string name of the data type or the actual Python object.
The .filter
method selects columns by only inspecting the column names and not the actual data values. It has three mutually exclusive parameters: items
, like
, and regex
, only one of which can be used at a time.
The like
parameter takes a string and attempts to find all the column names that contain that exact string somewhere in the name. To gain more flexibility, you may use the regex
parameter instead to select column names through a regular expression. This particular regular expression, r'\d'
, represents all digits from zero to nine and matches any string with at least a single digit in it.
The filter method comes with another parameter, items
, which takes a list of exact column names. This is nearly an exact duplication of the index operation, except that a KeyError
will not be raised if one of the strings does not match a column name. For instance, movies.filter(items=['actor_1_name', 'asdf'])
runs without error and returns a single column DataFrame.
There's more...
One confusing aspect of .select_dtypes
is its flexibility to take both strings and Python objects. The following list should clarify all the possible ways to select the many different column data types. There is no standard or preferred method of referring to data types in pandas, so it's good to be aware of both ways:
np.number
,'number'
– Selects both integers and floats regardless of sizenp.float64
,np.float_
,float
,'float64'
,'float_'
,'float'
– Selects only 64-bit floatsnp.float16
,np.float32
,np.float128
,'float16'
,'float32'
,'float128'
– Respectively selects exactly 16, 32, and 128-bit floatsnp.floating
,'floating'
– Selects all floats regardless of sizenp.int0
,np.int64
,np.int_
,int
,'int0'
,'int64'
,'int_'
,'int'
– Selects only 64-bit integersnp.int8
,np.int16
,np.int32
,'int8'
,'int16'
,'int32'
– Respectively selects exactly 8, 16, and 32-bit integersnp.integer
,'integer'
– Selects all integers regardless of size'Int64'
– Selects nullable integer; no NumPy equivalentnp.object
,'object'
,'O'
– Select all object data typesnp.datetime64
,'datetime64'
,'datetime'
– All datetimes are 64 bitsnp.timedelta64
,'timedelta64'
,'timedelta'
– All timedeltas are 64 bitspd.Categorical
,'category'
– Unique to pandas; no NumPy equivalent
Because all integers and floats default to 64 bits, you may select them by using the string 'int'
or 'float'
as you can see from the preceding bullet list. If you want to select all integers and floats regardless of their specific size, use the string 'number'
.
Ordering column names
One of the first tasks to consider after initially importing a dataset as a DataFrame is to analyze the order of the columns. As humans we are used to reading languages from left to right, which impacts our interpretations of the data. It's far easier to find and interpret information when column order is given consideration.
There are no standardized set of rules that dictate how columns should be organized within a dataset. However, it is good practice to develop a set of guidelines that you consistently follow. This is especially true if you work with a group of analysts who share lots of datasets.
The following is a guideline to order columns:
- Classify each column as either categorical or continuous
- Group common columns within the categorical and continuous columns
- Place the most important groups of columns first with categorical columns before continuous ones
This recipe shows you how to order the columns with this guideline. There are many possible orderings that are sensible.
How to do it...
- Read in the movie dataset, and scan the data:
>>> movies = pd.read_csv("data/movie.csv") >>> def shorten(col): ... return col.replace("facebook_likes", "fb").replace( ... "_for_reviews", "" ... ) >>> movies = movies.rename(columns=shorten)
- Output all the column names and scan for similar categorical and continuous columns:
>>> movies.columns Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb', 'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio', 'movie_fb'], dtype='object')
- The columns don't appear to have any logical ordering to them. Organize the names sensibly into lists so that the guideline from the previous section is followed:
>>> cat_core = [ ... "movie_title", ... "title_year", ... "content_rating", ... "genres", ... ] >>> cat_people = [ ... "director_name", ... "actor_1_name", ... "actor_2_name", ... "actor_3_name", ... ] >>> cat_other = [ ... "color", ... "country", ... "language", ... "plot_keywords", ... "movie_imdb_link", ... ] >>> cont_fb = [ ... "director_fb", ... "actor_1_fb", ... "actor_2_fb", ... "actor_3_fb", ... "cast_total_fb", ... "movie_fb", ... ] >>> cont_finance = ["budget", "gross"] >>> cont_num_reviews = [ ... "num_voted_users", ... "num_user", ... "num_critic", ... ] >>> cont_other = [ ... "imdb_score", ... "duration", ... "aspect_ratio", ... "facenumber_in_poster", ... ]
- Concatenate all the lists together to get the final column order. Also, ensure that this list contains all the columns from the original:
>>> new_col_order = ( ... cat_core ... + cat_people ... + cat_other ... + cont_fb ... + cont_finance ... + cont_num_reviews ... + cont_other ... ) >>> set(movies.columns) == set(new_col_order) True
- Pass the list with the new column order to the indexing operator of the DataFrame to reorder the columns:
>>> movies[new_col_order].head() movie_title title_year ... aspect_ratio facenumber_in_poster 0 Avatar 2009.0 ... 1.78 0.0 1 Pirates ... 2007.0 ... 2.35 0.0 2 Spectre 2015.0 ... 2.35 1.0 3 The Dark... 2012.0 ... 2.35 0.0 4 Star War... NaN ... NaN 0.0
How it works...
You can select a subset of columns from a DataFrame, with a list of specific column names. For instance, movies[['movie_title', 'director_name']]
creates a new DataFrame with only the movie_title
and director_name
columns. Selecting columns by name is the default behavior of the index operator for a pandas DataFrame.
Step 3 neatly organizes all of the column names into separate lists based on their type (categorical or continuous) and by how similar their data is. The most important columns, such as the title of the movie, are placed first.
Step 4 concatenates all of the lists of column names and validates that this new list contains the same exact values as the original column names. Python sets are unordered and the equality statement checks whether each member of one set is a member of the other. Manually ordering columns in this recipe is susceptible to human error as it's easy to mistakenly forget a column in the new column list.
Step 5 completes the reordering by passing the new column order as a list to the indexing operator. This new order is now much more sensible than the original.
There's more...
There are alternative guidelines for ordering columns besides the suggestion mentioned earlier. Hadley Wickham's seminal paper on Tidy Data suggests placing the fixed variables first, followed by measured variables. As this data does not come from a controlled experiment, there is some flexibility in determining which variables are fixed and which ones are measured. Good candidates for measured variables are those that we would like to predict, such as gross, the budget, or the imdb_score. For instance, in this ordering, we can mix categorical and continuous variables. It might make more sense to place the column for the number of Facebook likes directly after the name of that actor. You can, of course, come up with your own guidelines for column order as the computational parts are unaffected by it.
Summarizing a DataFrame
In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.
In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.
How to do it...
- Read in the movie dataset, and examine the basic descriptive properties,
.shape
,.size
, and.ndim
, along with running thelen
function:>>> movies = pd.read_csv("data/movie.csv") >>> movies.shape (4916, 28) >>> movies.size 137648 >>> movies.ndim 2 >>> len(movies) 4916
- The
.count
method shows the number of non-missing values for each column. It is an aggregation method as it summarizes every column in a single value. The output is a Series that has the original column names as its index:>>> movies.count() color 4897 director_name 4814 num_critic_for_reviews 4867 duration 4901 director_facebook_likes 4814 ... title_year 4810 actor_2_facebook_likes 4903 imdb_score 4916 aspect_ratio 4590 movie_facebook_likes 4916 Length: 28, dtype: int64
- The other methods that compute summary statistics,
.min
,.max
,.mean
,.median
, and.std
, return Series that have the column names of the numeric columns in the index and their aggregations as the values:>>> movies.min() num_critic_for_reviews 1.00 duration 7.00 director_facebook_likes 0.00 actor_3_facebook_likes 0.00 actor_1_facebook_likes 0.00 ... title_year 1916.00 actor_2_facebook_likes 0.00 imdb_score 1.60 aspect_ratio 1.18 movie_facebook_likes 0.00 Length: 16, dtype: float64
- The
.describe
method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistics names as its index. I like to transpose the results using.T
as I can usually fit more information on the screen that way:>>> movies.describe().T count mean ... 75% max num_criti... 4867.0 137.988905 ... 191.00 813.0 duration 4901.0 107.090798 ... 118.00 511.0 director_... 4814.0 691.014541 ... 189.75 23000.0 actor_3_f... 4893.0 631.276313 ... 633.00 23000.0 actor_1_f... 4909.0 6494.488491 ... 11000.00 640000.0 ... ... ... ... ... ... title_year 4810.0 2002.447609 ... 2011.00 2016.0 actor_2_f... 4903.0 1621.923516 ... 912.00 137000.0 imdb_score 4916.0 6.437429 ... 7.20 9.5 aspect_ratio 4590.0 2.222349 ... 2.35 16.0 movie_fac... 4916.0 7348.294142 ... 2000.00 349000.0
- It is possible to specify exact quantiles in the
.describe
method using thepercentiles
parameter:>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T count mean ... 99% max num_criti... 4867.0 137.988905 ... 546.68 813.0 duration 4901.0 107.090798 ... 189.00 511.0 director_... 4814.0 691.014541 ... 16000.00 23000.0 actor_3_f... 4893.0 631.276313 ... 11000.00 23000.0 actor_1_f... 4909.0 6494.488491 ... 44920.00 640000.0 ... ... ... ... ... ... title_year 4810.0 2002.447609 ... 2016.00 2016.0 actor_2_f... 4903.0 1621.923516 ... 17000.00 137000.0 imdb_score 4916.0 6.437429 ... 8.50 9.5 aspect_ratio 4590.0 2.222349 ... 4.00 16.0 movie_fac... 4916.0 7348.294142 ... 93850.00 349000.0
How it works...
Step 1 gives basic information on the size of the dataset. The .shape
attribute returns a tuple with the number of rows and columns. The .size
attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim
attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len
function, it returns the number of rows.
The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.
If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. This method ignores string columns by default.
Note that numeric columns have missing values but have a result returned by .describe
. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna
parameter to False
. This will cause pandas to return NaN
for all these aggregation methods if there exists at least a single missing value.
The .describe
method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles
parameter. See the Developing a data analysis routine recipe for more on the .describe
method.
There's more...
To see how the .skipna
parameter affects the outcome, we can set its value to False
and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:
>>> movies.min(skipna=False)
num_critic_for_reviews NaN
duration NaN
director_facebook_likes NaN
actor_3_facebook_likes NaN
actor_1_facebook_likes NaN
...
title_year NaN
actor_2_facebook_likes NaN
imdb_score 1.6
aspect_ratio NaN
movie_facebook_likes 0.0
Length: 16, dtype: float64
Chaining DataFrame methods
The Chaining Series methods recipe in Chapter 1, Pandas Foundations, showcased several examples of chaining Series methods together. All the method chains in this chapter will begin from a DataFrame. One of the keys to method chaining is to know the exact object being returned during each step of the chain. In pandas, this will nearly always be a DataFrame, Series, or scalar value.
In this recipe, we count all the missing values in each column of the movie dataset.
How to do it...
- We will use the
.isnull
method to get a count of the missing values. This method will change every value to a Boolean, indicating whether it is missing:>>> movies = pd.read_csv("data/movie.csv") >>> def shorten(col): ... return col.replace("facebook_likes", "fb").replace( ... "_for_reviews", "" ... ) >>> movies = movies.rename(columns=shorten) >>> movies.isnull().head() color director_name ... aspect_ratio movie_fb 0 False False ... False False 1 False False ... False False 2 False False ... False False 3 False False ... False False 4 True False ... True False
- We will chain the
.sum
method that interpretsTrue
andFalse
as1
and0
, respectively. Because this is a reduction method, it aggregates the results into a Series:>>> (movies.isnull().sum().head()) color 19 director_name 102 num_critic 49 duration 15 director_fb 102 dtype: int64
- We can go one step further and take the sum of this Series and return the count of the total number of missing values in the entire DataFrame as a scalar value:
>>> movies.isnull().sum().sum() 2654
- A way to determine whether there are any missing values in the DataFrame is to use the
.any
method twice in succession:>>> movies.isnull().any().any() True
How it works...
The .isnull
method returns a DataFrame the same size as the calling DataFrame but with all values transformed to Booleans. See the counts of the following data types to verify this:
>>> movies.isnull().dtypes.value_counts()
bool 28
dtype: int64
In Python, Booleans evaluate to 0 and 1, and this makes it possible to sum them by column, as done in step 2. The resulting Series itself also has a .sum
method, which gets us the grand total of missing values in the DataFrame.
In step 4, the .any
method on a DataFrame returns a Series of Booleans indicating if there exists at least one True
for each column. The .any
method is chained again on this resulting Series of Booleans to determine if any of the columns have missing values. If step 4 evaluates as True
, then there is at least one missing value in the entire DataFrame.
There's more...
Most of the columns in the movie dataset with the object
data type contain missing values. By default, aggregation methods (.min
, .max
, and .sum
), do not return anything for object
columns. as seen in the following code snippet, which selects three object
columns and attempts to find the maximum value of each one:
>>> movies[["color", "movie_title", "color"]].max()
Series([], dtype: float64)
To force pandas to return something for each column, we must fill in the missing values. Here, we choose an empty string:
>>> movies.select_dtypes(["object"]).fillna("").max()
color Color
director_name Étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name Óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
For purposes of readability, method chains are often written as one method call per line surrounded by parentheses. This makes it easier to read and insert comments on what is returned at each step of the chain, or comment out lines to debug what is happening:
>>> (movies.select_dtypes(["object"]).fillna("").max())
color Color
director_name Étienne Faure
actor_2_name Zubaida Sahar
genres Western
actor_1_name Óscar Jaenada
...
plot_keywords zombie|zombie spoof
movie_imdb_link http://www.imdb....
language Zulu
country West Germany
content_rating X
Length: 12, dtype: object
DataFrame operations
A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.
When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5
to each value of the DataFrame raises a TypeError
as integers cannot be added to strings:
>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
...
TypeError: can only concatenate str (not "int") to str
To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with 'UGDS_'
. These columns represent the fraction of undergraduate students by race. To get started, we import the data and use the institution name as the label for our index, and then select the columns we desire with the .filter
method:
>>> colleges = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
This recipe uses multiple operators with a DataFrame to round the undergraduate columns to the nearest hundredth. We will then see how this result is equivalent to the .round
method.
How to do it...
- pandas does bankers rounding, numbers that are exactly halfway between either side to the even side. Look at what happens to the
UGDS_BLACK
row of this series when we round it to two decimal places:>>> name = "Northwest-Shoals Community College" >>> college_ugds.loc[name] UGDS_WHITE 0.7912 UGDS_BLACK 0.1250 UGDS_HISP 0.0339 UGDS_ASIAN 0.0036 UGDS_AIAN 0.0088 UGDS_NHPI 0.0006 UGDS_2MOR 0.0012 UGDS_NRA 0.0033 UGDS_UNKN 0.0324 Name: Northwest-Shoals Community College, dtype: float64 >>> college_ugds.loc[name].round(2) UGDS_WHITE 0.79 UGDS_BLACK 0.12 UGDS_HISP 0.03 UGDS_ASIAN 0.00 UGDS_AIAN 0.01 UGDS_NHPI 0.00 UGDS_2MOR 0.00 UGDS_NRA 0.00 UGDS_UNKN 0.03 Name: Northwest-Shoals Community College, dtype: float64
If we add
.0001
before rounding, it changes to rounding up:>>> (college_ugds.loc[name] + 0.0001).round(2) UGDS_WHITE 0.79 UGDS_BLACK 0.13 UGDS_HISP 0.03 UGDS_ASIAN 0.00 UGDS_AIAN 0.01 UGDS_NHPI 0.00 UGDS_2MOR 0.00 UGDS_NRA 0.00 UGDS_UNKN 0.03 Name: Northwest-Shoals Community College, dtype: float64
- Let's do this to the DataFrame. To begin our rounding adventure with operators, we will first add
.00501
to each value ofcollege_ugds
:>>> college_ugds + 0.00501 UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 0.03831 0.94031 ... 0.01091 0.01881 Universit... 0.59721 0.26501 ... 0.02291 0.01501 Amridge U... 0.30401 0.42421 ... 0.00501 0.27651 Universit... 0.70381 0.13051 ... 0.03821 0.04001 Alabama S... 0.02081 0.92581 ... 0.02931 0.01871 ... ... ... ... ... ... SAE Insti... NaN NaN ... NaN NaN Rasmussen... NaN NaN ... NaN NaN National ... NaN NaN ... NaN NaN Bay Area ... NaN NaN ... NaN NaN Excel Lea... NaN NaN ... NaN NaN
- Use the floor division operator,
//
, to round down to the nearest whole number percentage:>>> (college_ugds + 0.00501) // 0.01 UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 3.0 94.0 ... 1.0 1.0 Universit... 59.0 26.0 ... 2.0 1.0 Amridge U... 30.0 42.0 ... 0.0 27.0 Universit... 70.0 13.0 ... 3.0 4.0 Alabama S... 2.0 92.0 ... 2.0 1.0 ... ... ... ... ... ... SAE Insti... NaN NaN ... NaN NaN Rasmussen... NaN NaN ... NaN NaN National ... NaN NaN ... NaN NaN Bay Area ... NaN NaN ... NaN NaN Excel Lea... NaN NaN ... NaN NaN
- To complete the rounding exercise, divide by
100
:>>> college_ugds_op_round =( ... (college_ugds + 0.00501) // 0.01 / 100 ... ) >>> college_ugds_op_round.head() UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 0.03 0.94 ... 0.01 0.01 Universit... 0.59 0.26 ... 0.02 0.01 Amridge U... 0.30 0.42 ... 0.00 0.27 Universit... 0.70 0.13 ... 0.03 0.04 Alabama S... 0.02 0.92 ... 0.02 0.01
- Now use the round DataFrame method to do the rounding automatically for us. Due to bankers rounding, we add a small fraction before rounding:
>>> college_ugds_round = (college_ugds + 0.00001).round(2) >>> college_ugds_round UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 0.03 0.94 ... 0.01 0.01 Universit... 0.59 0.26 ... 0.02 0.01 Amridge U... 0.30 0.42 ... 0.00 0.27 Universit... 0.70 0.13 ... 0.03 0.04 Alabama S... 0.02 0.92 ... 0.02 0.01 ... ... ... ... ... .... SAE Insti... NaN NaN ... NaN NaN Rasmussen... NaN NaN ... NaN NaN National ... NaN NaN ... NaN NaN Bay Area ... NaN NaN ... NaN NaN Excel Lea... NaN NaN ... NaN NaN
- Use the equals DataFrame method to test the equality of two DataFrames:
>>> college_ugds_op_round.equals(college_ugds_round) True
How it works...
Steps 1 and 2 use the plus operator, which attempts to add a scalar value to each value of each column of the DataFrame. As the columns are all numeric, this operation works as expected. There are some missing values in each of the columns but they stay missing after the operation.
Mathematically, adding .005
should be enough so that the floor division in the next step correctly rounds to the nearest whole percentage. The trouble appears because of the inexactness of floating-point numbers:
>>> 0.045 + 0.005
0.049999999999999996
There is an extra .00001
added to each number to ensure that the floating-point representation has the first four digits the same as the actual value. This works because the maximum precision of all the points in the dataset is four decimal places.
Step 3 applies the floor division operator, //
, to all the values in the DataFrame. As we are dividing by a fraction, in essence, it is multiplying each value by 100
and truncating any decimals. Parentheses are needed around the first part of the expression, as floor division has higher precedence than addition. Step 4 uses the division operator to return the decimal to the correct position.
In step 5, we reproduce the previous steps with the round method. Before we can do this, we must again add an extra .00001
to each DataFrame value for a different reason from step 2. NumPy and Python 3 round numbers that are exactly halfway between either side to the even number. The bankers rounding (or ties to even http://bit.ly/2x3V5TU) technique is not usually what is formally taught in schools. It does not consistently bias numbers to the higher side (http://bit.ly/2zhsPy8).
It is necessary here to round up so that both DataFrame values are equal. The .equals
method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.
There's more...
Just as with Series, DataFrames have method equivalents of the operators. You may replace the operators with their method equivalents:
>>> college2 = (
... college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True
Comparing missing values
pandas uses the NumPy NaN (np.nan
) object to represent a missing value. This is an unusual object and has interesting mathematical properties. For instance, it is not equal to itself. Even Python's None
object evaluates as True
when compared to itself:
>>> np.nan == np.nan
False
>>> None == None
True
All other comparisons against np.nan
also return False
, except not equal to (!=
):
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
Getting ready
Series and DataFrames use the equals operator, ==
, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals
method.
As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:
>>> college = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
How to do it...
- To get an idea of how the equals operator works, let's compare each element to a scalar value:
>>> college_ugds == 0.0019 UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... False False ... False False Universit... False False ... False False Amridge U... False False ... False False Universit... False False ... False False Alabama S... False False ... False False ... ... ... ... ... ... SAE Insti... False False ... False False Rasmussen... False False ... False False National ... False False ... False False Bay Area ... False False ... False False Excel Lea... False False ... False False
- This works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values. You may be tempted to use the equals operator to compare two DataFrames with one another on an element-by-element basis. Take, for instance,
college_ugds
compared against itself, as follows:>>> college_self_compare = college_ugds == college_ugds >>> college_self_compare.head() UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... True True ... True True Universit... True True ... True True Amridge U... True True ... True True Universit... True True ... True True Alabama S... True True ... True True
- At first glance, all the values appear to be equal, as you would expect. However, using the
.all
method to determine if each column contains onlyTrue
values yields an unexpected result:>>> college_self_compare.all() UGDS_WHITE False UGDS_BLACK False UGDS_HISP False UGDS_ASIAN False UGDS_AIAN False UGDS_NHPI False UGDS_2MOR False UGDS_NRA False UGDS_UNKN False dtype: bool
- This happens because missing values do not compare equally with one another. If you tried to count missing values using the equal operator and summing up the Boolean columns, you would get zero for each one:
>>> (college_ugds == np.nan).sum() UGDS_WHITE 0 UGDS_BLACK 0 UGDS_HISP 0 UGDS_ASIAN 0 UGDS_AIAN 0 UGDS_NHPI 0 UGDS_2MOR 0 UGDS_NRA 0 UGDS_UNKN 0 dtype: int64
- Instead of using
==
to find missing numbers, use the.isna
method:>>> college_ugds.isna().sum() UGDS_WHITE 661 UGDS_BLACK 661 UGDS_HISP 661 UGDS_ASIAN 661 UGDS_AIAN 661 UGDS_NHPI 661 UGDS_2MOR 661 UGDS_NRA 661 UGDS_UNKN 661 dtype: int64
- The correct way to compare two entire DataFrames with one another is not with the equals operator (
==
) but with the.equals
method. This method treats NaNs that are in the same location as equal (note that the.eq
method is the equivalent of==
):>>> college_ugds.equals(college_ugds) True
How it works...
Step 1 compares a DataFrame to a scalar value while step 2 compares a DataFrame with another DataFrame. Both operations appear to be quite simple and intuitive at first glance. The second operation is checking whether the DataFrames have identically labeled indexes and thus the same number of elements. The operation will fail if this isn't the case.
Step 3 verifies that none of the columns in the DataFrames are equivalent to each other. Step 4 further shows the non-equivalence of np.nan
and itself. Step 5 verifies that there are indeed missing values in the DataFrame. Finally, step 6 shows the correct way to compare DataFrames with the .equals
method, which always returns a Boolean scalar value.
There's more...
All the comparison operators have method counterparts that allow for more functionality. Somewhat confusingly, the .eq
DataFrame method does element-by-element comparison, just like the equals (==
) operator. The .eq
method is not at all the same as the .equals
method. The following code duplicates step 1:
>>> college_ugds.eq(0.0019) # same as college_ugds == .0019
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... False False ... False False
Universit... False False ... False False
Amridge U... False False ... False False
Universit... False False ... False False
Alabama S... False False ... False False
... ... ... ... ... ...
SAE Insti... False False ... False False
Rasmussen... False False ... False False
National ... False False ... False False
Bay Area ... False False ... False False
Excel Lea... False False ... False False
Inside the pandas.testing
sub-package, a function exists that developers should use when creating unit tests. The assert_frame_equal
function raises an AssertionError
if two DataFrames are not equal. It returns None
if the two DataFrames are equal:
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(college_ugds, college_ugds) is None
True
Unit tests are a very important part of software development and ensure that the code is running correctly. pandas contains many thousands of unit tests that help ensure that it is running properly. To read more on how pandas runs its unit tests, see the Contributing to pandas section in the documentation (http://bit.ly/2vmCSU6).
Transposing the direction of a DataFrame operation
Many DataFrame methods have an axis
parameter. This parameter controls the direction in which the operation takes place. Axis parameters can be 'index'
(or 0
) or 'columns'
(or 1
). I prefer the string versions are they are more explicit and tend to make the code easier to read.
Nearly all DataFrame methods default the axis parameter to 0
, which applies to operations along the index. This recipe shows you how to invoke the same method along both axes.
How to do it...
- Read in the college dataset; the columns that begin with UGDS represent the percentage of the undergraduate students of a particular race. Use the filter method to select these columns:
>>> college = pd.read_csv( ... "data/college.csv", index_col="INSTNM" ... ) >>> college_ugds = college.filter(like="UGDS_") >>> college_ugds.head() UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 0.0333 0.9353 ... 0.0059 0.0138 Universit... 0.5922 0.2600 ... 0.0179 0.0100 Amridge U... 0.2990 0.4192 ... 0.0000 0.2715 Universit... 0.6988 0.1255 ... 0.0332 0.0350 Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
- Now that the DataFrame contains homogenous column data, operations can be sensibly done both vertically and horizontally. The
.count
method returns the number of non-missing values. By default, itsaxis
parameter is set to0
:>>> college_ugds.count() UGDS_WHITE 6874 UGDS_BLACK 6874 UGDS_HISP 6874 UGDS_ASIAN 6874 UGDS_AIAN 6874 UGDS_NHPI 6874 UGDS_2MOR 6874 UGDS_NRA 6874 UGDS_UNKN 6874 dtype: int64
The
axis
parameter is almost always set to0
. So, step 2 is equivalent to bothcollege_ugds.count(axis=0)
andcollege_ugds.count(axis='index')
. - Changing the axis parameter to
'columns'
changes the direction of the operation so that we get back a count of non-missing items in each row:>>> college_ugds.count(axis="columns").head() INSTNM Alabama A & M University 9 University of Alabama at Birmingham 9 Amridge University 9 University of Alabama in Huntsville 9 Alabama State University 9 dtype: int64
- Instead of counting non-missing values, we can sum all the values in each row. Each row of percentages should add up to 1. The
.sum
method may be used to verify this:>>> college_ugds.sum(axis="columns").head() INSTNM Alabama A & M University 1.0000 University of Alabama at Birmingham 0.9999 Amridge University 1.0000 University of Alabama in Huntsville 1.0000 Alabama State University 1.0000 dtype: float64
- To get an idea of the distribution of each column, the
.median
method can be used:>>> college_ugds.median(axis="index") UGDS_WHITE 0.55570 UGDS_BLACK 0.10005 UGDS_HISP 0.07140 UGDS_ASIAN 0.01290 UGDS_AIAN 0.00260 UGDS_NHPI 0.00000 UGDS_2MOR 0.01750 UGDS_NRA 0.00000 UGDS_UNKN 0.01430 dtype: float64
How it works...
The direction of operation on the axis is one of the more confusing aspects of pandas. Many pandas users have difficulty remembering the meaning of the axis parameter. I remember them by reminding myself that a Series only has one axis, the index (or 0). A DataFrame also has an index (axis 0) and columns (axis 1).
There's more...
The .cumsum
method with axis=1
accumulates the race percentages across each row. It gives a slightly different view of the data. For example, it is very easy to see the exact percentage of white and black students for each school:
>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9686 ... 0.9862 1.0000
Universit... 0.5922 0.8522 ... 0.9899 0.9999
Amridge U... 0.2990 0.7182 ... 0.7285 1.0000
Universit... 0.6988 0.8243 ... 0.9650 1.0000
Alabama S... 0.0158 0.9366 ... 0.9863 1.0000
Determining college campus diversity
Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their top 10 diverse colleges with Diversity Index are given as follows:
>>> pd.read_csv(
... "data/college_diversity.csv", index_col="School"
... )
Diversity Index
School
Rutgers University--Newark Newark, NJ 0.76
Andrews University Berrien Springs, MI 0.74
Stanford University Stanford, CA 0.74
University of Houston Houston, TX 0.74
University of Nevada--Las Vegas Las Vegas, NV 0.74
University of San Francisco San Francisco, CA 0.74
San Francisco State University San Francisco, CA 0.73
University of Illinois--Chicago Chicago, IL 0.73
New Jersey Institute of Technology Newark, NJ 0.72
Texas Woman's University Denton, TX 0.72
Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the count of the number of races having greater than 15% of the student population.
How to do it...
- Read in the college dataset, and filter for just the undergraduate race columns:
>>> college = pd.read_csv( ... "data/college.csv", index_col="INSTNM" ... ) >>> college_ugds = college.filter(like="UGDS_")
- Many of these colleges have missing values for all their race columns. We can count all the missing values for each row and sort the resulting Series from the highest to lowest. This will reveal the colleges that have missing values:
>>> ( ... college_ugds.isnull() ... .sum(axis="columns") ... .sort_values(ascending=False) ... .head() ... ) INSTNM Excel Learning Center-San Antonio South 9 Philadelphia College of Osteopathic Medicine 9 Assemblies of God Theological Seminary 9 Episcopal Divinity School 9 Phillips Graduate Institute 9 dtype: int64
- Now that we have seen the colleges that are missing all their race columns, we can use the
.dropna
method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:>>> college_ugds = college_ugds.dropna(how="all") >>>; college_ugds.isnull().sum() UGDS_WHITE 0 UGDS_BLACK 0 UGDS_HISP 0 UGDS_ASIAN 0 UGDS_AIAN 0 UGDS_NHPI 0 UGDS_2MOR 0 UGDS_NRA 0 UGDS_UNKN 0 dtype: int64
- There are no missing values left in the dataset. We can now calculate our diversity metric. To get started, we will use the greater than or equal DataFrame method,
.ge
, to return a DataFrame with a Boolean value for each cell:>>> college_ugds.ge(0.15) UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... False True ... False False Universit... True True ... False False Amridge U... True True ... False True Universit... True False ... False False Alabama S... False True ... False False ... ... ... ... ... ... Hollywood... True True ... False False Hollywood... False True ... False False Coachella... True False ... False False Dewey Uni... False False ... False False Coastal P... True True ... False False
- From here, we can use the
.sum
method to count theTrue
values for each college. Notice that a Series is returned:>>> diversity_metric = college_ugds.ge(0.15).sum( ... axis="columns" ... ) >>> diversity_metric.head() INSTNM Alabama A & M University 1 University of Alabama at Birmingham 2 Amridge University 3 University of Alabama in Huntsville 1 Alabama State University 1 dtype: int64
- To get an idea of the distribution, we will use the
.value_counts
method on this Series:>>> diversity_metric.value_counts() 1 3042 2 2884 3 876 4 63 0 7 5 2 dtype: int64
- Amazingly, two schools have more than 15% in five different race categories. Let's sort the
diversity_metric
Series to find out which ones they are:>>> diversity_metric.sort_values(ascending=False).head() INSTNM Regency Beauty Institute-Austin 5 Central Texas Beauty College-Temple 5 Sullivan and Cogliano Training Center 4 Ambria College of Nursing 4 Berkeley College-New York 4 dtype: int64
- It seems a little suspicious that schools can be that diverse. Let's look at the raw percentages from these top two schools. We will use
.loc
to select rows based on the index label:>>> college_ugds.loc[ ... [ ... "Regency Beauty Institute-Austin", ... "Central Texas Beauty College-Temple", ... ] ... ] UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Regency B... 0.1867 0.2133 ... 0.0 0.2667 Central T... 0.1616 0.2323 ... 0.0 0.1515
- It appears that several categories were aggregated into the unknown and two or more races column. Regardless of this, they both appear to be quite diverse. We can see how the top five US News schools fared with this basic diversity metric:
>>> us_news_top = [ ... "Rutgers University-Newark", ... "Andrews University", ... "Stanford University", ... "University of Houston", ... "University of Nevada-Las Vegas", ... ] >>> diversity_metric.loc[us_news_top] INSTNM Rutgers University-Newark 4 Andrews University 3 Stanford University 3 University of Houston 3 University of Nevada-Las Vegas 3 dtype: int64
How it works...
Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna
method in step 3 has the how
parameter, which defaults to the string 'any'
, but may also be changed to 'all'
. When set to 'any'
, it drops rows that contain one or more missing values. When set to 'all'
, it only drops rows where all values are missing.
In this case, we conservatively drop rows that are missing all values. This is because it's possible that some missing values represent 0 percent. This did not happen to be the case here, as there were no missing values after the dropna
method was performed. If there were still missing values, we could have run the .fillna(0)
method to fill all the remaining values with 0.
Step 5 begins our diversity metric calculation using the greater than or equal to method, .ge
. This results in a DataFrame of all Booleans, which is summed horizontally by setting axis='columns'
.
The .value_counts
method is used in step 6 to produce a distribution of our diversity metric. It is quite rare for schools to have three races with 15% or more of the undergraduate student population. Step 7 and step 8 find two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two or more categories.
Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.
There's more...
Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:
>>> (
... college_ugds.max(axis=1)
... .sort_values(ascending=False)
... .head(10)
... )
INSTNM
Dewey University-Manati 1.0
Yeshiva and Kollel Harbotzas Torah 1.0
Mr Leon's School of Hair Design-Lewiston 1.0
Dewey University-Bayamon 1.0
Shepherds Theological Seminary 1.0
Yeshiva Gedolah Kesser Torah 1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias 1.0
Yeshiva Shaar Hatorah 1.0
Bais Medrash Elyon 1.0
Yeshiva of Nitra Rabbinical College 1.0
dtype: float64
We can also determine if any school has all nine race categories exceeding 1%:
>>> (college_ugds > 0.01).all(axis=1).any()
True