Introduction
This chapter covers many fundamental operations of the DataFrame. Many of the recipes will be similar to those in Chapter 1, Pandas Foundations, which primarily covered operations on a Series.
Selecting multiple DataFrame columns
We can select a single column by passing the column name to the index operator of a DataFrame. This was covered in the Selecting a column recipe in Chapter 1, Pandas Foundations. It is often necessary to focus on a subset of the current working dataset, which is accomplished by selecting multiple columns.
In this recipe, all the actor and director columns will be selected from the movie dataset.
How to do it...
- Read in the movie dataset, and pass in a list of the desired columns to the indexing operator:
>>> import pandas as pd
>>> import numpy as np
>>> movies = pd.read_csv("data/movie.csv")
>>> movie_actor_director = movies[
...     [
...         "actor_1_name",
...         "actor_2_name",
...         "actor_3_name",
...         "director_name",
...     ]
... ]
>>> movie_actor_director.head()
  actor_1_name actor_2_name...
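The key detail in the step above is that a *list* of names goes inside the indexing operator. A minimal sketch with a toy two-row frame (the data is made up for illustration, not from the movie dataset) contrasts this with single-string selection:

```python
import pandas as pd

# Toy DataFrame standing in for the movie dataset
df = pd.DataFrame(
    {
        "actor_1_name": ["CCH Pounder", "Johnny Depp"],
        "director_name": ["James Cameron", "Gore Verbinski"],
        "duration": [178, 169],
    }
)

# A list of column names returns a DataFrame
subset = df[["actor_1_name", "director_name"]]
print(type(subset).__name__)  # DataFrame

# A single string returns a Series instead
single = df["actor_1_name"]
print(type(single).__name__)  # Series
```

Forgetting the inner brackets (`df["a", "b"]`) is a common mistake; pandas interprets that as a single tuple-valued label and raises a `KeyError`.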
Selecting columns with methods
Although column selection is usually done with the indexing operator, some DataFrame methods facilitate selecting columns in an alternative manner. The .select_dtypes and .filter methods are two useful tools for this.
If you want to select by type, you need to be familiar with pandas data types. The Understanding data types recipe in Chapter 1, Pandas Foundations, explains the types.
How to do it...
- Read in the movie dataset. Shorten the column names for display. Use the .dtypes attribute together with the .value_counts method to output the number of columns with each specific data type:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return (
...         str(col)
...         .replace("facebook_likes", "fb")
...         .replace("_for_reviews", "")
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies.dtypes.value_counts...
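The two selection methods can be sketched on a toy frame (the four columns here are illustrative, not the real movie columns). .select_dtypes filters by column *type*, while .filter matches on column *labels*:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "title": ["Avatar", "Spectre"],   # object dtype
        "year": [2009, 2015],             # integer dtype
        "gross_fb": [33000, 85000],       # integer dtype
        "score": [7.9, 6.8],              # float dtype
    }
)

# Keep only numeric columns (ints and floats)
nums = df.select_dtypes(include="number")
print(list(nums.columns))     # ['year', 'gross_fb', 'score']

# .filter matches substrings of the column labels, not the values
fb_cols = df.filter(like="fb")
print(list(fb_cols.columns))  # ['gross_fb']
```

Note that .filter never inspects the data itself, only the index labels, so it works the same on frames of any dtype.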
Ordering column names
One of the first tasks to consider after initially importing a dataset as a DataFrame is to analyze the order of the columns. As humans we are used to reading languages from left to right, which impacts our interpretations of the data. It's far easier to find and interpret information when column order is given consideration.
There is no standardized set of rules that dictates how columns should be organized within a dataset. However, it is good practice to develop a set of guidelines that you consistently follow. This is especially true if you work with a group of analysts who share lots of datasets.
The following is a guideline to order columns:
- Classify each column as either categorical or continuous
- Group common columns within the categorical and continuous columns
- Place the most important groups of columns first with categorical columns before continuous ones
This recipe shows you how to order the columns...
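The guideline above can be sketched in a few lines on a toy frame (the column names are made up for illustration): build an explicit list with the categorical group first, then pass the whole list back to the indexing operator to reorder:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "gross": [760.5],
        "title": ["Avatar"],
        "year": [2009],
        "director": ["James Cameron"],
    }
)

# Classify and group: categorical/identifying columns first,
# continuous (numeric measurement) columns after
cat_cols = ["title", "director", "year"]
cont_cols = ["gross"]
new_order = cat_cols + cont_cols

# Passing the full list to the indexing operator reorders the columns
df = df[new_order]
print(list(df.columns))  # ['title', 'director', 'year', 'gross']
```

Because the list is explicit, this also serves as lightweight documentation of which columns you consider most important.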
Summarizing a DataFrame
In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.
In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.
How to do it...
- Read in the movie dataset, and examine the basic descriptive properties, .shape, .size, and .ndim, along with running the len function:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
>>> len(movies)
4916...
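The relationships between these four properties hold for any DataFrame, which a toy frame makes easy to verify:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)  # (3, 2) -> a (rows, columns) tuple
print(df.size)   # 6      -> rows * columns, the total number of cells
print(df.ndim)   # 2      -> always 2 for a DataFrame (1 for a Series)
print(len(df))   # 3      -> the number of rows only
```

In the movie output above you can check the same identity: 4916 rows times 28 columns gives the .size of 137648.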
Chaining DataFrame methods
The Chaining Series methods recipe in Chapter 1, Pandas Foundations, showcased several examples of chaining Series methods together. All the method chains in this chapter will begin from a DataFrame. One of the keys to method chaining is to know the exact object being returned during each step of the chain. In pandas, this will nearly always be a DataFrame, Series, or scalar value.
In this recipe, we count all the missing values in each column of the movie dataset.
How to do it...
- We will use the .isnull method to get a count of the missing values. This method changes every value to a Boolean, indicating whether it is missing:
>>> movies = pd.read_csv("data/movie.csv")
>>> def shorten(col):
...     return col.replace("facebook_likes", "fb").replace(
...         "_for_reviews", ""
...     )
>>> movies = movies.rename(columns=shorten)
>>> movies...
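The chain itself is short, and tracking the returned object at each step is the whole game. A minimal sketch on a toy frame (the two-row data is illustrative) shows each intermediate type:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "director": ["James Cameron", None],
        "score": [7.9, np.nan],
        "year": [2009, 2015],
    }
)

# Each step returns a new object:
# DataFrame -> .isnull() -> DataFrame of Booleans
# -> .sum() -> Series of per-column counts
# -> .sum() -> a single scalar
missing_per_col = df.isnull().sum()
total_missing = df.isnull().sum().sum()

print(int(missing_per_col["director"]))  # 1
print(int(total_missing))                # 2
```

Because True counts as 1 when summed, .isnull().sum() is the idiomatic pandas way to count missing values.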
DataFrame operations
A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.
When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5 to each value of the DataFrame raises a TypeError, as integers cannot be added to strings:
>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
...
TypeError: can only concatenate str (not...
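A toy frame (two made-up columns standing in for the college data) reproduces the failure and shows the usual workaround of restricting the operation to numeric columns first:

```python
import pandas as pd

# Mixed dtypes: one object (string) column, one integer column
df = pd.DataFrame({"name": ["A", "B"], "ugds": [1000, 2500]})

# Adding 5 to the whole frame fails: int + str is undefined
try:
    df + 5
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# Restricting to numeric columns first succeeds
bumped = df.select_dtypes(include="number") + 5
print(bumped["ugds"].tolist())  # [1005, 2505]
```

Selecting a homogeneous subset before applying the operator is usually cleaner than catching the exception.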
Comparing missing values
pandas uses the NumPy NaN (np.nan) object to represent a missing value. This is an unusual object with interesting mathematical properties. For instance, it is not equal to itself. In contrast, Python's None object evaluates as True when compared to itself:
>>> np.nan == np.nan
False
>>> None == None
True
All other comparisons against np.nan also return False, except not equal to (!=):
>>> np.nan > 5
False
>>> 5 > np.nan
False
>>> np.nan != 5
True
Getting ready
Series and DataFrames use the equals operator, ==, to make element-by-element comparisons. The result is an object with the same dimensions. This recipe shows you how to use the equals operator, which is very different from the .equals method.
As in the previous recipe, the columns representing the fraction of each race of undergraduate students from the college dataset will be used:
>>> college = pd.read_csv(...
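The difference between == and .equals comes down to missing values. A toy pair of identical frames (made up for illustration, not the college data) makes the contrast concrete:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan]})
df2 = pd.DataFrame({"a": [1.0, np.nan]})

# Element-wise ==: NaN never equals NaN, so that row compares False
# even though both frames hold identical data
print((df1 == df2)["a"].tolist())  # [True, False]

# .equals treats NaNs in the same location as equal,
# and returns a single Boolean for the whole frame
print(df1.equals(df2))             # True
```

So .equals is the right tool for asserting that two whole objects match, while == is for building element-wise Boolean masks.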
Transposing the direction of a DataFrame operation
Many DataFrame methods have an axis parameter. This parameter controls the direction in which the operation takes place. The axis parameter can be 'index' (or 0) or 'columns' (or 1). I prefer the string versions as they are more explicit and tend to make the code easier to read.
Nearly all DataFrame methods default the axis parameter to 0
, which applies to operations along the index. This recipe shows you how to invoke the same method along both axes.
How to do it...
- Read in the college dataset; the columns that begin with UGDS represent the percentage of the undergraduate students of a particular race. Use the filter method to select these columns:
>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
>>> college_ugds.head()
           UGDS_WHITE  UGDS_BLACK...
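The effect of the axis parameter is easiest to see on a tiny frame (two made-up schools with integer counts, illustrative only). The same method call produces one value per column or one value per row depending on the axis:

```python
import pandas as pd

df = pd.DataFrame(
    {"UGDS_WHITE": [600, 200], "UGDS_BLACK": [200, 500]},
    index=["School A", "School B"],
)

# Default axis='index' (0): operate DOWN each column -> one value per column
per_col = df.sum(axis="index")
print(per_col.tolist())  # [800, 700]

# axis='columns' (1): operate ACROSS each row -> one value per row
per_row = df.sum(axis="columns")
print(per_row.tolist())  # [800, 700]
```

A useful mnemonic: the axis you name is the one that gets *collapsed*, so axis='columns' collapses the columns and leaves the row index.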
Determining college campus diversity
Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their top 10 diverse colleges with Diversity Index are given as follows:
>>> pd.read_csv(
... "data/college_diversity.csv", index_col="School"
... )
Diversity Index
School
Rutgers University--Newark Newark, NJ 0.76
Andrews University Berrien Springs, MI 0.74
Stanford University Stanford, CA 0.74
University of Houston Houston, TX 0.74
University of Nevada--Las Vegas Las Vegas, NV 0.74
University of San Francisco San Francisco, CA ...
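One simple way to quantify diversity (a sketch of the idea, not US News's actual methodology) is to count, for each school, how many race columns reach some minimum share of the undergraduate population, say 15%. On a toy frame with made-up fractions:

```python
import pandas as pd

college_ugds = pd.DataFrame(
    {
        "UGDS_WHITE": [0.30, 0.90],
        "UGDS_BLACK": [0.30, 0.05],
        "UGDS_HISP": [0.40, 0.05],
    },
    index=["Diverse U", "Homogeneous U"],
)

# Boolean frame: does each race reach a 15% share?
# Summing across the columns counts the races per school
diversity = (college_ugds >= 0.15).sum(axis="columns")
print(diversity.tolist())  # [3, 1]
```

The 15% threshold is an arbitrary choice for illustration; the point is that a comparison operator plus an axis='columns' reduction turns the race-fraction columns into a single per-school score.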