Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Pandas 1.x Cookbook - Second Edition

You're reading from  Pandas 1.x Cookbook - Second Edition

Product type Book
Published in Feb 2020
Publisher Packt
ISBN-13 9781839213106
Pages 626 pages
Edition 2nd Edition
Languages
Authors (2):
Matt Harrison Matt Harrison
Profile icon Matt Harrison
Theodore Petrou Theodore Petrou
Profile icon Theodore Petrou
View More author details

Table of Contents (17) Chapters

Preface Pandas Foundations Essential DataFrame Operations Creating and Persisting DataFrames Beginning Data Analysis Exploratory Data Analysis Selecting Subsets of Data Filtering Rows Index Alignment Grouping for Aggregation, Filtration, and Transformation Restructuring Data into a Tidy Form Combining Pandas Objects Time Series Analysis Visualization with Matplotlib, Pandas, and Seaborn Debugging and Testing Pandas Other Books You May Enjoy
Index

Introduction

Filtering data from a dataset is one of the most common and basic operations. There are numerous ways to filter (or subset) data in pandas with Boolean indexing. Boolean indexing (also known as Boolean selection) can be a confusing term, but in pandas-land, it refers to selecting rows by providing a Boolean array, a pandas Series with the same index, but a True or False for each row. The name comes from NumPy, where similar filtering logic works, so while it is really a Series with Boolean values in it, it is also referred to as a Boolean array.

We will begin by creating Boolean Series and calculating statistics on them and then move on to creating more complex conditionals before using Boolean indexing in a wide variety of ways to filter data.

Calculating Boolean statistics

It can be informative to calculate basic summary statistics on Boolean arrays. Each value of a Boolean array, the True or False, evaluates to 1 or 0 respectively, so all the Series methods that work with numerical values also work with Booleans.

In this recipe, we create a Boolean array by applying a condition to a column of data and then calculate summary statistics from it.

How to do it…

  1. Read in the movie dataset, set the index to the movie title, and inspect the first few rows of the duration column:
    >>> import pandas as pd
    >>> import numpy as np
    >>> movie = pd.read_csv(
    ...     "data/movie.csv", index_col="movie_title"
    ... )
    >>> movie[["duration"]].head()
                                                Duration
    movie_title
    Avatar                                         178.0
    Pirates of the Caribbean: At World's End       169.0
    Spectre       ...

Constructing multiple Boolean conditions

In Python, Boolean expressions use the built-in logical operators and, or, and not. These keywords do not work with Boolean indexing in pandas and are respectively replaced with &, |, and ~. Additionally, when combining expressions, each expression must be wrapped in parentheses, or an error will be raised (due to operator precedence).

Constructing a filter for your dataset might require combining multiple Boolean expressions together to pull out the rows you need. In this recipe, we construct multiple Boolean expressions before combining them to find all the movies that have an imdb_score greater than 8, a content_rating of PG-13, and a title_year either before 2000 or after 2009.

How to do it…

  1. Load in the movie dataset and set the title as the index:
    >>> movie = pd.read_csv(
    ...     "data/movie.csv", index_col="movie_title"
    ... )
    
  2. Create a variable to hold each...

Filtering with Boolean arrays

Both Series and DataFrame can be filtered with Boolean arrays. You can index this directly off of the object or off of the .loc attribute.

This recipe constructs two complex filters for different rows of movies. The first filters movies with an imdb_score greater than 8, a content_rating of PG-13, and a title_year either before 2000 or after 2009. The second filter consists of those with an imdb_score less than 5, a content_rating of R, and a title_year between 2000 and 2010. Finally, we will combine these filters.

How to do it…

  1. Read in the movie dataset, set the index to movie_title, and create the first set of criteria:
    >>> movie = pd.read_csv(
    ...     "data/movie.csv", index_col="movie_title"
    ... )
    >>> crit_a1 = movie.imdb_score > 8
    >>> crit_a2 = movie.content_rating == "PG-13"
    >>> crit_a3 = (movie.title_year < 2000) | (
    ...     movie.title_year...

Comparing row filtering and index filtering

It is possible to replicate specific cases of Boolean selection by taking advantage of the index.

In this recipe, we use the college dataset to select all institutions from a particular state with both Boolean indexing and index selection and then compare each of their performances against one another.

Personally, I prefer to filter by columns (using Boolean arrays) rather than on the index. Column filtering is more powerful as you can use other logical operators and filter on multiple columns.

How to do it…

  1. Read in the college dataset and use Boolean indexing to select all institutions from the state of Texas (TX):
    >>> college = pd.read_csv("data/college.csv")
    >>> college[college["STABBR"] == "TX"].head()
                                INSTNM  ...        GRAD_/_SUPP
    3610  Abilene Christian University  ...              25985
    3611       Alvin Community College...

Selecting with unique and sorted indexes

Index selection performance drastically improves when the index is unique or sorted. The prior recipe used an unsorted index that contained duplicates, which makes for relatively slow selections.

In this recipe, we use the college dataset to form unique or sorted indexes to increase the performance of index selection. We will continue to compare the performance to Boolean indexing as well.

If you are only selecting from a single column and that is a bottleneck for you, this recipe can save you ten times the effort

How to do it…

  1. Read in the college dataset, create a separate DataFrame with STABBR as the index, and check whether the index is sorted:
    >>> college = pd.read_csv("data/college.csv")
    >>> college2 = college.set_index("STABBR")
    >>> college2.index.is_monotonic
    False
    
  2. Sort the index from college2 and store it as another object: ...

Translating SQL WHERE clauses

Many pandas users will have experience of interacting with a database using Structured Query Language (SQL). SQL is a standard to define, manipulate, and control data stored in a database

SQL is an important language for data scientists to know. Much of the world's data is stored in databases that require SQL to retrieve and manipulate it SQL syntax is fairly simple and easy to learn. There are many different SQL implementations from companies such as Oracle, Microsoft, IBM, and more.

Within a SQL SELECT statement, the WHERE clause is very common and filters data. This recipe will write pandas code that is equivalent to a SQL query that selects a certain subset of the employee dataset.

Suppose we are given a task to find all the female employees who work in the police or fire departments who have a base salary of between 80 and 120 thousand dollars.

The following SQL statement would answer this query for us:

SELECT
    UNIQUE_ID...

Improving the readability of Boolean indexing with the query method

Boolean indexing is not necessarily the most pleasant syntax to read or write, especially when using a single line to write a complex filter. pandas has an alternative string-based syntax through the DataFrame query method that can provide more clarity.

This recipe replicates the earlier recipe in this chapter, Translating SQL WHERE clauses, but instead takes advantage of the .query method of the DataFrame. The goal here is to filter the employee data for female employees from the police or fire departments who earn a salary of between 80 and 120 thousand dollars.

How to do it…

  1. Read in the employee data, assign the chosen departments, and import columns to variables:
    >>> employee = pd.read_csv("data/employee.csv")
    >>> depts = [
    ...     "Houston Police Department-HPD",
    ...     "Houston Fire Department (HFD)",
    ... ]
    >>> select_columns...

Preserving Series size with the .where method

When you filter with Boolean arrays, the resulting Series or DataFrame is typically smaller. The .where method preserves the size of your Series or DataFrame and either sets the values that don't meet the criteria to missing or replaces them with something else. Instead of dropping all these values, it is possible to keep them.

When you combine this functionality with the other parameter, you can create functionality similar to coalesce found in databases.

In this recipe, we pass the .where method Boolean conditions to put a floor and ceiling on the minimum and maximum number of Facebook likes for actor 1 in the movie dataset.

How to do it…

  1. Read the movie dataset, set the movie title as the index, and select all the values in the actor_1_facebook_likes column that are not missing:
    >>> movie = pd.read_csv(
    ...     "data/movie.csv", index_col="movie_title"
    ... ...

Masking DataFrame rows

The .mask method performs the complement of the .where method. By default, it creates missing values wherever the Boolean condition is True. In essence, it is literally masking, or covering up, values in your dataset.

In this recipe, we will mask all rows of the movie dataset that were made after 2010 and then filter all the rows with missing values.

How to do it…

  1. Read the movie dataset, set the movie title as the index, and create the criteria:
    >>> movie = pd.read_csv(
    ...     "data/movie.csv", index_col="movie_title"
    ... )
    >>> c1 = movie["title_year"] >= 2010
    >>> c2 = movie["title_year"].isna()
    >>> criteria = c1 | c2
    
  2. Use the .mask method on a DataFrame to remove the values for all the values in rows with movies that were made from 2010. Any movie that originally had a missing value for title_year is also masked:
    ...

Selecting with Booleans, integer location, and labels

Previously, we covered a wide range of recipes on selecting different subsets of data through the .iloc and .loc attributes. Both of these select rows and columns simultaneously by either integer location or label.

In this recipe, we will filter both rows and columns with the .iloc and .loc attributes.

How to do it…

  1. Read in the movie dataset, set the index as the title, and then create a Boolean array matching all movies with a content rating of G and an IMDB score less than 4:
    >>> movie = pd.read_csv(
    ...     "data/movie.csv", index_col="movie_title"
    ... )
    >>> c1 = movie["content_rating"] == "G"
    >>> c2 = movie["imdb_score"] < 4
    >>> criteria = c1 & c2
    
  2. Let's first pass these criteria to .loc to filter the rows:
    >>> movie_loc = movie.loc[criteria]
    >>> movie_loc...
lock icon The rest of the chapter is locked
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020 Publisher: Packt ISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}