Search icon
Subscription
0
Cart icon
Close icon
You have no products in your basket yet
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Pandas 1.x Cookbook - Second Edition

You're reading from  Pandas 1.x Cookbook - Second Edition

Product type Book
Published in Feb 2020
Publisher Packt
ISBN-13 9781839213106
Pages 626 pages
Edition 2nd Edition
Languages
Authors (2):
Matt Harrison Matt Harrison
Profile icon Matt Harrison
Theodore Petrou Theodore Petrou
Profile icon Theodore Petrou
View More author details

Table of Contents (17) Chapters

Preface 1. Pandas Foundations 2. Essential DataFrame Operations 3. Creating and Persisting DataFrames 4. Beginning Data Analysis 5. Exploratory Data Analysis 6. Selecting Subsets of Data 7. Filtering Rows 8. Index Alignment 9. Grouping for Aggregation, Filtration, and Transformation 10. Restructuring Data into a Tidy Form 11. Combining Pandas Objects 12. Time Series Analysis 13. Visualization with Matplotlib, Pandas, and Seaborn 14. Debugging and Testing Pandas 15. Other Books You May Enjoy
16. Index

Chaining DataFrame methods

The Chaining Series methods recipe in Chapter 1, Pandas Foundations, showcased several examples of chaining Series methods together. All the method chains in this chapter will begin from a DataFrame. One of the keys to method chaining is to know the exact object being returned during each step of the chain. In pandas, this will nearly always be a DataFrame, Series, or scalar value.

In this recipe, we count all the missing values in each column of the movie dataset.

How to do it...

  1. We will use the .isnull method to get a count of the missing values. This method will change every value to a Boolean, indicating whether it is missing:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     return col.replace("facebook_likes", "fb").replace(
    ...         "_for_reviews", ""
    ...     )
    >>> movies = movies.rename(columns=shorten)
    >>> movies.isnull().head()
       color  director_name  ...  aspect_ratio  movie_fb
    0  False        False    ...        False      False
    1  False        False    ...        False      False
    2  False        False    ...        False      False
    3  False        False    ...        False      False
    4   True        False    ...         True      False
    
  2. We will chain the .sum method that interprets True and False as 1 and 0, respectively. Because this is a reduction method, it aggregates the results into a Series:
    >>> (movies.isnull().sum().head())
    color             19
    director_name    102
    num_critic        49
    duration          15
    director_fb      102
    dtype: int64
    
  3. We can go one step further and take the sum of this Series and return the count of the total number of missing values in the entire DataFrame as a scalar value:
    >>> movies.isnull().sum().sum()
    2654
    
  4. A way to determine whether there are any missing values in the DataFrame is to use the .any method twice in succession:
    >>> movies.isnull().any().any()
    True
    

How it works...

The .isnull method returns a DataFrame the same size as the calling DataFrame but with all values transformed to Booleans. See the counts of the following data types to verify this:

>>> movies.isnull().dtypes.value_counts()
bool    28
dtype: int64

In Python, Booleans evaluate to 0 and 1, and this makes it possible to sum them by column, as done in step 2. The resulting Series itself also has a .sum method, which gets us the grand total of missing values in the DataFrame.

In step 4, the .any method on a DataFrame returns a Series of Booleans indicating if there exists at least one True for each column. The .any method is chained again on this resulting Series of Booleans to determine if any of the columns have missing values. If step 4 evaluates as True, then there is at least one missing value in the entire DataFrame.

There's more...

Most of the columns in the movie dataset with the object data type contain missing values. By default, aggregation methods (.min, .max, and .sum), do not return anything for object columns. as seen in the following code snippet, which selects three object columns and attempts to find the maximum value of each one:

>>> movies[["color", "movie_title", "color"]].max()
Series([], dtype: float64)

To force pandas to return something for each column, we must fill in the missing values. Here, we choose an empty string:

>>> movies.select_dtypes(["object"]).fillna("").max()
color                            Color
director_name            Étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             Óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object

For purposes of readability, method chains are often written as one method call per line surrounded by parentheses. This makes it easier to read and insert comments on what is returned at each step of the chain, or comment out lines to debug what is happening:

>>> (movies.select_dtypes(["object"]).fillna("").max())
color                            Color
director_name            Étienne Faure
actor_2_name             Zubaida Sahar
genres                         Western
actor_1_name             Óscar Jaenada
                          ...         
plot_keywords      zombie|zombie spoof
movie_imdb_link    http://www.imdb....
language                          Zulu
country                   West Germany
content_rating                       X
Length: 12, dtype: object
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020 Publisher: Packt ISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}