Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Pandas 1.x Cookbook - Second Edition

You're reading from  Pandas 1.x Cookbook - Second Edition

Product type Book
Published in Feb 2020
Publisher Packt
ISBN-13 9781839213106
Pages 626 pages
Edition 2nd Edition
Languages
Authors (2):
Matt Harrison Matt Harrison
Profile icon Matt Harrison
Theodore Petrou Theodore Petrou
Profile icon Theodore Petrou
View More author details

Table of Contents (17) Chapters

Preface Pandas Foundations Essential DataFrame Operations Creating and Persisting DataFrames Beginning Data Analysis Exploratory Data Analysis Selecting Subsets of Data Filtering Rows Index Alignment Grouping for Aggregation, Filtration, and Transformation Restructuring Data into a Tidy Form Combining Pandas Objects Time Series Analysis Visualization with Matplotlib, Pandas, and Seaborn Debugging and Testing Pandas Other Books You May Enjoy
Index

Summarizing a DataFrame

In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.

In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.

How to do it...

  1. Read in the movie dataset, and examine the basic descriptive properties, .shape, .size, and .ndim, along with running the len function:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movies.shape
    (4916, 28)
    >>> movies.size
    137648
    >>> movies.ndim
    2
    >>> len(movies)
    4916
    
  2. The .count method shows the number of non-missing values for each column. It is an aggregation method as it summarizes every column in a single value. The output is a Series that has the original column names as its index:
    >>> movies.count()
    color                      4897
    director_name              4814
    num_critic_for_reviews     4867
    duration                   4901
    director_facebook_likes    4814
                               ... 
    title_year                 4810
    actor_2_facebook_likes     4903
    imdb_score                 4916
    aspect_ratio               4590
    movie_facebook_likes       4916
    Length: 28, dtype: int64
    
  3. The other methods that compute summary statistics, .min, .max, .mean, .median, and .std, return Series that have the column names of the numeric columns in the index and their aggregations as the values:
    >>> movies.min()
    num_critic_for_reviews        1.00
    duration                      7.00
    director_facebook_likes       0.00
    actor_3_facebook_likes        0.00
    actor_1_facebook_likes        0.00
                                ...   
    title_year                 1916.00
    actor_2_facebook_likes        0.00
    imdb_score                    1.60
    aspect_ratio                  1.18
    movie_facebook_likes          0.00
    Length: 16, dtype: float64
    
  4. The .describe method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistics names as its index. I like to transpose the results using .T as I can usually fit more information on the screen that way:
    >>> movies.describe().T
                   count         mean  ...       75%       max
    num_criti...  4867.0   137.988905  ...    191.00     813.0
    duration      4901.0   107.090798  ...    118.00     511.0
    director_...  4814.0   691.014541  ...    189.75   23000.0
    actor_3_f...  4893.0   631.276313  ...    633.00   23000.0
    actor_1_f...  4909.0  6494.488491  ...  11000.00  640000.0
    ...              ...          ...  ...       ...       ...
    title_year    4810.0  2002.447609  ...   2011.00    2016.0
    actor_2_f...  4903.0  1621.923516  ...    912.00  137000.0
    imdb_score    4916.0     6.437429  ...      7.20       9.5
    aspect_ratio  4590.0     2.222349  ...      2.35      16.0
    movie_fac...  4916.0  7348.294142  ...   2000.00  349000.0
    
  5. It is possible to specify exact quantiles in the .describe method using the percentiles parameter:
    >>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
                   count         mean  ...       99%       max
    num_criti...  4867.0   137.988905  ...    546.68     813.0
    duration      4901.0   107.090798  ...    189.00     511.0
    director_...  4814.0   691.014541  ...  16000.00   23000.0
    actor_3_f...  4893.0   631.276313  ...  11000.00   23000.0
    actor_1_f...  4909.0  6494.488491  ...  44920.00  640000.0
    ...              ...          ...  ...       ...       ...
    title_year    4810.0  2002.447609  ...   2016.00    2016.0
    actor_2_f...  4903.0  1621.923516  ...  17000.00  137000.0
    imdb_score    4916.0     6.437429  ...      8.50       9.5
    aspect_ratio  4590.0     2.222349  ...      4.00      16.0
    movie_fac...  4916.0  7348.294142  ...  93850.00  349000.0
    

How it works...

Step 1 gives basic information on the size of the dataset. The .shape attribute returns a tuple with the number of rows and columns. The .size attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len function, it returns the number of rows.

The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.

If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. This method ignores string columns by default.

Note that numeric columns have missing values but have a result returned by .describe. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna parameter to False. This will cause pandas to return NaN for all these aggregation methods if there exists at least a single missing value.

The .describe method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles parameter. See the Developing a data analysis routine recipe for more on the .describe method.

There's more...

To see how the .skipna parameter affects the outcome, we can set its value to False and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:

>>> movies.min(skipna=False)
num_critic_for_reviews     NaN
duration                   NaN
director_facebook_likes    NaN
actor_3_facebook_likes     NaN
actor_1_facebook_likes     NaN
                          ... 
title_year                 NaN
actor_2_facebook_likes     NaN
imdb_score                 1.6
aspect_ratio               NaN
movie_facebook_likes       0.0
Length: 16, dtype: float64
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020 Publisher: Packt ISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}