Reader small image

You're reading from  Pandas 1.x Cookbook - Second Edition

Product typeBook
Published inFeb 2020
Reading LevelBeginner
PublisherPackt
ISBN-139781839213106
Edition2nd Edition
Languages
Tools
Right arrow
Authors (2):
Matt Harrison
Matt Harrison
author image
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

Theodore Petrou
Theodore Petrou
author image
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou

View More author details
Right arrow

Introduction

One of the most fundamental tasks during data analysis involves splitting data into independent groups before performing a calculation on each group. This methodology has been around for quite some time but has more recently been referred to as split-apply-combine. This chapter covers the powerful .groupby method, which allows you to group your data in any way imaginable and apply any type of function independently to each group before returning a single dataset.

Before we get started with the recipes, we will need to know just a little terminology. All basic groupby operations have grouping columns, and each unique combination of values in these columns represents an independent grouping of the data. The syntax looks as follows:

df.groupby(['list', 'of', 'grouping', 'columns'])
df.groupby('single_column')  # when grouping by a single column 

The result of calling the .groupby method is a groupby object...

Defining an aggregation

In this recipe, we examine the flights dataset and perform the simplest aggregation involving only a single grouping column, a single aggregating column, and a single aggregating function. We will find the average arrival delay for each airline. pandas has different syntaxes to create an aggregation, and this recipe will show them.

How to do it…

  1. Read in the flights dataset:
    >>> import pandas as pd
    >>> import numpy as np
    >>> flights = pd.read_csv('data/flights.csv')
    >>> flights.head()
    0      1    1        4  ...      65.0        0         0
    1      1    1        4  ...     -13.0        0         0
    2      1    1        4  ...      35.0        0         0
    3      1    1        4  ...      -7.0        0         0
    4      1    1        4  ...      39.0        0         0
    
  2. Define the grouping columns (AIRLINE), aggregating columns (ARR_DELAY), and aggregating functions (mean). Place...

Grouping and aggregating with multiple columns and functions

It is possible to group and aggregate with multiple columns. The syntax is slightly different than it is for grouping and aggregating with a single column. As usual with any kind of grouping operation, it helps to identify the three components: the grouping columns, aggregating columns, and aggregating functions.

In this recipe, we showcase the flexibility of the .groupby method by answering the following queries:

  • Finding the number of canceled flights for every airline per weekday
  • Finding the number and percentage of canceled and diverted flights for every airline per weekday
  • For each origin and destination, finding the total number of flights, the number and percentage of canceled flights, and the average and variance of the airtime

How to do it…

  1. Read in the flights dataset, and answer the first query by defining the grouping columns (AIRLINE, WEEKDAY), the aggregating...

Removing the MultiIndex after grouping

Inevitably, when using groupby, you will create a MultiIndex. MultiIndexes can happen in both the index and the columns. DataFrames with MultiIndexes are more difficult to navigate and occasionally have confusing column names as well.

In this recipe, we perform an aggregation with the .groupby method to create a DataFrame with a MultiIndex for the rows and columns. Then, we manipulate the index so that it has a single level and the column names are descriptive.

How to do it…

  1. Read in the flights dataset, write a statement to find the total and average miles flown, and the maximum and minimum arrival delay for each airline for each weekday:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> airline_info = (flights
    ...     .groupby(['AIRLINE', 'WEEKDAY'])
    ...     .agg({'DIST':['sum', 'mean'],
    ...           'ARR_DELAY':[&apos...

Grouping with a custom aggregation function

pandas provides a number of aggregation functions to use with the groupby object. At some point, you may need to write your own custom user-defined function that does not exist in pandas or NumPy.

In this recipe, we use the college dataset to calculate the mean and standard deviation of the undergraduate student population per state. We then use this information to find the maximum number of standard deviations from the mean that any single population value is per state.

How to do it…

  1. Read in the college dataset, and find the mean and standard deviation of the undergraduate population by state:
    >>> college = pd.read_csv('data/college.csv')
    >>> (college
    ...     .groupby('STABBR')
    ...     ['UGDS']
    ...     .agg(['mean', 'std'])
    ...     .round(0)
    ... )
              mean      std
    STABBR                 
    AK      2493.0   4052.0
    AL     ...

Customizing aggregating functions with *args and **kwargs

When writing your own user-defined customized aggregation function, pandas implicitly passes it each of the aggregating columns one at a time as a Series. Occasionally, you will need to pass more arguments to your function than just the Series itself. To do so, you need to be aware of Python's ability to pass an arbitrary number of arguments to functions.

The signature to .agg is agg(func, *args, **kwargs). The func parameter is a reducing function, the string name of a reducing method, a list of reducing functions, or a dictionary mapping columns to functions or a list of functions. Additionally, as we have seen, you can use keyword arguments to create named aggregations.

If you have a reducing function that takes additional arguments that you would like to use, you can leverage the *args and **kwargs parameters to pass arguments to the reduction function. You can use *args to pass an...

Examining the groupby object

The immediate result from using the .groupby method on a DataFrame is a groupby object. Usually, we chain operations on this object to do aggregations or transformations without ever storing the intermediate values in variables.

In this recipe, we examine the groupby object to examine individual groups.

How to do it…

  1. Let's get started by grouping the state and religious affiliation columns from the college dataset, saving the result to a variable and confirming its type:
    >>> college = pd.read_csv('data/college.csv')
    >>> grouped = college.groupby(['STABBR', 'RELAFFIL'])
    >>> type(grouped)
    <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
    
  2. Use the dir function to discover the attributes of a groupby object:
    >>> print([attr for attr in dir(grouped) if not
    ...     attr.startswith('_')])
    ['CITY&apos...

Filtering for states with a minority majority

Previously, we examined using Boolean arrays to filter rows. In a similar fashion, when using the .groupby method, we can filter out groups. The .filter method of the groupby object accepts a function that must return either True or False to indicate whether a group is kept.

This .filter method applied after a call to the .groupby method is completely different to the DataFrame .filter method covered in the Selecting columns with methods recipe from Chapter 2, Essential DataFrame Operations.

One thing to be aware of is that when the .filter method is applied, the result does not use the grouping columns as the index, but keeps the original index! The DataFrame .filter method filters columns, not values.

In this recipe, we use the college dataset to find all the states that have more non-white undergraduate students than white. This is a dataset from the US, where whites form the majority and therefore, we are looking for states...

Transforming through a weight loss bet

One method to increase motivation to lose weight is to make a bet with someone else. The scenario in this recipe will track weight loss from two individuals throughout a four-month period and determine a winner.

In this recipe, we use simulated data from two individuals to track the percentage of weight loss over four months. At the end of each month, a winner will be declared based on the individual who lost the highest percentage of body weight for that month. To track weight loss, we group our data by month and person, and then call the .transform method to find the percentage weight loss change for each week against the start of the month.

We will use the .transform method in this recipe. This method returns a new object that preserves the index of the original DataFrame but allows you to do calculations on groups of the data.

How to do it…

  1. Read in the raw weight_loss dataset, and examine the first...

Calculating weighted mean SAT scores per state with apply

The groupby object has four methods that accept a function (or functions) to perform a calculation on each group. These four methods are .agg, .filter, .transform, and .apply. Each of the first three of these methods has a very specific output that the function must return. .agg must return a scalar value, .filter must return a Boolean, and .transform must return a Series or DataFrame with the same length as the passed group. The .apply method, however, may return a scalar value, a Series, or even a DataFrame of any shape, therefore making it very flexible. It is also called only once per group (on a DataFrame), while the .transform and .agg methods get called once for each aggregating column (on a Series). The .apply method's ability to return a single object when operating on multiple columns at the same time makes the calculation in this recipe possible.

In this recipe, we calculate the weighted average of both...

Grouping by continuous variables

When grouping in pandas, you typically use columns with discrete repeating values. If there are no repeated values, then grouping would be pointless as there would only be one row per group. Continuous numeric columns typically have few repeated values and are generally not used to form groups. However, if we can transform columns with continuous values into a discrete column by placing each value in a bin, rounding them, or using some other mapping, then grouping with them makes sense.

In this recipe, we explore the flights dataset to discover the distribution of airlines for different travel distances. This allows us, for example, to find the airline that makes the most flights between 500 and 1,000 miles. To accomplish this, we use the pandas cut function to discretize the distance of each flight flown.

How to do it…

  1. Read in the flights dataset:
    >>> flights = pd.read_csv('data/flights.csv')
    >...

Counting the total number of flights between cities

In the flights dataset, we have data on the origin and destination airport. It is trivial to count the number of flights originating in Houston and landing in Atlanta, for instance. What is more difficult is counting the total number of flights between the two cities.

In this recipe, we count the total number of flights between two cities, regardless of which one is the origin or destination. To accomplish this, we sort the origin and destination airports alphabetically so that each combination of airports always occurs in the same order. We can then use this new column arrangement to form groups and then to count.

How to do it…

  1. Read in the flights dataset, and find the total number of flights between each origin and destination airport:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size...

Finding the longest streak of on-time flights

One of the most important metrics for airlines is their on-time flight performance. The Federal Aviation Administration considers a flight delayed when it arrives at least 15 minutes later than its scheduled arrival time. pandas includes methods to calculate the total and percentage of on-time flights per airline. While these basic summary statistics are an important metric, there are other non-trivial calculations that are interesting, such as finding the length of consecutive on-time flights for each airline at each of its origin airports.

In this recipe, we find the longest consecutive streak of on-time flights for each airline at each origin airport. This requires each value in a column to be aware of the value immediately following it. We make clever use of the .diff and .cumsum methods to find streaks before applying this methodology to each of the groups.

The max_streak function we develop in this section exposes...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020Publisher: PacktISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

author image
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou