You're reading from Pandas 1.x Cookbook - Second Edition

Product typeBook

Published inFeb 2020

Reading LevelBeginner

PublisherPackt

ISBN-139781839213106

Edition2nd Edition

Languages

Python

Tools

Pandas

Concepts

Programming Language

Authors (2):

Matt Harrison

Theodore Petrou

View More author details

Introduction

One of the most fundamental tasks during data analysis involves splitting data into independent groups before performing a calculation on each group. This methodology has been around for quite some time but has more recently been referred to as split-apply-combine. This chapter covers the powerful .groupby method, which allows you to group your data in any way imaginable and apply any type of function independently to each group before returning a single dataset.

Before we get started with the recipes, we will need to know just a little terminology. All basic groupby operations have grouping columns, and each unique combination of values in these columns represents an independent grouping of the data. The syntax looks as follows:

df.groupby(['list', 'of', 'grouping', 'columns'])
df.groupby('single_column')  # when grouping by a single column

The result of calling the .groupby method is a groupby object...

Defining an aggregation

In this recipe, we examine the flights dataset and perform the simplest aggregation involving only a single grouping column, a single aggregating column, and a single aggregating function. We will find the average arrival delay for each airline. pandas has different syntaxes to create an aggregation, and this recipe will show them.

How to do it…

Read in the flights dataset:

>>> import pandas as pd
>>> import numpy as np
>>> flights = pd.read_csv('data/flights.csv')
>>> flights.head()
0      1    1        4  ...      65.0        0         0
1      1    1        4  ...     -13.0        0         0
2      1    1        4  ...      35.0        0         0
3      1    1        4  ...      -7.0        0         0
4      1    1        4  ...      39.0        0         0

Define the grouping columns (AIRLINE), aggregating columns (ARR_DELAY), and aggregating functions (mean). Place...

Grouping and aggregating with multiple columns and functions

It is possible to group and aggregate with multiple columns. The syntax is slightly different than it is for grouping and aggregating with a single column. As usual with any kind of grouping operation, it helps to identify the three components: the grouping columns, aggregating columns, and aggregating functions.

In this recipe, we showcase the flexibility of the .groupby method by answering the following queries:

Finding the number of canceled flights for every airline per weekday
Finding the number and percentage of canceled and diverted flights for every airline per weekday
For each origin and destination, finding the total number of flights, the number and percentage of canceled flights, and the average and variance of the airtime

How to do it…

Read in the flights dataset, and answer the first query by defining the grouping columns (AIRLINE, WEEKDAY), the aggregating...

Removing the MultiIndex after grouping

Inevitably, when using groupby, you will create a MultiIndex. MultiIndexes can happen in both the index and the columns. DataFrames with MultiIndexes are more difficult to navigate and occasionally have confusing column names as well.

In this recipe, we perform an aggregation with the .groupby method to create a DataFrame with a MultiIndex for the rows and columns. Then, we manipulate the index so that it has a single level and the column names are descriptive.

How to do it…

Read in the flights dataset, write a statement to find the total and average miles flown, and the maximum and minimum arrival delay for each airline for each weekday:

>>> flights = pd.read_csv('data/flights.csv')
>>> airline_info = (flights
...     .groupby(['AIRLINE', 'WEEKDAY'])
...     .agg({'DIST':['sum', 'mean'],
...           'ARR_DELAY':[&apos...

Grouping with a custom aggregation function

pandas provides a number of aggregation functions to use with the groupby object. At some point, you may need to write your own custom user-defined function that does not exist in pandas or NumPy.

In this recipe, we use the college dataset to calculate the mean and standard deviation of the undergraduate student population per state. We then use this information to find the maximum number of standard deviations from the mean that any single population value is per state.

How to do it…

Read in the college dataset, and find the mean and standard deviation of the undergraduate population by state:

>>> college = pd.read_csv('data/college.csv')
>>> (college
...     .groupby('STABBR')
...     ['UGDS']
...     .agg(['mean', 'std'])
...     .round(0)
... )
          mean      std
STABBR                 
AK      2493.0   4052.0
AL     ...

Customizing aggregating functions with *args and **kwargs

When writing your own user-defined customized aggregation function, pandas implicitly passes it each of the aggregating columns one at a time as a Series. Occasionally, you will need to pass more arguments to your function than just the Series itself. To do so, you need to be aware of Python's ability to pass an arbitrary number of arguments to functions.

The signature to .agg is agg(func, *args, **kwargs). The func parameter is a reducing function, the string name of a reducing method, a list of reducing functions, or a dictionary mapping columns to functions or a list of functions. Additionally, as we have seen, you can use keyword arguments to create named aggregations.

If you have a reducing function that takes additional arguments that you would like to use, you can leverage the *args and **kwargs parameters to pass arguments to the reduction function. You can use *args to pass an...

Examining the groupby object

The immediate result from using the .groupby method on a DataFrame is a groupby object. Usually, we chain operations on this object to do aggregations or transformations without ever storing the intermediate values in variables.

In this recipe, we examine the groupby object to examine individual groups.

How to do it…

Let's get started by grouping the state and religious affiliation columns from the college dataset, saving the result to a variable and confirming its type:

>>> college = pd.read_csv('data/college.csv')
>>> grouped = college.groupby(['STABBR', 'RELAFFIL'])
>>> type(grouped)
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

Use the dir function to discover the attributes of a groupby object:

>>> print([attr for attr in dir(grouped) if not
...     attr.startswith('_')])
['CITY&apos...

Filtering for states with a minority majority

Previously, we examined using Boolean arrays to filter rows. In a similar fashion, when using the .groupby method, we can filter out groups. The .filter method of the groupby object accepts a function that must return either True or False to indicate whether a group is kept.

This .filter method applied after a call to the .groupby method is completely different to the DataFrame .filter method covered in the Selecting columns with methods recipe from Chapter 2, Essential DataFrame Operations.

One thing to be aware of is that when the .filter method is applied, the result does not use the grouping columns as the index, but keeps the original index! The DataFrame .filter method filters columns, not values.

In this recipe, we use the college dataset to find all the states that have more non-white undergraduate students than white. This is a dataset from the US, where whites form the majority and therefore, we are looking for states...

Transforming through a weight loss bet

One method to increase motivation to lose weight is to make a bet with someone else. The scenario in this recipe will track weight loss from two individuals throughout a four-month period and determine a winner.

In this recipe, we use simulated data from two individuals to track the percentage of weight loss over four months. At the end of each month, a winner will be declared based on the individual who lost the highest percentage of body weight for that month. To track weight loss, we group our data by month and person, and then call the .transform method to find the percentage weight loss change for each week against the start of the month.

We will use the .transform method in this recipe. This method returns a new object that preserves the index of the original DataFrame but allows you to do calculations on groups of the data.

How to do it…

Read in the raw weight_loss dataset, and examine the first...

Calculating weighted mean SAT scores per state with apply

The groupby object has four methods that accept a function (or functions) to perform a calculation on each group. These four methods are .agg, .filter, .transform, and .apply. Each of the first three of these methods has a very specific output that the function must return. .agg must return a scalar value, .filter must return a Boolean, and .transform must return a Series or DataFrame with the same length as the passed group. The .apply method, however, may return a scalar value, a Series, or even a DataFrame of any shape, therefore making it very flexible. It is also called only once per group (on a DataFrame), while the .transform and .agg methods get called once for each aggregating column (on a Series). The .apply method's ability to return a single object when operating on multiple columns at the same time makes the calculation in this recipe possible.

In this recipe, we calculate the weighted average of both...

Grouping by continuous variables

When grouping in pandas, you typically use columns with discrete repeating values. If there are no repeated values, then grouping would be pointless as there would only be one row per group. Continuous numeric columns typically have few repeated values and are generally not used to form groups. However, if we can transform columns with continuous values into a discrete column by placing each value in a bin, rounding them, or using some other mapping, then grouping with them makes sense.

In this recipe, we explore the flights dataset to discover the distribution of airlines for different travel distances. This allows us, for example, to find the airline that makes the most flights between 500 and 1,000 miles. To accomplish this, we use the pandas cut function to discretize the distance of each flight flown.

How to do it…

Read in the flights dataset:

>>> flights = pd.read_csv('data/flights.csv')
>...

Counting the total number of flights between cities

In the flights dataset, we have data on the origin and destination airport. It is trivial to count the number of flights originating in Houston and landing in Atlanta, for instance. What is more difficult is counting the total number of flights between the two cities.

In this recipe, we count the total number of flights between two cities, regardless of which one is the origin or destination. To accomplish this, we sort the origin and destination airports alphabetically so that each combination of airports always occurs in the same order. We can then use this new column arrangement to form groups and then to count.

How to do it…

Read in the flights dataset, and find the total number of flights between each origin and destination airport:

>>> flights = pd.read_csv('data/flights.csv')
>>> flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size...

Finding the longest streak of on-time flights

One of the most important metrics for airlines is their on-time flight performance. The Federal Aviation Administration considers a flight delayed when it arrives at least 15 minutes later than its scheduled arrival time. pandas includes methods to calculate the total and percentage of on-time flights per airline. While these basic summary statistics are an important metric, there are other non-trivial calculations that are interesting, such as finding the length of consecutive on-time flights for each airline at each of its origin airports.

In this recipe, we find the longest consecutive streak of on-time flights for each airline at each origin airport. This requires each value in a column to be aware of the value immediately following it. We make clever use of the .diff and .cumsum methods to find streaks before applying this methodology to each of the groups.

The max_streak function we develop in this section exposes...

The rest of the chapter is locked

You have been reading a chapter from

Pandas 1.x Cookbook - Second Edition

Published in: Feb 2020Publisher: PacktISBN-13: 9781839213106

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at $15.99/month. Cancel anytime

Authors (2)

Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou

Other recommended products

Related to this chapter

Mastering Exploratory Analysis with pandas

Exploratory data analysis exploits the visual properties of the datasets that are commonly used by data scientists. It helps you build custom data pipelines to address data analysis tasks. This book uses pandas, the most popular Python library for data analysis, and helps you build end-to-end exploratory data-analysis solutions

BookSep 2018140 pages

Python Data Cleaning Cookbook

The book shows you how to view data from multiple perspectives, including data frame and column attributes. You will cover common and not-so-common challenges that are faced while cleaning messy data for complex situations. You will learn to manipulate data and get them down to a form that can be useful for making the right decisions.

BookDec 2020436 pages

Learning pandas

Pandas is a popular Python package used for practical, real world data analysis. It provides efficient fast, high-performance data structures that makes data exploration and analysis very easy. This learner's guide will help you through a comprehensive set of features provided by the pandas library to perform efficient data manipulation and analysis.

BookJun 2017446 pages

Hands-On Data Analysis with NumPy and Pandas

In this book, you will explore two important Python packages used by Data Analysts, NumPy & pandas. You will dive into different concepts such as reading, sorting, grouping of data, and also learn how to work with different data formats for your data analysis projects.

BookJun 2018168 pages5

Mastering pandas

pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful techniques and real-world examples on getting the most out of pandas for expert-level data manipulation, analysis and visualization.

BookOct 2019674 pages

Hands-On Data Analysis with Pandas

This book will be a handy guide to quickly learn pandas and understand how it can empower you in the exciting world of data manipulation, analysis, and data science. You will learn how to use pandas to perform numeric and statistical analysis using real-world examples. You will also visualize statistical data and apply pandas to different domains.

BookJul 2019740 pages

Hands-On Data Analysis with Pandas

Knowing how to work with data to extract insights generates significant value. This book will help you to develop data analysis skills using a hands-on approach and real-world data. You’ll get up to speed with pandas 1.x in no time and build some software engineering skills in the process, vastly expanding your data science toolbox.

BookApr 2021788 pages5

Personalised recommendations for you

Based on your interests and search pattern

C++ Programming for Linux Systems

This book covers the essential system programming tools and helps you explore the features of C++20. It emphasizes important details to maintain code quality and tackle everyday challenges of developing software for high performance, optimization, and more.

BookSep 2023288 pages

Expert C++

Discover advanced programming techniques, the latest features of C++17 and C++20, and best practices for memory management, debugging, testing, and large-scale application design with Expert C++. Ideal for experienced developers advancing to proficient programmers and building professional-grade C++ applications.

BookAug 2023604 pages

iOS 17 Programming for Beginners

iOS 17 Programming for Beginners, Eighth Edition is your comprehensive guide to learning the art of iOS app development. Whether you dream of creating the next chart-topping app or simply want to enhance your programming skills, this book is your trusted companion on this exciting journey.

BookOct 2023604 pages4

Developer Career Masterplan

Written by industry experts that have spent the last 20+ years helping developers grow their career path towards senior developer positions and beyond. This book provides a comprehensive guide, sharing examples and stories from their global careers. By the end, you’ll have the knowledge to create a clear career progression plan as a technical professional.

BookSep 2023310 pages

Refactoring with C#

In Refactoring with C#, you’ll explore the process of safely refactoring modern .NET code using Visual Studio features, advanced unit tests, AI assistance, and custom Roslyn analyzers.

BookNov 2023434 pages

Python Real-World Projects

Amplify your developer journey by curating a dynamic project portfolio that outshines traditional resumes. Delve into the Python realm through immersive projects, mastering core concepts while constructing comprehensive modules and applications. From data acquisition prowess to impactful data visualization, Python Real-World Projects arms you with essential skills to beat the competition.

BookSep 2023478 pages5

The MVVM Pattern in .NET MAUI

The MVVM Pattern in .NET MAUI enables developers to master MVVM principles and effectively apply them to .NET MAUI. This book uses real-life examples and covers complex problems to help you successfully apply MVVM with .NET MAUI to confidently develop robust and high-performing cross-platform apps.

BookNov 2023386 pages

Extending Microsoft Business Central with Power Platform

Extending Business Central with the Power Platform is a step-by-step guide for Business Central professionals to create solutions that automate business processes, explain complex workflow approvals, and integrate with hundreds of other systems, without traditional development. It’ll guide you in customizing Business Central with Power Platform.

BookAug 2023458 pages5

Extending Microsoft Business Central with Power Platform

Extending Business Central with the Power Platform is a step-by-step guide for Business Central professionals to create solutions that automate business processes, explain complex workflow approvals, and integrate with hundreds of other systems, without traditional development. It’ll guide you in customizing Business Central with Power Platform.

BookAug 2023458 pages5

Quantum Computing Algorithms

The book emphasizes intuitive ideas behind quantum algorithms in ways that other books don’t cover, striking a careful balance between no math and too much math. To get the most from this book, you should be comfortable with basic algebra and writing simple computer code. No prior understanding of quantum physics is needed to get started.

BookSep 2023342 pages

Python – Complete Python, Django, Data Science and ML Guide

Unlock Python's full potential with this 50+ hour course! From programming to web and game development, data manipulation, and machine learning, gain the skills required to succeed in various Python-related careers. With practical tasks, hands-on experience, and a strong foundation in Python, you'll be ready to tackle real-world challenges and take advantage of the many opportunities this versatile language offers.

VideoNov 202350 hours 30 minutes5

Python – Complete Python, Django, Data Science and ML Guide

Unlock Python's full potential with this 50+ hour course! From programming to web and game development, data manipulation, and machine learning, gain the skills required to succeed in various Python-related careers. With practical tasks, hands-on experience, and a strong foundation in Python, you'll be ready to tackle real-world challenges and take advantage of the many opportunities this versatile language offers.

VideoNov 202350 hours 30 minutes5

You're reading from Pandas 1.x Cookbook - Second Edition

Introduction

Defining an aggregation

How to do it…

Grouping and aggregating with multiple columns and functions

How to do it…

Removing the MultiIndex after grouping

How to do it…

Grouping with a custom aggregation function

How to do it…

Customizing aggregating functions with *args and **kwargs

Examining the groupby object

How to do it…

Filtering for states with a minority majority

Transforming through a weight loss bet

How to do it…

Calculating weighted mean SAT scores per state with apply

Grouping by continuous variables

How to do it…

Counting the total number of flights between cities

How to do it…

Finding the longest streak of on-time flights

Unlock this book and the full library FREE for 7 days

Authors (2)

Mastering Exploratory Analysis with pandas

Python Data Cleaning Cookbook

Learning pandas

Hands-On Data Analysis with NumPy and Pandas

In this book, you will explore two important Python packages used by Data Analysts, NumPy &amp; pandas. You will dive into different concepts such as reading, sorting, grouping of data, and also learn how to work with different data formats for your data analysis projects.

Mastering pandas

pandas is a popular Python library used by data scientists and analysts worldwide to manipulate and analyze their data. This book presents useful techniques and real-world examples on getting the most out of pandas for expert-level data manipulation, analysis and visualization.

Hands-On Data Analysis with Pandas

Hands-On Data Analysis with Pandas

C++ Programming for Linux Systems

This book covers the essential system programming tools and helps you explore the features of C++20. It emphasizes important details to maintain code quality and tackle everyday challenges of developing software for high performance, optimization, and more.

Expert C++

iOS 17 Programming for Beginners

iOS 17 Programming for Beginners, Eighth Edition is your comprehensive guide to learning the art of iOS app development. Whether you dream of creating the next chart-topping app or simply want to enhance your programming skills, this book is your trusted companion on this exciting journey.

Developer Career Masterplan

Refactoring with C#

In Refactoring with C#, you’ll explore the process of safely refactoring modern .NET code using Visual Studio features, advanced unit tests, AI assistance, and custom Roslyn analyzers.

Python Real-World Projects

The MVVM Pattern in .NET MAUI

The MVVM Pattern in .NET MAUI enables developers to master MVVM principles and effectively apply them to .NET MAUI. This book uses real-life examples and covers complex problems to help you successfully apply MVVM with .NET MAUI to confidently develop robust and high-performing cross-platform apps.

Extending Microsoft Business Central with Power Platform

Extending Microsoft Business Central with Power Platform

Quantum Computing Algorithms

Python – Complete Python, Django, Data Science and ML Guide

Python – Complete Python, Django, Data Science and ML Guide

In this book, you will explore two important Python packages used by Data Analysts, NumPy & pandas. You will dive into different concepts such as reading, sorting, grouping of data, and also learn how to work with different data formats for your data analysis projects.