You're reading from Hands-On Data Analysis with Pandas - Second Edition

Product type: Book
Published in: Apr 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800563452
Edition: 2nd Edition

Author: Stefanie Molin

Stefanie Molin is a data scientist and software engineer at Bloomberg LP in NYC, tackling tough problems in information security, particularly revolving around anomaly detection, building tools for gathering data, and knowledge sharing. She has extensive experience in data science, designing anomaly detection solutions, and utilizing machine learning in both R and Python in the AdTech and FinTech industries. She holds a B.S. in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, with minors in economics, and entrepreneurship and innovation. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.

Chapter 4: Aggregating Pandas DataFrames

In this chapter, we will continue our discussion of data wrangling from Chapter 3, Data Wrangling with Pandas, by addressing the enrichment and aggregation of data. This includes essential skills, such as merging dataframes, creating new columns, performing window calculations, and aggregating by group membership. Calculating aggregations and summaries will help us draw conclusions about our data.

We will also take a look at the additional functionality pandas has for working with time series data, beyond the time series slicing we introduced in previous chapters, including how we can roll up the data with aggregation and select it based on the time of day. Much of the data we will encounter is time series data, so being able to effectively work with time series is paramount. Of course, performing these operations efficiently is important, so we will also review how to write efficient pandas code.

This chapter will get us comfortable with...

Chapter materials

The materials for this chapter can be found on GitHub at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_04. There are four notebooks that we will work through, each numbered according to when they will be used. The text will prompt you to switch. We will begin with the 1-querying_and_merging.ipynb notebook to learn about querying and merging dataframes. Then, we will move on to the 2-dataframe_operations.ipynb notebook to discuss data enrichment through operations such as binning, window functions, and pipes. For this section, we will also use the window_calc.py Python file, which contains a function for performing window calculations using pipes.

Tip

The understanding_window_calculations.ipynb notebook contains some interactive visualizations for understanding window functions. This may require some additional setup, but the instructions are in the notebook.

Next, in the 3-aggregations.ipynb notebook, we will discuss...

Performing database-style operations on DataFrames

DataFrame objects are analogous to tables in a database: each has a name we refer to it by, is composed of rows, and contains columns of specific data types. Consequently, pandas allows us to carry out database-style operations on them. Traditionally, databases support a minimum of four operations, called CRUD: Create, Read, Update, and Delete.

A database query language—most commonly SQL (pronounced sequel or S-Q-L), which stands for Structured Query Language—is used to ask the database to perform these operations. Knowledge of SQL is not required for this book; however, since many data professionals have some familiarity with basic SQL, we will look at the SQL equivalent of each pandas operation discussed in this section, as it may aid understanding. Consult the Further reading section for resources that provide a more formal introduction to SQL.
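To make the SQL analogy concrete, here is a minimal sketch of two database-style operations in pandas: filtering rows with query() (like a WHERE clause) and joining with merge() (like an INNER JOIN). The inventory and orders frames are made up for illustration; they are not the chapter's datasets.

```python
import pandas as pd

# Hypothetical example data (not from the chapter's files).
inventory = pd.DataFrame({
    'item': ['apple', 'banana', 'cherry'],
    'price': [1.0, 0.5, 3.0],
})
orders = pd.DataFrame({
    'item': ['apple', 'cherry', 'cherry'],
    'quantity': [2, 1, 4],
})

# SQL: SELECT * FROM inventory WHERE price > 0.75
expensive = inventory.query('price > 0.75')

# SQL: SELECT * FROM orders INNER JOIN inventory USING (item)
joined = orders.merge(inventory, on='item', how='inner')
```

The `how` parameter of merge() controls the join type ('inner', 'left', 'right', 'outer'), mirroring the corresponding SQL joins.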

For this section, we will be...

Using DataFrame operations to enrich data

Now that we've discussed how to query and merge DataFrame objects, let's learn how to perform complex operations on them to create and modify columns and rows. For this section, we will be working in the 2-dataframe_operations.ipynb notebook using the weather data, along with Facebook stock's volume traded and opening, high, low, and closing prices daily for 2018. Let's import what we will need and read in the data:

>>> import numpy as np
>>> import pandas as pd
>>> weather = pd.read_csv(
...     'data/nyc_weather_2018.csv', parse_dates=['date']
... )
>>> fb = pd.read_csv(
...     'data/fb_2018.csv', index_col='date', parse_dates=True
... )

We will begin by reviewing operations that summarize entire rows and columns before moving on to binning, applying functions across rows and columns, and...
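As a preview of the binning and window calculations covered in this section, here is a small self-contained sketch on a synthetic daily series; the chapter's notebook applies these techniques to the fb and weather data instead.

```python
import pandas as pd

# Synthetic daily closing prices (illustrative stand-in for the fb data).
prices = pd.Series(
    [10, 11, 12, 11, 13, 14, 13],
    index=pd.date_range('2018-01-01', periods=7, freq='D'),
)

# Window calculation: 3-day rolling mean, averaging the current
# day and the previous two.
rolling_mean = prices.rolling('3D').mean()

# Binning: assign each day's volume to one of three equal-width bins.
volume = pd.Series([100, 250, 900, 400, 150, 800, 300], index=prices.index)
volume_bins = pd.cut(volume, bins=3, labels=['low', 'med', 'high'])
```

Because the index is a DatetimeIndex, rolling() accepts a time-based window like '3D' rather than a fixed count of rows.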

Aggregating data

We already got a sneak peek at aggregation when we discussed window calculations and pipes in the previous section. Here, we will focus on summarizing the dataframe through aggregation, which will change the shape of our dataframe (often through row reduction). We also saw how easy it is to take advantage of vectorized NumPy functions on pandas data structures, especially to perform aggregations. This is what NumPy does best: it performs computationally efficient mathematical operations on numeric arrays.

NumPy pairs well with aggregating dataframes since it gives us an easy way to summarize data with different pre-written functions; often, when aggregating, we just need the NumPy function, since most of what we would want to write ourselves has previously been built. We have already seen some NumPy functions commonly used for aggregations, such as np.sum(), np.mean(), np.min(), and np.max(); however, we aren't limited to numeric operations—we can use...
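For instance, here is a minimal illustration of aggregating a whole column versus aggregating by group with agg(), using a made-up two-ticker frame. Note that recent pandas versions prefer string function names such as 'sum' over passing np.sum directly, though both summarize the data the same way.

```python
import pandas as pd

# Illustrative frame; the chapter aggregates the fb/weather data instead.
df = pd.DataFrame({
    'ticker': ['FB', 'FB', 'AAPL', 'AAPL'],
    'volume': [100, 200, 50, 150],
})

# Aggregate an entire column down to a single value...
total = df.volume.sum()

# ...or per group, with multiple functions at once via agg().
by_ticker = df.groupby('ticker').agg({'volume': ['sum', 'mean']})
```

The result of the groupby aggregation has one row per ticker and a MultiIndex on the columns (one level per function applied).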

Working with time series data

With time series data, we have some additional operations we can use, for anything from selection and filtering to aggregation. We will be exploring some of this functionality in the 4-time_series.ipynb notebook. Let's start off by reading in the Facebook data from the previous sections:

>>> import numpy as np
>>> import pandas as pd
>>> fb = pd.read_csv(
...     'data/fb_2018.csv', index_col='date', parse_dates=True
... ).assign(trading_volume=lambda x: pd.cut(
...     x.volume, bins=3, labels=['low', 'med', 'high']
... ))

We will begin this section by discussing the selection and filtering of time series data before moving on to shifting, differencing, resampling, and finally merging based on time. Note that it's important to set the index to our date (or datetime) column, which...
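As a taste of the shifting, differencing, and resampling operations this section covers, here is a sketch on a synthetic daily series standing in for the fb closing prices:

```python
import pandas as pd

# Synthetic daily series (illustrative stand-in for the fb close column).
close = pd.Series(
    [10.0, 10.5, 11.0, 10.8, 11.2, 11.5],
    index=pd.date_range('2018-01-01', periods=6, freq='D'),
)

# Shift by one day (e.g., yesterday's close) and take the daily change.
prev_close = close.shift(1)
daily_change = close.diff()

# Resample to a coarser (weekly) frequency, aggregating each period.
weekly_mean = close.resample('W').mean()
```

Both shift() and resample() rely on the DatetimeIndex, which is why setting the index to the date column matters here.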

Summary

In this chapter, we discussed how to join dataframes, how to determine the data we will lose for each type of join using set operations, and how to query dataframes as we would a database. We then went over some more involved transformations on our columns, such as binning and ranking, and how to do so efficiently with the apply() method. We also learned the importance of vectorized operations in writing efficient pandas code. Then, we explored window calculations and using pipes for cleaner code. Our discussion of window calculations served as a primer for aggregating across whole dataframes and by groups. We also went over how to generate pivot tables and crosstabs. Finally, we looked at some time series-specific functionality in pandas for everything from selection and aggregation to merging.

In the next chapter, we will cover visualization, which pandas implements by providing a wrapper around matplotlib. Data wrangling will play a key role in prepping our data for visualization...

Exercises

Using the CSV files in the exercises/ folder and what we have learned so far in this book, complete the following exercises:

  1. With the earthquakes.csv file, select all the earthquakes in Japan with a magnitude of 4.9 or greater using the mb magnitude type.
  2. Create bins for each full number of earthquake magnitude (for instance, the first bin is (0, 1], the second is (1, 2], and so on) with the ml magnitude type and count how many are in each bin.
  3. Using the faang.csv file, group by the ticker and resample to monthly frequency. Make the following aggregations:

    a) Mean of the opening price

    b) Maximum of the high price

    c) Minimum of the low price

    d) Mean of the closing price

    e) Sum of the volume traded

  4. Build a crosstab with the earthquake data between the tsunami column and the magType column. Rather than showing the frequency count, show the maximum magnitude that was observed for each combination. Put the magnitude type along the columns.
  5. Calculate the rolling...

Further reading

Check out the following resources for more information on the topics that were covered in this chapter:
