You're reading from Pandas 1.x Cookbook - Second Edition

Published in: Feb 2020
Publisher: Packt
ISBN-13: 9781839213106
Edition: 2nd Edition
Reading Level: Beginner

Authors (2):

Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.

Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.

Introduction

The datasets used in the preceding chapters required little or no work to change their structure; we began processing them immediately in their original shape. Many datasets in the wild, however, need a significant amount of restructuring before a more detailed analysis can commence. In some cases, an entire project might concern itself solely with formatting the data so that it can be easily processed by someone else.

There are many terms used to describe the process of data restructuring, with tidy data being the most common among data scientists. Tidy data is a term coined by Hadley Wickham to describe a form of data that makes analysis easy. This chapter covers many of the ideas formulated by Hadley and shows how to accomplish them with pandas. To learn more about tidy data, read Hadley's paper (http://vita.had.co.nz/papers/tidy-data.pdf).

The following is an example of untidy data:

...

Tidying variable values as column names with stack

To help understand the differences between tidy and messy data, let's take a look at a table that may or may not be in tidy form:

>>> import pandas as pd
>>> import numpy as np
>>> state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
>>> state_fruit
         Apple  Orange  Banana
Texas       12      10      40
Arizona      9       7      12
Florida      0      14     190

There does not appear to be anything messy about this table, and the information is easily consumable. However, according to the tidy principles, it isn't tidy. Each column name is the value of a variable. In fact, none of the variable names are even present in the DataFrame. One of the first steps to transform a messy dataset into tidy data is to identify all of the variables. In this particular dataset, we have variables for state and fruit. There's also the numeric data that wasn't...
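The remaining steps of this recipe are cut off above, so here is a minimal, self-contained sketch of the .stack approach, building the same table inline rather than reading the CSV (the value-column name weight is a label chosen for illustration):

```python
import pandas as pd

# The state_fruit table rebuilt inline (same values as the CSV above).
state_fruit = pd.DataFrame(
    {'Apple': [12, 9, 0], 'Orange': [10, 7, 14], 'Banana': [40, 12, 190]},
    index=['Texas', 'Arizona', 'Florida'])

# .stack pivots the column names into the innermost index level,
# producing a Series with a two-level index (state, fruit).
stacked = state_fruit.stack()

# Name the index levels and move them into columns to finish tidying.
tidy = (stacked
        .rename_axis(['state', 'fruit'])
        .reset_index(name='weight'))
print(tidy)
```

Each of the nine numbers now sits in its own row, with the state and fruit variables as proper columns.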

Tidying variable values as column names with melt

Like most large Python libraries, pandas has many different ways to accomplish the same task, the differences usually being readability and performance. A DataFrame has a method named .melt that is similar to the .stack method described in the previous recipe but gives a bit more flexibility.

In this recipe, we use the .melt method to tidy a DataFrame with variable values as column names.

How to do it…

  1. Read in the state_fruit2.csv dataset:
    >>> state_fruit2 = pd.read_csv('data/state_fruit2.csv')
    >>> state_fruit2
         State  Apple  Orange  Banana
    0    Texas     12      10      40
    1  Arizona      9       7      12
    2  Florida      0      14     190
    
  2. Use the .melt method by passing the appropriate columns to the id_vars and value_vars parameters:
    >>> state_fruit2.melt(id_vars=['State'],
    ...     value_vars=['Apple', '...
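Since the melt call above is truncated, here is a complete, runnable sketch with the frame built inline; the var_name and value_name labels are choices for illustration:

```python
import pandas as pd

# state_fruit2 rebuilt inline so the example runs without the CSV.
state_fruit2 = pd.DataFrame({
    'State': ['Texas', 'Arizona', 'Florida'],
    'Apple': [12, 9, 0],
    'Orange': [10, 7, 14],
    'Banana': [40, 12, 190]})

# id_vars stay as columns; value_vars are unpivoted into
# variable/value pairs. var_name and value_name label the new columns.
melted = state_fruit2.melt(id_vars=['State'],
                           value_vars=['Apple', 'Orange', 'Banana'],
                           var_name='Fruit',
                           value_name='Weight')
print(melted)
```

Unlike .stack, .melt works directly on columns without touching the index, which is the extra flexibility mentioned above.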

Stacking multiple groups of variables simultaneously

Some datasets contain multiple groups of variables as column names that need to be stacked simultaneously into their own columns. An example involving the movie dataset can help clarify this. Let's begin by selecting all columns containing the actor names and their corresponding Facebook likes:

>>> movie = pd.read_csv('data/movie.csv')
>>> actor = movie[['movie_title', 'actor_1_name',
...                'actor_2_name', 'actor_3_name',
...                'actor_1_facebook_likes',
...                'actor_2_facebook_likes',
...                'actor_3_facebook_likes']]
>>> actor.head()
                                  movie_title  ...
0                                      Avatar  ...
1    Pirates of the Caribbean: At World's End  ...
2                                     Spectre  ...
3                       The...
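The recipe's remaining steps are cut off here, but one way to stack both groups of columns (names and likes) in a single call is pd.wide_to_long. The sketch below uses a tiny made-up frame, with illustrative values and hypothetical renamed labels (actor_fb_likes), since wide_to_long expects the group number at the end of each column name:

```python
import pandas as pd

# A tiny stand-in for the actor columns (likes counts are illustrative).
actor = pd.DataFrame({
    'movie_title': ['Avatar', 'Spectre'],
    'actor_1_name': ['CCH Pounder', 'Christoph Waltz'],
    'actor_2_name': ['Joel David Moore', 'Rory Kinnear'],
    'actor_1_facebook_likes': [1000.0, 11000.0],
    'actor_2_facebook_likes': [936.0, 393.0]})

# Move the group number to the END of each column name:
# actor_1_name -> actor_1, actor_1_facebook_likes -> actor_fb_likes_1.
renamed = actor.rename(columns={
    'actor_1_name': 'actor_1',
    'actor_2_name': 'actor_2',
    'actor_1_facebook_likes': 'actor_fb_likes_1',
    'actor_2_facebook_likes': 'actor_fb_likes_2'})

# Both stub groups ('actor' and 'actor_fb_likes') are stacked at once;
# the trailing number lands in the new 'actor_num' column.
tidy = pd.wide_to_long(renamed,
                       stubnames=['actor', 'actor_fb_likes'],
                       i='movie_title',
                       j='actor_num',
                       sep='_').reset_index()
print(tidy)
```

Each movie now contributes one row per actor, with the name and the likes count aligned in the same row.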

Inverting stacked data

DataFrames have two similar methods, .stack and .melt, to convert horizontal column names into vertical column values. DataFrames can invert these two operations with the .unstack and .pivot methods, respectively. .stack and .unstack are methods that allow control over only the column and row indexes, while .melt and .pivot give more flexibility to choose which columns are reshaped.

In this recipe, we will call .stack and .melt on a dataset and promptly invert the operation with the .unstack and .pivot methods.

How to do it…

  1. Read in the college dataset with the institution name as the index, and with only the undergraduate race columns:
    >>> def usecol_func(name):
    ...     return 'UGDS_' in name or name == 'INSTNM'
    >>> college = pd.read_csv('data/college.csv',
    ...     index_col='INSTNM',
    ...     usecols=usecol_func)
    >>> college
                  UGDS_WHITE  UGDS_BLACK...
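The step above is truncated, so the round trips this recipe describes can be sketched on a small inline stand-in (the share values are illustrative, not from the college dataset):

```python
import pandas as pd

# Small stand-in frame: institution names as the index, two race columns.
college = pd.DataFrame(
    {'UGDS_WHITE': [0.0333, 0.5922], 'UGDS_BLACK': [0.9353, 0.2600]},
    index=pd.Index(['Alabama A & M', 'UAB'], name='INSTNM'))

# .stack moves the columns into the index; .unstack moves them back.
stacked = college.stack()
restored = stacked.unstack()

# .melt and .pivot form the same kind of inverse pair on columns.
melted = college.reset_index().melt(id_vars='INSTNM',
                                    var_name='race', value_name='share')
repivoted = melted.pivot(index='INSTNM', columns='race', values='share')
print(restored)
print(repivoted)
```

Both restored and repivoted contain the original values, row for row and column for column.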

Unstacking after a groupby aggregation

Grouping data by a single column and performing an aggregation on a single column returns a result that is easy to consume. When grouping by more than one column, a resulting aggregation might not be structured in a manner that makes consumption easy. Since .groupby operations, by default, put the unique grouping columns in the index, the .unstack method can be beneficial to rearrange the data so that it is presented in a manner that is more useful for interpretation.

In this recipe, we use the employee dataset to perform an aggregation, grouping by multiple columns. We then use the .unstack method to reshape the result into a format that makes for easier comparisons of different groups.

How to do it…

  1. Read in the employee dataset and find the mean salary by race:
    >>> employee = pd.read_csv('data/employee.csv')
    >>> (employee
    ...     .groupby('RACE')
    ...     ['BASE_SALARY...
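The rest of this step is cut off, so here is a self-contained sketch of the grouping-then-unstacking pattern on a minimal stand-in frame (the salary figures are made up for illustration):

```python
import pandas as pd

# Minimal stand-in for the employee data.
employee = pd.DataFrame({
    'RACE': ['White', 'White', 'Black', 'Black'],
    'GENDER': ['Female', 'Male', 'Female', 'Male'],
    'BASE_SALARY': [52000.0, 60000.0, 51000.0, 57000.0]})

# Grouping by two columns yields a two-level index; unstacking one
# level moves it into the columns for side-by-side comparison.
mean_by_group = (employee
                 .groupby(['RACE', 'GENDER'])
                 ['BASE_SALARY']
                 .mean())
wide = mean_by_group.unstack('GENDER')
print(wide)
```

The long two-level result becomes a grid with one row per race and one column per gender, which is far easier to scan.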

Replicating pivot_table with a groupby aggregation

At first glance, it may seem that the .pivot_table method provides a unique way to analyze data. However, after a little massaging, it is possible to replicate its functionality with the .groupby method. Knowing this equivalence can help shrink the universe of pandas functionality.

In this recipe, we use the flights dataset to create a pivot table and then recreate it using the .groupby method.

How to do it…

  1. Read in the flights dataset, and use the .pivot_table method to find the total number of canceled flights per origin airport for each airline:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> fpt = flights.pivot_table(index='AIRLINE',
    ...     columns='ORG_AIR',
    ...     values='CANCELLED',
    ...     aggfunc='sum',
    ...     fill_value=0)
    >>> fpt
    ORG_AIR  ATL  DEN  DFW  IAH  LAS  LAX  MSP  ORD  PHX  SFO
    AIRLINE
    AA   ...
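The rest of the recipe is truncated, but the equivalence it describes can be sketched directly; the frame below is a small illustrative stand-in for flights.csv:

```python
import pandas as pd

# A few illustrative flight rows standing in for the real dataset.
flights = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'UA', 'UA', 'AA'],
    'ORG_AIR': ['ATL', 'DFW', 'ATL', 'ATL', 'DFW'],
    'CANCELLED': [0, 1, 1, 0, 1]})

fpt = flights.pivot_table(index='AIRLINE', columns='ORG_AIR',
                          values='CANCELLED', aggfunc='sum', fill_value=0)

# The same table via groupby: aggregate over both keys, then unstack
# the inner key into the columns, filling the holes with 0.
fgb = (flights
       .groupby(['AIRLINE', 'ORG_AIR'])
       ['CANCELLED']
       .sum()
       .unstack('ORG_AIR', fill_value=0))

# The two results are identical.
assert fpt.equals(fgb)
```

In other words, a pivot table is a groupby aggregation followed by an unstack of the columns key.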

Renaming axis levels for easy reshaping

Reshaping with the .stack and .unstack methods is far easier when each axis (both index and column) level has a name. pandas allows users to reference each axis level by integer location or by name. Since integer location is implicit and not explicit, you should consider using level names whenever possible. This advice follows from The Zen of Python (type import this if you are not familiar with it), a short list of guiding principles for Python, of which the second one is "Explicit is better than implicit."

When grouping or aggregating with multiple columns, the resulting pandas object will have multiple levels in one or both of the axes. In this recipe, we will name each level of each axis and then use the .stack and .unstack methods to reshape the data to the desired form.

How to do it…

  1. Read in the college dataset, and find a few basic summary statistics on the undergraduate population and SAT math...
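Since the step above is truncated, the naming-then-reshaping idea can be sketched on a small stand-in result (the state abbreviations and statistics are made up; the level names STABBR and STAT are choices for illustration):

```python
import pandas as pd

# A stand-in aggregation result whose axis levels start out unnamed.
agg = pd.DataFrame([[0.0333, 420.0], [0.5922, 565.0]],
                   index=pd.Index(['AL', 'AK']),
                   columns=pd.Index(['UGDS', 'SATMTMID']))

# Name both axes, then reshape by level NAME rather than integer position.
agg = (agg
       .rename_axis('STABBR', axis='index')
       .rename_axis('STAT', axis='columns'))

# Explicit is better than implicit: stack the level named 'STAT'.
stacked = agg.stack('STAT')
print(stacked)
```

Referring to 'STAT' by name keeps the reshape readable even when more levels accumulate.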

Tidying when multiple variables are stored as column names

One particular flavor of messy data appears whenever the column names contain multiple different variables themselves. A common example of this scenario occurs when age and sex are concatenated together. To tidy datasets like this, we must manipulate the columns with the pandas .str attribute. This attribute contains additional methods for string processing.

In this recipe, we will first identify all the variables, of which some will be concatenated together as column names. We then reshape the data and parse the text to extract the correct variable values.

How to do it…

  1. Read in the men's weightlifting dataset, and identify the variables:
    >>> weightlifting = pd.read_csv('data/weightlifting_men.csv')
    >>> weightlifting
      Weight Category  M35 35-39  ...  M75 75-79  M80 80+
    0           56           137  ...         62       55
    1           62           152...
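The recipe's remaining steps are cut off, so here is a sketch of the melt-then-parse approach on two of the columns shown above (the other age groups are omitted; the new column labels are choices for illustration):

```python
import pandas as pd

# Two of the weightlifting columns rebuilt inline from the rows above.
weightlifting = pd.DataFrame({
    'Weight Category': [56, 62],
    'M35 35-39': [137, 152],
    'M75 75-79': [62, 55]})

# Melt first, then split the combined "sex/age group" header apart.
melted = weightlifting.melt(id_vars='Weight Category',
                            var_name='sex_age', value_name='Qual Total')

# 'M35 35-39' -> sex 'M' and age group '35-39' via the .str accessor.
parts = melted['sex_age'].str.split(expand=True)
melted['Sex'] = parts[0].str[0]          # first character: 'M'
melted['Age Group'] = parts[1]
tidy = melted.drop(columns='sex_age')
print(tidy)
```

The two variables hidden in each column name now occupy their own columns.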

Tidying when multiple variables are stored as a single column

Tidy datasets must have a single column for each variable. Occasionally, multiple variable names are placed in a single column with their corresponding value placed in another.

In this recipe, we identify the column containing the improperly structured variables and pivot it to create tidy data.

How to do it…

  1. Read in the restaurant inspections dataset, and convert the Date column data type to datetime64:
    >>> inspections = pd.read_csv('data/restaurant_inspections.csv',
    ...     parse_dates=['Date'])
    >>> inspections
                                  Name  ...
    0                E & E Grill House  ...
    1                E & E Grill House  ...
    2                E & E Grill House  ...
    3                E & E Grill House  ...
    4                E & E Grill House  ...
    ..                             ...  ...
    495  PIER SIXTY ONE-THE LIGHTHOUSE  ...
    496...
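The inspections output above is truncated, so the sketch below assumes a hypothetical long layout in which the variable names sit in an Info column and their values in a Value column (both column names, and the sample values, are illustrative):

```python
import pandas as pd

# Hypothetical long frame: variable names in 'Info', values in 'Value'.
inspections = pd.DataFrame({
    'Name': ['E & E Grill House'] * 4,
    'Info': ['Borough', 'Cuisine', 'Grade', 'Score'],
    'Value': ['MANHATTAN', 'American', 'A', '9.0']})

# Pivot the variable-name column out into real columns, one per variable.
tidy = (inspections
        .pivot_table(index='Name', columns='Info',
                     values='Value', aggfunc='first')
        .reset_index()
        .rename_axis(None, axis='columns'))
print(tidy)
```

aggfunc='first' lets pivot_table handle the string values; with a plain .pivot the Name/Info pairs would need to be unique.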

Tidying when two or more values are stored in the same cell

Tabular data, by nature, is two-dimensional, and thus, there is a limited amount of information that can be presented in a single cell. As a workaround, you will occasionally see datasets with more than a single value stored in the same cell. Tidy data allows for just a single value for each cell. To rectify these situations, you will typically need to parse the string data into multiple columns with the methods from the .str attribute.

In this recipe, we examine a dataset that has a column containing multiple different variables in each cell. We use the .str attribute to parse these strings into separate columns to tidy the data.

How to do it...

  1. Read in the Texas cities dataset:
    >>> cities = pd.read_csv('data/texas_cities.csv')
    >>> cities
          City             Geolocation
    0  Houston  29.7604° N, 95.3698° W
    1   Dallas  32.7767° N, 96.7970° W
    2   Austin...
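The steps that follow are cut off, so here is a sketch of the parsing this recipe describes, rebuilding the two rows shown above and using one regular expression with named groups (the group names are labels chosen for illustration):

```python
import pandas as pd

# The cities table rebuilt inline from the rows shown above.
cities = pd.DataFrame({
    'City': ['Houston', 'Dallas'],
    'Geolocation': ['29.7604° N, 95.3698° W', '32.7767° N, 96.7970° W']})

# One regex pulls all four values out of each cell; .str.extract turns
# each named group into its own column.
geo = cities['Geolocation'].str.extract(
    r'(?P<latitude>[0-9.]+).\s(?P<latitude_direction>[NS]), '
    r'(?P<longitude>[0-9.]+).\s(?P<longitude_direction>[EW])')

# The numbers arrive as strings; convert them to floats.
geo = geo.astype({'latitude': 'float64', 'longitude': 'float64'})
tidy = pd.concat([cities['City'], geo], axis=1)
print(tidy)
```

Each cell's four pieces of information now occupy four separate, properly typed columns.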

Tidying when variables are stored in column names and values

One particularly difficult form of messy data to diagnose appears whenever variables are stored both horizontally across the column names and vertically down column values. This type of dataset is usually found not in a database but in a summarized report that someone else has already generated.

In this recipe, we reshape the data into tidy form with the .melt and .pivot_table methods.

How to do it…

  1. Read in the sensors dataset:
    >>> sensors = pd.read_csv('data/sensors.csv')
    >>> sensors
      Group     Property  2012  2013  2014  2015  2016
    0     A     Pressure   928   873   814   973   870
    1     A  Temperature  1026  1038  1009  1036  1042
    2     A         Flow   819   806   861   882   856
    3     B     Pressure   817   877   914   806   942
    4     B  Temperature  1008  1041  1009  1002  1013
    5     B         Flow   887   899   837   824   873
    
  2. The only variable...
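The remaining steps are cut off here, but the two-stage reshape this recipe describes can be sketched in full, rebuilding the sensors table from the rows shown above (the column labels Year and reading are choices for illustration):

```python
import pandas as pd

# The sensors table rebuilt inline from the rows shown above.
sensors = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Property': ['Pressure', 'Temperature', 'Flow'] * 2,
    '2012': [928, 1026, 819, 817, 1008, 887],
    '2013': [873, 1038, 806, 877, 1041, 899],
    '2014': [814, 1009, 861, 914, 1009, 837],
    '2015': [973, 1036, 882, 806, 1002, 824],
    '2016': [870, 1042, 856, 942, 1013, 873]})

# Stage 1: melt the year columns (a variable stored in column NAMES).
melted = sensors.melt(id_vars=['Group', 'Property'],
                      var_name='Year', value_name='reading')

# Stage 2: pivot the Property column (variables stored in column VALUES)
# back out into real columns.
tidy = (melted
        .pivot_table(index=['Group', 'Year'],
                     columns='Property', values='reading')
        .reset_index()
        .rename_axis(None, axis='columns'))
print(tidy)
```

After both stages, each row holds one group-year observation with Flow, Pressure, and Temperature as proper columns.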