You're reading from Pandas 1.x Cookbook - Second Edition

Published in: Feb 2020
Publisher: Packt
ISBN-13: 9781839213106
Edition: 2nd Edition
Reading Level: Beginner

Authors (2):

Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.

Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.

Introduction

The datasets used in the preceding chapters required little or no work to change their structure; we began processing them immediately in their original shape. Many datasets in the wild, however, need a significant amount of restructuring before a more detailed analysis can commence. In some cases, an entire project might concern itself solely with formatting the data so that it can be easily processed by someone else.

There are many terms used to describe the process of data restructuring, with tidy data being the most common among data scientists. Tidy data is a term coined by Hadley Wickham to describe a form of data that makes analysis easy. This chapter covers many of the ideas formulated by Hadley and shows how to accomplish them with pandas. To learn more about tidy data, read Hadley's paper (http://vita.had.co.nz/papers/tidy-data.pdf).

The following is an example of untidy data:

...

Tidying variable values as column names with stack

To help understand the differences between tidy and messy data, let's take a look at a table that may or may not be in tidy form:

>>> import pandas as pd
>>> import numpy as np
>>> state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
>>> state_fruit
         Apple  Orange  Banana
Texas       12      10      40
Arizona      9       7      12
Florida      0      14     190

There does not appear to be anything messy about this table, and the information is easily consumable. However, according to the tidy principles, it isn't tidy. Each column name is the value of a variable. In fact, none of the variable names are even present in the DataFrame. One of the first steps to transform a messy dataset into tidy data is to identify all of the variables. In this particular dataset, we have variables for state and fruit. There's also the numeric data that wasn't...
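The remaining steps of this recipe are cut off above, so here is a minimal, self-contained sketch of the .stack approach, building the same table inline rather than reading the CSV (the value-column name weight is a label chosen for illustration):

```python
import pandas as pd

# The state_fruit table rebuilt inline (same values as the CSV above).
state_fruit = pd.DataFrame(
    {'Apple': [12, 9, 0], 'Orange': [10, 7, 14], 'Banana': [40, 12, 190]},
    index=['Texas', 'Arizona', 'Florida'])

# .stack pivots the column names into the innermost index level,
# producing a Series with a two-level index (state, fruit).
stacked = state_fruit.stack()

# Name the index levels and move them into columns to finish tidying.
tidy = (stacked
        .rename_axis(['state', 'fruit'])
        .reset_index(name='weight'))
print(tidy)
```

Each of the nine numbers now sits in its own row, with the state and fruit variables as proper columns.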

Tidying variable values as column names with melt

Like most large Python libraries, pandas has many different ways to accomplish the same task, the differences usually being readability and performance. A DataFrame has a method named .melt that is similar to the .stack method described in the previous recipe but gives a bit more flexibility.

In this recipe, we use the .melt method to tidy a DataFrame with variable values as column names.

How to do it…

  1. Read in the state_fruit2.csv dataset:
    >>> state_fruit2 = pd.read_csv('data/state_fruit2.csv')
    >>> state_fruit2
         State  Apple  Orange  Banana
    0    Texas     12      10      40
    1  Arizona      9       7      12
    2  Florida      0      14     190
    
  2. Use the .melt method by passing the appropriate columns to the id_vars and value_vars parameters:
    >>> state_fruit2.melt(id_vars=['State'],
    ...     value_vars=['Apple', '...
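Since the melt call above is truncated, here is a complete, runnable sketch with the frame built inline; the var_name and value_name labels are choices for illustration:

```python
import pandas as pd

# state_fruit2 rebuilt inline so the example runs without the CSV.
state_fruit2 = pd.DataFrame({
    'State': ['Texas', 'Arizona', 'Florida'],
    'Apple': [12, 9, 0],
    'Orange': [10, 7, 14],
    'Banana': [40, 12, 190]})

# id_vars stay as columns; value_vars are unpivoted into
# variable/value pairs. var_name and value_name label the new columns.
melted = state_fruit2.melt(id_vars=['State'],
                           value_vars=['Apple', 'Orange', 'Banana'],
                           var_name='Fruit',
                           value_name='Weight')
print(melted)
```

Unlike .stack, .melt works directly on columns without touching the index, which is the extra flexibility mentioned above.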

Stacking multiple groups of variables simultaneously

Some datasets contain multiple groups of variables as column names that need to be stacked simultaneously into their own columns. An example involving the movie dataset can help clarify this. Let's begin by selecting all columns containing the actor names and their corresponding Facebook likes:

>>> movie = pd.read_csv('data/movie.csv')
>>> actor = movie[['movie_title', 'actor_1_name',
...                'actor_2_name', 'actor_3_name',
...                'actor_1_facebook_likes',
...                'actor_2_facebook_likes',
...                'actor_3_facebook_likes']]
>>> actor.head()
                                  movie_title  ...
0                                      Avatar  ...
1    Pirates of the Caribbean: At World's End  ...
2                                     Spectre  ...
3                       The...
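The recipe's remaining steps are cut off here, but one way to stack both groups of columns (names and likes) in a single call is pd.wide_to_long. The sketch below uses a tiny made-up frame, with illustrative values and hypothetical renamed labels (actor_fb_likes), since wide_to_long expects the group number at the end of each column name:

```python
import pandas as pd

# A tiny stand-in for the actor columns (likes counts are illustrative).
actor = pd.DataFrame({
    'movie_title': ['Avatar', 'Spectre'],
    'actor_1_name': ['CCH Pounder', 'Christoph Waltz'],
    'actor_2_name': ['Joel David Moore', 'Rory Kinnear'],
    'actor_1_facebook_likes': [1000.0, 11000.0],
    'actor_2_facebook_likes': [936.0, 393.0]})

# Move the group number to the END of each column name:
# actor_1_name -> actor_1, actor_1_facebook_likes -> actor_fb_likes_1.
renamed = actor.rename(columns={
    'actor_1_name': 'actor_1',
    'actor_2_name': 'actor_2',
    'actor_1_facebook_likes': 'actor_fb_likes_1',
    'actor_2_facebook_likes': 'actor_fb_likes_2'})

# Both stub groups ('actor' and 'actor_fb_likes') are stacked at once;
# the trailing number lands in the new 'actor_num' column.
tidy = pd.wide_to_long(renamed,
                       stubnames=['actor', 'actor_fb_likes'],
                       i='movie_title',
                       j='actor_num',
                       sep='_').reset_index()
print(tidy)
```

Each movie now contributes one row per actor, with the name and the likes count aligned in the same row.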

Inverting stacked data

DataFrames have two similar methods, .stack and .melt, to convert horizontal column names into vertical column values. DataFrames can invert these two operations with the .unstack and .pivot methods, respectively. .stack and .unstack are methods that allow control over only the column and row indexes, while .melt and .pivot give more flexibility to choose which columns are reshaped.

In this recipe, we will call .stack and .melt on a dataset and promptly invert the operation with the .unstack and .pivot methods.

How to do it…

  1. Read in the college dataset with the institution name as the index, and with only the undergraduate race columns:
    >>> def usecol_func(name):
    ...     return 'UGDS_' in name or name == 'INSTNM'
    >>> college = pd.read_csv('data/college.csv',
    ...     index_col='INSTNM',
    ...     usecols=usecol_func)
    >>> college
                  UGDS_WHITE  UGDS_BLACK...
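The step above is truncated, so the round trips this recipe describes can be sketched on a small inline stand-in (the share values are illustrative, not from the college dataset):

```python
import pandas as pd

# Small stand-in frame: institution names as the index, two race columns.
college = pd.DataFrame(
    {'UGDS_WHITE': [0.0333, 0.5922], 'UGDS_BLACK': [0.9353, 0.2600]},
    index=pd.Index(['Alabama A & M', 'UAB'], name='INSTNM'))

# .stack moves the columns into the index; .unstack moves them back.
stacked = college.stack()
restored = stacked.unstack()

# .melt and .pivot form the same kind of inverse pair on columns.
melted = college.reset_index().melt(id_vars='INSTNM',
                                    var_name='race', value_name='share')
repivoted = melted.pivot(index='INSTNM', columns='race', values='share')
print(restored)
print(repivoted)
```

Both restored and repivoted contain the original values, row for row and column for column.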

Unstacking after a groupby aggregation

Grouping data by a single column and performing an aggregation on a single column returns a result that is easy to consume. When grouping by more than one column, a resulting aggregation might not be structured in a manner that makes consumption easy. Since .groupby operations, by default, put the unique grouping columns in the index, the .unstack method can be beneficial to rearrange the data so that it is presented in a manner that is more useful for interpretation.

In this recipe, we use the employee dataset to perform an aggregation, grouping by multiple columns. We then use the .unstack method to reshape the result into a format that makes for easier comparisons of different groups.

How to do it…

  1. Read in the employee dataset and find the mean salary by race:
    >>> employee = pd.read_csv('data/employee.csv')
    >>> (employee
    ...     .groupby('RACE')
    ...     ['BASE_SALARY...
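The rest of this step is cut off, so here is a self-contained sketch of the grouping-then-unstacking pattern on a minimal stand-in frame (the salary figures are made up for illustration):

```python
import pandas as pd

# Minimal stand-in for the employee data.
employee = pd.DataFrame({
    'RACE': ['White', 'White', 'Black', 'Black'],
    'GENDER': ['Female', 'Male', 'Female', 'Male'],
    'BASE_SALARY': [52000.0, 60000.0, 51000.0, 57000.0]})

# Grouping by two columns yields a two-level index; unstacking one
# level moves it into the columns for side-by-side comparison.
mean_by_group = (employee
                 .groupby(['RACE', 'GENDER'])
                 ['BASE_SALARY']
                 .mean())
wide = mean_by_group.unstack('GENDER')
print(wide)
```

The long two-level result becomes a grid with one row per race and one column per gender, which is far easier to scan.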

Replicating pivot_table with a groupby aggregation

At first glance, it may seem that the .pivot_table method provides a unique way to analyze data. However, after a little massaging, it is possible to replicate its functionality with the .groupby method. Knowing this equivalence can help shrink the universe of pandas functionality.

In this recipe, we use the flights dataset to create a pivot table and then recreate it using the .groupby method.

How to do it…

  1. Read in the flights dataset, and use the .pivot_table method to find the total number of canceled flights per origin airport for each airline:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> fpt = flights.pivot_table(index='AIRLINE',
    ...     columns='ORG_AIR',
    ...     values='CANCELLED',
    ...     aggfunc='sum',
    ...     fill_value=0)
    >>> fpt
    ORG_AIR  ATL  DEN  DFW  IAH  LAS  LAX  MSP  ORD  PHX  SFO
    AIRLINE
    AA   ...
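The rest of the recipe is truncated, but the equivalence it describes can be sketched directly; the frame below is a small illustrative stand-in for flights.csv:

```python
import pandas as pd

# A few illustrative flight rows standing in for the real dataset.
flights = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'UA', 'UA', 'AA'],
    'ORG_AIR': ['ATL', 'DFW', 'ATL', 'ATL', 'DFW'],
    'CANCELLED': [0, 1, 1, 0, 1]})

fpt = flights.pivot_table(index='AIRLINE', columns='ORG_AIR',
                          values='CANCELLED', aggfunc='sum', fill_value=0)

# The same table via groupby: aggregate over both keys, then unstack
# the inner key into the columns, filling the holes with 0.
fgb = (flights
       .groupby(['AIRLINE', 'ORG_AIR'])
       ['CANCELLED']
       .sum()
       .unstack('ORG_AIR', fill_value=0))

# The two results are identical.
assert fpt.equals(fgb)
```

In other words, a pivot table is a groupby aggregation followed by an unstack of the columns key.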

Renaming axis levels for easy reshaping

Reshaping with the .stack and .unstack methods is far easier when each axis (both index and column) level has a name. pandas allows users to reference each axis level by integer location or by name. Since integer location is implicit and not explicit, you should consider using level names whenever possible. This advice follows from The Zen of Python (type import this if you are not familiar with it), a short list of guiding principles for Python, of which the second one is "Explicit is better than implicit."

When grouping or aggregating with multiple columns, the resulting pandas object will have multiple levels in one or both of the axes. In this recipe, we will name each level of each axis and then use the .stack and .unstack methods to reshape the data to the desired form.

How to do it…

  1. Read in the college dataset, and find a few basic summary statistics on the undergraduate population and SAT math...
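Since the step above is truncated, the naming-then-reshaping idea can be sketched on a small stand-in result (the state abbreviations and statistics are made up; the level names STABBR and STAT are choices for illustration):

```python
import pandas as pd

# A stand-in aggregation result whose axis levels start out unnamed.
agg = pd.DataFrame([[0.0333, 420.0], [0.5922, 565.0]],
                   index=pd.Index(['AL', 'AK']),
                   columns=pd.Index(['UGDS', 'SATMTMID']))

# Name both axes, then reshape by level NAME rather than integer position.
agg = (agg
       .rename_axis('STABBR', axis='index')
       .rename_axis('STAT', axis='columns'))

# Explicit is better than implicit: stack the level named 'STAT'.
stacked = agg.stack('STAT')
print(stacked)
```

Referring to 'STAT' by name keeps the reshape readable even when more levels accumulate.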

Tidying when multiple variables are stored as column names

One particular flavor of messy data appears whenever the column names contain multiple different variables themselves. A common example of this scenario occurs when age and sex are concatenated together. To tidy datasets like this, we must manipulate the columns with the pandas .str attribute. This attribute contains additional methods for string processing.

In this recipe, we will first identify all the variables, of which some will be concatenated together as column names. We then reshape the data and parse the text to extract the correct variable values.

How to do it…

  1. Read in the men's weightlifting dataset, and identify the variables:
    >>> weightlifting = pd.read_csv('data/weightlifting_men.csv')
    >>> weightlifting
      Weight Category  M35 35-39  ...  M75 75-79  M80 80+
    0           56           137  ...         62       55
    1           62           152...
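The recipe's remaining steps are cut off, so here is a sketch of the melt-then-parse approach on two of the columns shown above (the other age groups are omitted; the new column labels are choices for illustration):

```python
import pandas as pd

# Two of the weightlifting columns rebuilt inline from the rows above.
weightlifting = pd.DataFrame({
    'Weight Category': [56, 62],
    'M35 35-39': [137, 152],
    'M75 75-79': [62, 55]})

# Melt first, then split the combined "sex/age group" header apart.
melted = weightlifting.melt(id_vars='Weight Category',
                            var_name='sex_age', value_name='Qual Total')

# 'M35 35-39' -> sex 'M' and age group '35-39' via the .str accessor.
parts = melted['sex_age'].str.split(expand=True)
melted['Sex'] = parts[0].str[0]          # first character: 'M'
melted['Age Group'] = parts[1]
tidy = melted.drop(columns='sex_age')
print(tidy)
```

The two variables hidden in each column name now occupy their own columns.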

Tidying when multiple variables are stored as a single column

Tidy datasets must have a single column for each variable. Occasionally, multiple variable names are placed in a single column with their corresponding value placed in another.

In this recipe, we identify the column containing the improperly structured variables and pivot it to create tidy data.

How to do it…

  1. Read in the restaurant inspections dataset, and convert the Date column data type to datetime64:
    >>> inspections = pd.read_csv('data/restaurant_inspections.csv',
    ...     parse_dates=['Date'])
    >>> inspections
                                  Name  ...
    0                E & E Grill House  ...
    1                E & E Grill House  ...
    2                E & E Grill House  ...
    3                E & E Grill House  ...
    4                E & E Grill House  ...
    ..                             ...  ...
    495  PIER SIXTY ONE-THE LIGHTHOUSE  ...
    496...
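The inspections output above is truncated, so the sketch below assumes a hypothetical long layout in which the variable names sit in an Info column and their values in a Value column (both column names, and the sample values, are illustrative):

```python
import pandas as pd

# Hypothetical long frame: variable names in 'Info', values in 'Value'.
inspections = pd.DataFrame({
    'Name': ['E & E Grill House'] * 4,
    'Info': ['Borough', 'Cuisine', 'Grade', 'Score'],
    'Value': ['MANHATTAN', 'American', 'A', '9.0']})

# Pivot the variable-name column out into real columns, one per variable.
tidy = (inspections
        .pivot_table(index='Name', columns='Info',
                     values='Value', aggfunc='first')
        .reset_index()
        .rename_axis(None, axis='columns'))
print(tidy)
```

aggfunc='first' lets pivot_table handle the string values; with a plain .pivot the Name/Info pairs would need to be unique.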

Tidying when two or more values are stored in the same cell

Tabular data, by nature, is two-dimensional, and thus, there is a limited amount of information that can be presented in a single cell. As a workaround, you will occasionally see datasets with more than a single value stored in the same cell. Tidy data allows for just a single value for each cell. To rectify these situations, you will typically need to parse the string data into multiple columns with the methods from the .str attribute.

In this recipe, we examine a dataset that has a column containing multiple different variables in each cell. We use the .str attribute to parse these strings into separate columns to tidy the data.

How to do it...

  1. Read in the Texas cities dataset:
    >>> cities = pd.read_csv('data/texas_cities.csv')
    >>> cities
          City             Geolocation
    0  Houston  29.7604° N, 95.3698° W
    1   Dallas  32.7767° N, 96.7970° W
    2   Austin...
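The steps that follow are cut off, so here is a sketch of the parsing this recipe describes, rebuilding the two rows shown above and using one regular expression with named groups (the group names are labels chosen for illustration):

```python
import pandas as pd

# The cities table rebuilt inline from the rows shown above.
cities = pd.DataFrame({
    'City': ['Houston', 'Dallas'],
    'Geolocation': ['29.7604° N, 95.3698° W', '32.7767° N, 96.7970° W']})

# One regex pulls all four values out of each cell; .str.extract turns
# each named group into its own column.
geo = cities['Geolocation'].str.extract(
    r'(?P<latitude>[0-9.]+).\s(?P<latitude_direction>[NS]), '
    r'(?P<longitude>[0-9.]+).\s(?P<longitude_direction>[EW])')

# The numbers arrive as strings; convert them to floats.
geo = geo.astype({'latitude': 'float64', 'longitude': 'float64'})
tidy = pd.concat([cities['City'], geo], axis=1)
print(tidy)
```

Each cell's four pieces of information now occupy four separate, properly typed columns.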

Tidying when variables are stored in column names and values

One particularly difficult form of messy data to diagnose appears whenever variables are stored both horizontally across the column names and vertically down column values. This type of dataset is usually found not in a database but in a summarized report that someone else has already generated.

In this recipe, we reshape the data into tidy form with the .melt and .pivot_table methods.

How to do it…

  1. Read in the sensors dataset:
    >>> sensors = pd.read_csv('data/sensors.csv')
    >>> sensors
      Group     Property  2012  2013  2014  2015  2016
    0     A     Pressure   928   873   814   973   870
    1     A  Temperature  1026  1038  1009  1036  1042
    2     A         Flow   819   806   861   882   856
    3     B     Pressure   817   877   914   806   942
    4     B  Temperature  1008  1041  1009  1002  1013
    5     B         Flow   887   899   837   824   873
    
  2. The only variable...
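The remaining steps are cut off here, but the two-stage reshape this recipe describes can be sketched in full, rebuilding the sensors table from the rows shown above (the column labels Year and reading are choices for illustration):

```python
import pandas as pd

# The sensors table rebuilt inline from the rows shown above.
sensors = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Property': ['Pressure', 'Temperature', 'Flow'] * 2,
    '2012': [928, 1026, 819, 817, 1008, 887],
    '2013': [873, 1038, 806, 877, 1041, 899],
    '2014': [814, 1009, 861, 914, 1009, 837],
    '2015': [973, 1036, 882, 806, 1002, 824],
    '2016': [870, 1042, 856, 942, 1013, 873]})

# Stage 1: melt the year columns (a variable stored in column NAMES).
melted = sensors.melt(id_vars=['Group', 'Property'],
                      var_name='Year', value_name='reading')

# Stage 2: pivot the Property column (variables stored in column VALUES)
# back out into real columns.
tidy = (melted
        .pivot_table(index=['Group', 'Year'],
                     columns='Property', values='reading')
        .reset_index()
        .rename_axis(None, axis='columns'))
print(tidy)
```

After both stages, each row holds one group-year observation with Flow, Pressure, and Temperature as proper columns.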