Pandas 1.x Cookbook - Second Edition



Code to transform data

In this chapter, we will look at code that analyzes data from a survey that Kaggle ran in 2018. The survey asked Kaggle users about socio-economic information.

This section will present the survey data along with some code to analyze it. The subtitle for this data is "the most comprehensive dataset available on the state of machine learning and data science". Let's dig into this data and see what it has. The data was originally available at https://www.kaggle.com/kaggle/kaggle-survey-2018.

How to do it…

  1. Load the data into a DataFrame:
    >>> import pandas as pd
    >>> import numpy as np
    >>> import zipfile
    >>> url = 'data/kaggle-survey-2018.zip'
    >>> with zipfile.ZipFile(url) as z:
    ...     print(z.namelist())
    ...     kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
    ...     df = kag.iloc[1:]
    ['multipleChoiceResponses.csv', 'freeFormResponses...
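
    A quick sanity check is to look at the shape of the result and peek at a column. Q3, for example, holds the respondent's country:

    >>> df.shape            # number of responses and survey questions
    >>> df.Q3.head()        # country of residence for the first few respondents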

Apply performance

The .apply method on a Series or DataFrame is one of the slowest operations in pandas. In this recipe, we will explore its speed and see if we can debug what is going on.

How to do it…

  1. Let's time how long one use of the .apply method takes using the %%timeit cell magic in Jupyter. This is the code from the tweak_kag function that limits the cardinality of the country column (Q3):
    >>> %%timeit
    >>> def limit_countries(val):
    ...      if val in  {'United States of America', 'India', 'China'}:
    ...          return val
    ...      return 'Another'
    >>> q3 = df.Q3.apply(limit_countries).rename('Country')
    6.42 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  2. Let's look at using the .replace method instead of .apply and see if that improves performance:
    >>> %%timeit
    >>> other_values = df...
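
    The elided code builds the collection of country values to collapse and then uses .replace instead of .apply. One way to sketch this (the recipe's exact code may differ) is:

    >>> keep = {'United States of America', 'India', 'China'}
    >>> other_values = df.Q3[~df.Q3.isin(keep)].unique()
    >>> q3 = df.Q3.replace(other_values, 'Another').rename('Country')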

Improving apply performance with Dask, Pandarallel, Swifter, and more

Sometimes .apply is convenient, but it can be slow. Various libraries enable parallelizing such operations, and there are several mechanisms for speeding them up. The easiest is to leverage vectorization. Math operations are vectorized in pandas: if you add a number (say 5) to a numeric series, pandas will not loop over the values in Python, adding 5 to each one. Rather, it performs the operation over the whole series at once in optimized native code that takes advantage of modern CPU features.
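
For example, these two expressions produce the same result, but the first is vectorized while the second calls a Python function once per value:

    >>> ser = pd.Series(range(1_000_000))
    >>> ser + 5                          # vectorized: one native operation over the whole series
    >>> ser.apply(lambda x: x + 5)       # slow: a Python-level function call per element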

If you cannot vectorize, as is the case with our limit_countries function, you have other options. This section will show a few of them.

Note that you will need to install these libraries as they are not included with pandas.

The examples show how to limit the values in the country column from the survey data to a handful of categories.

How to do it…

  1. Import and initialize the Pandarallel library. This library tries to parallelize pandas operations across all available CPUs. Note that this library runs fine on Linux and...
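
    A minimal sketch of using pandarallel (assuming the limit_countries function from the earlier recipe is defined) looks like this:

    >>> from pandarallel import pandarallel
    >>> pandarallel.initialize()
    >>> q3 = df.Q3.parallel_apply(limit_countries).rename('Country')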

Inspecting code

The Jupyter environment has an extension that allows you to quickly pull up the documentation or the source code for a class, method, or function. I strongly encourage you to get used to using these features. If you can stay in the Jupyter environment to answer questions as they come up, you will increase your productivity.

In this section, we will show how to look at the source code for the .apply method. It is easiest to look at the documentation for a DataFrame or Series method directly on the DataFrame or Series object, respectively. Throughout this book, we have heavily recommended chaining operations on pandas objects. Sadly, Jupyter (like any other editor environment) is not able to perform code completion or look up documentation on the intermediate object returned from a chained method call. Hence the recommendation to perform the lookup directly on a method that is not chained.

How to do it…

  1. Load the survey data:
    >>>...
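
    In Jupyter, appending ? to an object pulls up its docstring and ?? pulls up its source (when the source is available). For example, to inspect the .apply method:

    >>> df.Q3.apply?      # show the documentation
    >>> df.Q3.apply??     # show the source code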

Debugging in Jupyter

The previous recipes have shown how to understand pandas code and inspect it from Jupyter. In this section, we will look at using the IPython debugger (ipdb) in Jupyter.

In this section, I will create a function that throws an error when I try to use it with the Series .apply method. I will use ipdb to debug it.

How to do it…

  1. Load the survey data:
    >>> import pandas as pd
    >>> import zipfile
    >>> url = 'data/kaggle-survey-2018.zip'
    >>> with zipfile.ZipFile(url) as z:
    ...     kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
    ...     df = kag.iloc[1:]
    
  2. Try and run a function to add one to a series:
    >>> def add1(x):
    ...     return x + 1
    >>> df.Q3.apply(add1)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-9-6ce28d2fea57> in <module>
      ...
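
    One way to drop into ipdb after the exception (a sketch of the general approach; the recipe's exact steps may differ) is the %debug magic, which opens a post-mortem session on the most recent traceback:

    >>> %debug
    ipdb> x     # inspect the value add1 choked on (a country string, not a number)
    ipdb> u     # move up the stack to the pandas code that called add1
    ipdb> q     # quit the debugger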

Managing data integrity with Great Expectations

Great Expectations is a third-party tool that allows you to capture and define the properties of a dataset. You can save these properties and then use them to validate future data to ensure data integrity. This can be very useful when building machine learning models, as new categorical data values and numeric outliers tend to cause a model to perform poorly or error out.

In this section, we will look at the Kaggle dataset and make an expectation suite to test and validate the data.

How to do it…

  1. Read the data using the tweak_kag function previously defined:
    >>> kag = tweak_kag(df)
    
  2. Use the Great Expectations from_pandas function to read in a Great Expectations DataFrame (a subclass of DataFrame with some extra methods):
    >>> import great_expectations as ge
    >>> kag_ge = ge.from_pandas(kag)
    
  3. Examine the extra methods on the DataFrame: ...
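
    The extra methods are the expect_* family. For example (a sketch that assumes tweak_kag produced the Country column from the earlier recipes), we can assert that the column only contains the categories we expect:

    >>> kag_ge.expect_column_values_to_be_in_set(
    ...     'Country',
    ...     {'United States of America', 'India', 'China', 'Another'})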

Using pytest with pandas

In this section, we will show how to test your pandas code by testing the artifacts it produces. We will use the third-party library pytest to do this testing.

For this recipe, we will not be using Jupyter, but rather the command line.

How to do it…

  1. Create a project data layout. The pytest library supports projects laid out in a couple of different styles. We will create a folder structure that looks like this:
    kag-demo-pytest/
    ├── data
    │   └── kaggle-survey-2018.zip
    ├── kag.py
    └── test
        └── test_kag.py
    

    The kag.py file has code to load the raw data and code to tweak it. It looks like this:

    import pandas as pd
    import zipfile
    def load_raw(zip_fname):
        with zipfile.ZipFile(zip_fname) as z:
            kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
            df = kag.iloc[1:]
        return...
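
    The test/test_kag.py file exercises this code. A minimal sketch of what it could contain (not necessarily the book's exact tests) is:

    import pytest
    from kag import load_raw

    @pytest.fixture(scope='session')
    def df():
        # load the raw survey responses once for the whole test session
        return load_raw('data/kaggle-survey-2018.zip')

    def test_load_raw_has_rows(df):
        assert len(df) > 0

    def test_country_column_present(df):
        assert 'Q3' in df.columns

    Running python -m pytest from the project root puts the current directory on sys.path, so the from kag import load_raw line resolves.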

Generating tests with Hypothesis

Hypothesis is a third-party library for generating tests, an approach known as property-based testing. You create a strategy (an object that generates samples of data) and then run your code against the strategy's generated output. You want to test an invariant: something about your data that you presume will always hold true.

Again, there could be a book written solely about this type of testing, but in this section we will show an example of using the library.

We will show how to generate Kaggle survey data, then run the tweak_kag function against that generated data to validate that the function will work on new data.

We will leverage the testing code found in the previous section. The Hypothesis library works with pytest, so we can use the same layout.

How to do it…

  1. Create a project data layout. If you have the code from the previous section, add a test_hypot.py file...
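
    A sketch of what test_hypot.py could contain (a simplified illustration; the recipe itself generates survey data to feed into tweak_kag) is a property-based test of the limit_countries logic from the earlier recipe:

    from hypothesis import given
    import hypothesis.strategies as st

    KEEP = {'United States of America', 'India', 'China'}

    def limit_countries(val):
        return val if val in KEEP else 'Another'

    @given(st.text())
    def test_limit_countries_invariant(val):
        # invariant: the result is always one of the allowed categories
        assert limit_countries(val) in KEEP | {'Another'}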