Pandas 1.x Cookbook - Second Edition



Code to transform data

In this chapter, we will look at code that analyzes data from a survey that Kaggle ran in 2018. The survey asked Kaggle users about socio-economic information.

This section will present the survey data along with some code to analyze it. The subtitle for this data is "the most comprehensive dataset available on the state of machine learning and data science". Let's dig into this data and see what it has. The data was originally available at https://www.kaggle.com/kaggle/kaggle-survey-2018.

How to do it…

  1. Load the data into a DataFrame:
    >>> import pandas as pd
    >>> import numpy as np
    >>> import zipfile
    >>> url = 'data/kaggle-survey-2018.zip'
    >>> with zipfile.ZipFile(url) as z:
    ...     print(z.namelist())
    ...     kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
    ...     df = kag.iloc[1:]
    ['multipleChoiceResponses.csv', 'freeFormResponses...
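
    A quick sanity check is to look at the shape of the result and peek at a column. Q3, for example, holds the respondent's country:

    >>> df.shape            # number of responses and survey questions
    >>> df.Q3.head()        # country of residence for the first few respondents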

Apply performance

The .apply method on a Series or DataFrame is one of the slowest operations in pandas. In this recipe, we will explore its speed and see if we can debug what is going on.

How to do it…

  1. Let's time how long one use of the .apply method takes using the %%timeit cell magic in Jupyter. This is the code from the tweak_kag function that limits the cardinality of the country column (Q3):
    >>> %%timeit
    >>> def limit_countries(val):
    ...      if val in  {'United States of America', 'India', 'China'}:
    ...          return val
    ...      return 'Another'
    >>> q3 = df.Q3.apply(limit_countries).rename('Country')
    6.42 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
  2. Let's look at using the .replace method instead of .apply and see if that improves performance:
    >>> %%timeit
    >>> other_values = df...
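
    The elided code builds the collection of country values to collapse and then uses .replace instead of .apply. One way to sketch this (the recipe's exact code may differ) is:

    >>> keep = {'United States of America', 'India', 'China'}
    >>> other_values = df.Q3[~df.Q3.isin(keep)].unique()
    >>> q3 = df.Q3.replace(other_values, 'Another').rename('Country')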

Improving apply performance with Dask, Pandarallel, Swifter, and more

Sometimes .apply is convenient, but it can be slow. Various libraries enable parallelizing such operations, and there are several mechanisms for speeding them up. The easiest is to leverage vectorization. Math operations are vectorized in pandas: if you add a number (say 5) to a numeric series, pandas will not loop over the values in Python, adding 5 to each one. Rather, it performs the operation over the whole series at once in optimized native code that takes advantage of modern CPU features.
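
For example, these two expressions produce the same result, but the first is vectorized while the second calls a Python function once per value:

    >>> ser = pd.Series(range(1_000_000))
    >>> ser + 5                          # vectorized: one native operation over the whole series
    >>> ser.apply(lambda x: x + 5)       # slow: a Python-level function call per element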

If you cannot vectorize, as is the case with our limit_countries function, you have other options. This section will show a few of them.

Note that you will need to install these libraries as they are not included with pandas.

The examples show how to limit the values in the country column from the survey data to a handful of categories.

How to do it…

  1. Import and initialize the Pandarallel library. This library tries to parallelize pandas operations across all available CPUs. Note that this library runs fine on Linux and...
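
    A minimal sketch of using pandarallel (assuming the limit_countries function from the earlier recipe is defined) looks like this:

    >>> from pandarallel import pandarallel
    >>> pandarallel.initialize()
    >>> q3 = df.Q3.parallel_apply(limit_countries).rename('Country')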

Inspecting code

The Jupyter environment has an extension that allows you to quickly pull up the documentation or the source code for a class, method, or function. I strongly encourage you to get used to using these features. If you can stay in the Jupyter environment to answer questions as they come up, you will increase your productivity.

In this section, we will show how to look at the source code for the .apply method. It is easiest to look at the documentation for a DataFrame or Series method directly on the DataFrame or Series object, respectively. Throughout this book, we have heavily recommended chaining operations on pandas objects. Sadly, Jupyter (like any other editor environment) is not able to perform code completion or look up documentation on the intermediate object returned from a chained method call. Hence the recommendation to perform the lookup directly on a method that is not chained.

How to do it…

  1. Load the survey data:
    >>>...
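
    In Jupyter, appending ? to an object pulls up its docstring and ?? pulls up its source (when the source is available). For example, to inspect the .apply method:

    >>> df.Q3.apply?      # show the documentation
    >>> df.Q3.apply??     # show the source code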

Debugging in Jupyter

The previous recipes have shown how to understand pandas code and inspect it from Jupyter. In this section, we will look at using the IPython debugger (ipdb) in Jupyter.

In this section, I will create a function that throws an error when I try to use it with the Series .apply method. I will use ipdb to debug it.

How to do it…

  1. Load the survey data:
    >>> import pandas as pd
    >>> import zipfile
    >>> url = 'data/kaggle-survey-2018.zip'
    >>> with zipfile.ZipFile(url) as z:
    ...     kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
    ...     df = kag.iloc[1:]
    
  2. Try and run a function to add one to a series:
    >>> def add1(x):
    ...     return x + 1
    >>> df.Q3.apply(add1)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-9-6ce28d2fea57> in <module>
      ...
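
    One way to drop into ipdb after the exception (a sketch of the general approach; the recipe's exact steps may differ) is the %debug magic, which opens a post-mortem session on the most recent traceback:

    >>> %debug
    ipdb> x     # inspect the value add1 choked on (a country string, not a number)
    ipdb> u     # move up the stack to the pandas code that called add1
    ipdb> q     # quit the debugger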

Managing data integrity with Great Expectations

Great Expectations is a third-party tool that allows you to capture and define the properties of a dataset. You can save these properties and then use them to validate future data to ensure data integrity. This can be very useful when building machine learning models, as new categorical data values and numeric outliers tend to cause a model to perform poorly or error out.

In this section, we will look at the Kaggle dataset and make an expectation suite to test and validate the data.

How to do it…

  1. Read the data using the tweak_kag function previously defined:
    >>> kag = tweak_kag(df)
    
  2. Use the Great Expectations from_pandas function to read in a Great Expectations DataFrame (a subclass of DataFrame with some extra methods):
    >>> import great_expectations as ge
    >>> kag_ge = ge.from_pandas(kag)
    
  3. Examine the extra methods on the DataFrame: ...
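
    The extra methods are the expect_* family. For example (a sketch that assumes tweak_kag produced the Country column from the earlier recipes), we can assert that the column only contains the categories we expect:

    >>> kag_ge.expect_column_values_to_be_in_set(
    ...     'Country',
    ...     {'United States of America', 'India', 'China', 'Another'})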

Using pytest with pandas

In this section, we will show how to test your pandas code by testing the artifacts it produces. We will use the third-party library pytest to do this testing.

For this recipe, we will not be using Jupyter, but rather the command line.

How to do it…

  1. Create a project data layout. The pytest library supports projects laid out in a couple of different styles. We will create a folder structure that looks like this:
    kag-demo-pytest/
    ├── data
    │   └── kaggle-survey-2018.zip
    ├── kag.py
    └── test
        └── test_kag.py
    

    The kag.py file has code to load the raw data and code to tweak it. It looks like this:

    import pandas as pd
    import zipfile
    def load_raw(zip_fname):
        with zipfile.ZipFile(zip_fname) as z:
            kag = pd.read_csv(z.open('multipleChoiceResponses.csv'))
            df = kag.iloc[1:]
        return...
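
    The test/test_kag.py file exercises this code. A minimal sketch of what it could contain (not necessarily the book's exact tests) is:

    import pytest
    from kag import load_raw

    @pytest.fixture(scope='session')
    def df():
        # load the raw survey responses once for the whole test session
        return load_raw('data/kaggle-survey-2018.zip')

    def test_load_raw_has_rows(df):
        assert len(df) > 0

    def test_country_column_present(df):
        assert 'Q3' in df.columns

    Running python -m pytest from the project root puts the current directory on sys.path, so the from kag import load_raw line resolves.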

Generating tests with Hypothesis

Hypothesis is a third-party library for generating tests, an approach known as property-based testing. You create a strategy (an object that generates samples of data) and then run your code against the strategy's generated output. You want to test an invariant: something about your data that you presume will always hold true.

Again, there could be a book written solely about this type of testing, but in this section we will show an example of using the library.

We will show how to generate Kaggle survey data, then run the tweak_kag function against that generated data to validate that the function will work on new data.

We will leverage the testing code found in the previous section. The Hypothesis library works with pytest, so we can use the same layout.

How to do it…

  1. Create a project data layout. If you have the code from the previous section, add a test_hypot.py file...
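
    A sketch of what test_hypot.py could contain (a simplified illustration; the recipe itself generates survey data to feed into tweak_kag) is a property-based test of the limit_countries logic from the earlier recipe:

    from hypothesis import given
    import hypothesis.strategies as st

    KEEP = {'United States of America', 'India', 'China'}

    def limit_countries(val):
        return val if val in KEEP else 'Another'

    @given(st.text())
    def test_limit_countries_invariant(val):
        # invariant: the result is always one of the allowed categories
        assert limit_countries(val) in KEEP | {'Another'}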