Reader small image

You're reading from  Pandas 1.x Cookbook - Second Edition

Product typeBook
Published inFeb 2020
Reading LevelBeginner
PublisherPackt
ISBN-139781839213106
Edition2nd Edition
Languages
Tools
Right arrow
Authors (2):
Matt Harrison
Matt Harrison
author image
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

Theodore Petrou
Theodore Petrou
author image
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou

View More author details
Right arrow

Transposing the direction of a DataFrame operation

Many DataFrame methods have an axis parameter. This parameter controls the direction in which the operation takes place. Axis parameters can be 'index' (or 0) or 'columns' (or 1). I prefer the string versions are they are more explicit and tend to make the code easier to read.

Nearly all DataFrame methods default the axis parameter to 0, which applies to operations along the index. This recipe shows you how to invoke the same method along both axes.

How to do it...

  1. Read in the college dataset; the columns that begin with UGDS represent the percentage of the undergraduate students of a particular race. Use the filter method to select these columns:
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    >>> college_ugds.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
    Universit...      0.5922      0.2600  ...    0.0179     0.0100
    Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
    Universit...      0.6988      0.1255  ...    0.0332     0.0350
    Alabama S...      0.0158      0.9208  ...    0.0243     0.0137
    
  2. Now that the DataFrame contains homogenous column data, operations can be sensibly done both vertically and horizontally. The .count method returns the number of non-missing values. By default, its axis parameter is set to 0:
    >>> college_ugds.count()
    UGDS_WHITE    6874
    UGDS_BLACK    6874
    UGDS_HISP     6874
    UGDS_ASIAN    6874
    UGDS_AIAN     6874
    UGDS_NHPI     6874
    UGDS_2MOR     6874
    UGDS_NRA      6874
    UGDS_UNKN     6874
    dtype: int64
    

    The axis parameter is almost always set to 0. So, step 2 is equivalent to both college_ugds.count(axis=0) and college_ugds.count(axis='index').

  3. Changing the axis parameter to 'columns' changes the direction of the operation so that we get back a count of non-missing items in each row:
    >>> college_ugds.count(axis="columns").head()
    INSTNM
    Alabama A & M University               9
    University of Alabama at Birmingham    9
    Amridge University                     9
    University of Alabama in Huntsville    9
    Alabama State University               9
    dtype: int64
    
  4. Instead of counting non-missing values, we can sum all the values in each row. Each row of percentages should add up to 1. The .sum method may be used to verify this:
    >>> college_ugds.sum(axis="columns").head()
    INSTNM
    Alabama A & M University               1.0000
    University of Alabama at Birmingham    0.9999
    Amridge University                     1.0000
    University of Alabama in Huntsville    1.0000
    Alabama State University               1.0000
    dtype: float64
    
  5. To get an idea of the distribution of each column, the .median method can be used:
    >>> college_ugds.median(axis="index")
    UGDS_WHITE    0.55570
    UGDS_BLACK    0.10005
    UGDS_HISP     0.07140
    UGDS_ASIAN    0.01290
    UGDS_AIAN     0.00260
    UGDS_NHPI     0.00000
    UGDS_2MOR     0.01750
    UGDS_NRA      0.00000
    UGDS_UNKN     0.01430
    dtype: float64
    

How it works...

The direction of operation on the axis is one of the more confusing aspects of pandas. Many pandas users have difficulty remembering the meaning of the axis parameter. I remember them by reminding myself that a Series only has one axis, the index (or 0). A DataFrame also has an index (axis 0) and columns (axis 1).

There's more...

The .cumsum method with axis=1 accumulates the race percentages across each row. It gives a slightly different view of the data. For example, it is very easy to see the exact percentage of white and black students for each school:

>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9686  ...    0.9862     1.0000
Universit...      0.5922      0.8522  ...    0.9899     0.9999
Amridge U...      0.2990      0.7182  ...    0.7285     1.0000
Universit...      0.6988      0.8243  ...    0.9650     1.0000
Alabama S...      0.0158      0.9366  ...    0.9863     1.0000
Previous PageNext Page
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020Publisher: PacktISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

author image
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou