
You're reading from Pandas 1.x Cookbook - Second Edition

Product type: Book
Published in: Feb 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781839213106
Edition: 2nd Edition
Authors (2):
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.

Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.


Introduction

Every dimension of data in a Series or DataFrame is labeled by an Index object. It is this Index that separates pandas data structures from NumPy's n-dimensional array. Indexes provide meaningful labels for each row and column of data, and pandas users can select data by these labels. Additionally, pandas allows its users to select data by the integer position of the rows and columns. This dual selection capability, one by label and the other by position, makes for a powerful yet potentially confusing syntax for selecting subsets of data.

Selecting data by label or position is not unique to pandas. Python dictionaries and lists are built-in data structures that select their data in exactly one of these ways. Both dictionaries and lists have precise instructions and limited use cases for what you can index with. A dictionary's key (its label) must be an immutable object, such as a string, integer, or tuple. Lists must either use integers...
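
To make the comparison concrete, here is a minimal sketch of label-based lookup in a dictionary and position-based lookup in a list. All names and values are made up for illustration.

```python
# Label-based lookup: a dictionary maps immutable keys to values.
# Strings, integers, and tuples are all valid keys.
ages = {"alice": 31, 7: "seven", ("x", 1): 0}
label_pick = ages["alice"]
tuple_pick = ages[("x", 1)]

# Position-based lookup: a list is indexed by integers or slices.
letters = ["a", "b", "c", "d"]
position_pick = letters[1]
slice_pick = letters[1:3]  # the stop position is excluded

# A mutable object such as a list cannot be used as a dictionary key.
try:
    {["a"]: 1}
except TypeError as exc:
    unhashable_error = str(exc)
```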

Selecting Series data

Series and DataFrames are complex data containers with several attributes that use an index operation to select data in different ways. In addition to the index operator itself, the .iloc and .loc attributes are available, and each uses the index operator in its own way.

Series and DataFrames allow selection by position (like Python lists) and by label (like Python dictionaries). Indexing with the .iloc attribute selects only by position and works similarly to Python lists. The .loc attribute selects only by index label, which is similar to how Python dictionaries work.

The .loc and .iloc attributes are available on both Series and DataFrames. This recipe shows how to select Series data by position with .iloc and by label with .loc. These indexers accept scalar values, lists, and slices.
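
As a rough illustration of the three kinds of input, the sketch below uses a small made-up Series (not the dataset from the recipe) and selects from it with a scalar, a list, and a slice through both indexers.

```python
import pandas as pd

# Hypothetical Series with string labels, used only for illustration.
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# .iloc selects strictly by integer position.
by_pos_scalar = s.iloc[1]        # a single value
by_pos_list = s.iloc[[0, 2]]     # a Series with the first and third items
by_pos_slice = s.iloc[1:3]       # stop position excluded, like a list

# .loc selects strictly by index label.
by_label_scalar = s.loc["b"]
by_label_list = s.loc[["a", "c"]]
by_label_slice = s.loc["b":"c"]  # label slices include both endpoints
```

Note that a scalar returns a single value, while a list or slice returns a Series. The position slice excludes its stop, whereas the label slice includes it, so both slices above select the same two items.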

The terminology can get confusing. An index operation is when you put brackets, [], following a variable. For instance, given a Series s,...

Selecting DataFrame rows

The most explicit and preferred way to select DataFrame rows is with .iloc and .loc. They are both capable of selecting by rows or by rows and columns.

This recipe shows you how to select rows from a DataFrame using the .iloc and .loc indexers:

  1. Read in the college dataset, and set the index as the institution name:
    >>> import pandas as pd
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college.sample(5, random_state=42)
                         CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    INSTNM                            ...
    Career Po...  San Antonio     TX  ...        20700            14977
    Ner Israe...    Baltimore     MD  ...  PrivacyS...      PrivacyS...
    Reflectio...      Decatur     IL  ...          NaN      PrivacyS...
    Capital A...  Baton Rouge     LA  ...        26400      PrivacyS...
    West Virg...   Montgomery     WV  ...        43400            23969
    <BLANKLINE>...
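
Since the remaining steps are not shown here, the following is only a sketch of row selection with both indexers. It assumes a tiny made-up DataFrame indexed by a hypothetical institution name rather than the real college dataset.

```python
import pandas as pd

# Made-up stand-in for the college data; names and values are illustrative.
df = pd.DataFrame(
    {
        "CITY": ["Normal", "Birmingham", "Montgomery"],
        "STABBR": ["AL", "AL", "AL"],
        "UGDS": [4206.0, 11383.0, 291.0],
    },
    index=pd.Index(["School A", "School B", "School C"], name="INSTNM"),
)

# A scalar selects one row and returns it as a Series.
row_by_pos = df.iloc[1]
row_by_label = df.loc["School B"]

# A list selects several rows and returns a DataFrame.
rows_by_pos = df.iloc[[0, 2]]
rows_by_label = df.loc[["School A", "School C"]]

# Slices work too; the label slice includes its end label.
slice_by_pos = df.iloc[0:2]
slice_by_label = df.loc["School A":"School B"]
```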

Selecting DataFrame rows and columns simultaneously

There are many ways to select rows and columns. The easiest method to select one or more columns from a DataFrame is to index off of the DataFrame. However, this approach has a limitation. Indexing directly on a DataFrame does not allow you to select both rows and columns simultaneously. To select rows and columns, you will need to pass both valid row and column selections separated by a comma to either .iloc or .loc.

The generic form to select rows and columns will look like the following code:

df.iloc[row_idxs, column_idxs]
df.loc[row_names, column_names]

Here, row_idxs and column_idxs can be scalar integers, lists of integers, or integer slices, while row_names and column_names can be scalar names, lists of names, or slices of names. In addition, row_names can be a Boolean array.

In this recipe, each step shows a simultaneous row and column selection using both .iloc and .loc.
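
The generic forms above can be sketched on a small made-up DataFrame. The column and row names here are hypothetical stand-ins, not the recipe's actual steps.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame(
    {
        "CITY": ["Normal", "Birmingham", "Montgomery"],
        "STABBR": ["AL", "AL", "AL"],
        "UGDS": [4206.0, 11383.0, 291.0],
    },
    index=pd.Index(["School A", "School B", "School C"], name="INSTNM"),
)

# Scalar row and scalar column: a single cell.
cell_by_pos = df.iloc[1, 0]
cell_by_label = df.loc["School B", "CITY"]

# Row slice with a list of columns: a smaller DataFrame.
block_by_pos = df.iloc[:2, [0, 2]]
block_by_label = df.loc[:"School B", ["CITY", "UGDS"]]

# A Boolean array for the rows combined with a column slice in .loc.
mask = df["UGDS"] > 1000
large_schools = df.loc[mask, "CITY":"STABBR"]
```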

How to do it…

...

Selecting data with both integers and labels

Sometimes you want the functionality of both .iloc and .loc so that you can select data by position and by label at once. Earlier versions of pandas provided the .ix indexer for this purpose. While it worked conveniently in those specific situations, it was ambiguous by nature and confused many pandas users. The .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0, so it is no longer available.

Before the .ix deprecation, it was possible to select the first five rows and the columns of the college dataset from UGDS_WHITE through UGDS_UNKN using college.ix[:5, 'UGDS_WHITE':'UGDS_UNKN']. This is now impossible to do directly using .loc or .iloc. The following recipe shows how to find the integer location of the columns and then use .iloc to complete the selection.
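
A minimal sketch of that idea follows, assuming a small made-up DataFrame that borrows the UGDS_WHITE and UGDS_UNKN column names. The other columns and all values are hypothetical, and only two rows are selected because the frame is tiny.

```python
import pandas as pd

# Made-up frame; only the two UGDS_* boundary names come from the text.
df = pd.DataFrame(
    {
        "CITY": ["Normal", "Birmingham", "Montgomery"],
        "UGDS_WHITE": [0.03, 0.59, 0.30],
        "UGDS_BLACK": [0.94, 0.26, 0.42],
        "UGDS_UNKN": [0.01, 0.01, 0.27],
        "PPTUG_EF": [0.07, 0.26, 0.45],
    },
    index=pd.Index(["School A", "School B", "School C"], name="INSTNM"),
)

# Translate the column labels into integer positions.
start = df.columns.get_loc("UGDS_WHITE")
stop = df.columns.get_loc("UGDS_UNKN") + 1  # +1 so the end label is included

# Rows by position, columns by the positions found above.
subset = df.iloc[:2, start:stop]

# An alternative is to chain the two indexers, at the cost of two selections.
chained = df.iloc[:2].loc[:, "UGDS_WHITE":"UGDS_UNKN"]
```

Index.get_loc returns a plain integer only when the label is unique in the columns, which this sketch assumes.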

How to do it…

  1. Read in the college dataset and assign the institution name (INSTNM) as the index:
    >...

Slicing lexicographically

The .loc attribute typically selects data based on the exact string label of the index. However, it can also select data based on the lexicographic order of the index values. Specifically, slice notation inside .loc selects all rows whose index labels fall lexicographically between two strings. This only works if the index is sorted.

In this recipe, you will first sort the index and then use slice notation inside the .loc indexer to select all rows between two strings.

How to do it…

  1. Read in the college dataset, and set the institution name as the index:
    >>> import pandas as pd
    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    
  2. Attempt to select all colleges with names lexicographically between Sp and Su:
    >>> college.loc["Sp":"Su"]
    Traceback (most recent call last):
      ...
    ValueError: index must be monotonic increasing or decreasing
    During handling...
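
Because the rest of the recipe is not shown, here is a sketch of the sort-then-slice approach on a handful of illustrative names. It assumes a made-up DataFrame, and the ORIG_POS column is a hypothetical placeholder.

```python
import pandas as pd

# Illustrative index labels in unsorted order.
names = ["Stanford", "Spelman", "Auburn", "Syracuse", "Spring Hill", "St Olaf"]
df = pd.DataFrame(
    {"ORIG_POS": range(len(names))},
    index=pd.Index(names, name="INSTNM"),
)

# The unsorted index is not monotonic, so lexicographic slicing is unsafe.
is_sorted_before = df.index.is_monotonic_increasing

# Sort the index first, then slice between two strings with .loc.
df_sorted = df.sort_index()
is_sorted_after = df_sorted.index.is_monotonic_increasing
between = df_sorted.loc["Sp":"Su"]
```

Neither "Sp" nor "Su" has to exist as a label. The slice keeps every label that sorts between them, which here excludes "Auburn" and "Syracuse".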