You're reading from The Pandas Workshop

Product type: Book
Published in: Jun 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800208933
Edition: 1st Edition
Authors (4):
Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries, from government R&D to startups to $1B public companies. His experience focuses on analytics, including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/TensorFlow, and AWS and Azure machine learning services. As a machine learning consultant, he has developed and deployed ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a data scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from ML to business intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of "The Applied Artificial Intelligence Workshop", published by Packt.

Chapter 5: Data Selection – DataFrames

In this chapter, you will develop an understanding of the different forms of the pandas index, and how the index is involved in slicing, which is one way to get a subset of a pandas data structure. You will learn how to manipulate the index itself, as well as the different notations pandas provides for selection.

By the end of this chapter, you will be able to select subsets of data and work with the index efficiently. You will also learn how to use the pandas dot, bracket, .loc[], and .iloc[] notations to slice and index.

This chapter covers the following topics:

  • Introduction to DataFrames
  • Data selection in pandas DataFrames
  • Activity 5.01 – Creating a multi-index from columns
  • Bracket and dot notation
  • Changing DataFrame values using bracket or dot notation

Introduction to DataFrames

Imagine that you're working on a dataset that contains hundreds of columns and thousands of rows, of which only a small subset – say a dozen rows and two or three columns – matter to you for a particular analysis. In such cases, it's better to isolate and focus on those rows and columns rather than working with the entire dataset. In data analysis and data science, you will constantly need to work with a subset of a larger dataset. Thankfully, pandas provides selection methods that make this process easy and efficient. You will learn about these methods in this chapter. We will start by revisiting DataFrames and then see how pandas selection methods apply to DataFrames.

So far in this book, you have learned about the basics of the pandas data structures (Chapter 2, Data Structures), how to get data in or out of pandas (Chapter 3, Data I/O), and the different data types in pandas (Chapter 4, Data Types). Now, it is time to integrate...

Data selection in pandas DataFrames

In Chapter 2, Data Structures, we studied the two core pandas data structures, DataFrames and Series. There, we did some very basic data selection without digging into the details of how it works. In this section, we will take a deeper dive and explore the index, which is fundamental to many pandas operations.

As you may recall, when we introduced DataFrames, we drew analogies to spreadsheets. Let's revisit that analogy. Here is the same figure from Chapter 2, Data Structures (it shows the data from Figure 5.1, but in a spreadsheet):

Figure 5.2 – The industry GDP data in a spreadsheet

Here, we can see the same three columns of data that were shown in Figure 5.1, but we have annotated the key differences. In pandas, the standard row index starts at 0, while for most spreadsheets, it starts at row 1. This "0 indexing" is standard for Python. An index in pandas is a series of numbers or strings...
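The zero-based default index described above can be checked directly. The following is a minimal sketch that uses a small, made-up DataFrame as a stand-in for the industry GDP data; the column names and values are assumptions for illustration, not the data shown in the figures.

```python
import pandas as pd

# Hypothetical stand-in for the industry GDP data; names and values are invented.
gdp = pd.DataFrame({
    "Year": [2013, 2014, 2015],
    "Industry": ["Mining", "Mining", "Mining"],
    "GDP": [100.0, 110.0, 120.0],
})

# With no index specified, pandas assigns a default index that starts at 0.
print(gdp.index)

first_by_position = gdp.iloc[0]   # first row by integer position
first_by_label = gdp.loc[0]       # first row by label, which is also 0 here

# After a column becomes the index, labels and positions no longer coincide.
by_year = gdp.set_index("Year")
row_2015 = by_year.loc[2015]      # label-based lookup
last_row = by_year.iloc[2]        # position-based lookup of the same row
```

With the default index, label 0 and position 0 refer to the same row; once Year is the index, the label 2015 and the position 2 do.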

Activity 5.01 – Creating a multi-index from columns

In this activity, you will read in a DataFrame from a file and then use some of the columns to create a sorted multi-index. Suppose you have been given a .csv file containing data about mushrooms, which, as you understand it, contains a classification of edible or poisonous mushrooms, as well as many visual features to allow them to be identified. Since you are a mushroom hunting enthusiast, you are very interested in analyzing and summarizing the data. You begin by reading the data in. Let's get started:

  1. For this activity, all you will need is the pandas library. Load it into the first cell of the notebook.
  2. Read in the mushroom.csv data from the Datasets directory and list the first five rows using .head().
  3. You will see the class column and many visible attributes. List all the columns to see what else there is to work with. The result should be as follows:

Figure 5.32 –...
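As a rough sketch of where this activity is heading, the snippet below builds a sorted multi-index from two columns. It does not read mushroom.csv; instead it uses a tiny hand-made DataFrame, and the column names, the values, and the choice of which columns go into the index are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for mushroom.csv: column names and values are
# assumptions for illustration, not the contents of the activity's file.
mushrooms = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "b", "x", "b"],
    "odor": ["f", "n", "a", "n"],
    "population": ["s", "n", "v", "s"],
})

# List the columns to see what is available.
print(list(mushrooms.columns))

# Move two columns into the row index, then sort so the levels are ordered.
indexed = mushrooms.set_index(["class", "cap-shape"]).sort_index()

# Select every row for class 'e' using the first level of the multi-index.
edible = indexed.loc["e"]
```

Sorting after set_index keeps the index levels in order, which makes selections on the first level, such as indexed.loc["e"], straightforward.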

Bracket and dot notation

In the previous section, we focused on DataFrame.loc. pandas offers two other ways to select data – using just brackets, [], and using what is called dot notation (pandas also refers to the latter as attribute access, since object.name is Python syntax for accessing the name attribute of object).
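To make the two notations concrete, here is a minimal sketch that selects the same column both ways. The DataFrame and its column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data; the column names are chosen for illustration only.
df = pd.DataFrame({
    "GDP": [100.0, 110.0],
    "Industry Name": ["Mining", "Retail"],
})

gdp_bracket = df["GDP"]   # bracket notation
gdp_dot = df.GDP          # dot notation (attribute access) for the same column

# Dot notation needs a column name that is a valid Python identifier.
# Writing df.Industry Name would be a syntax error, so use brackets instead.
names = df["Industry Name"]
```

Both forms return the same Series; brackets are the more general choice because they accept any column label.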

Bracket notation

We have already introduced one form of bracket notation, which is using a column name inside brackets. There are several ways to apply bracket notation to a DataFrame, as follows:

  • Select entire columns: DataFrame['column_name'] or DataFrame[[list of column names]]. If a single column is selected, the result is a Series; otherwise, the result is a DataFrame. If an additional selection results in only one row, the result can be a Series. Also, if the DataFrame only contains one row, selecting one column returns a Series (even though the result is a single value).
  • Selecting a range of rows: DataFrame[start:end...
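The bracket forms listed above can be sketched as follows. The DataFrame is a small hypothetical example with a default integer index, so the row slice shown here is taken by position.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Year": [2013, 2014, 2015, 2016],
    "Industry": ["Mining", "Mining", "Retail", "Retail"],
    "GDP": [100.0, 110.0, 120.0, 130.0],
})

one_col = df["GDP"]               # single column name -> Series
two_cols = df[["Year", "GDP"]]    # list of column names -> DataFrame
one_col_df = df[["GDP"]]          # one-element list -> still a DataFrame

# An integer slice inside brackets selects rows; the end position is excluded.
first_two_rows = df[0:2]

# One row and one column: the result is a Series holding a single value.
single = df[0:1]["GDP"]
```

Checking type() on each result is a quick way to confirm whether a selection returned a Series or a DataFrame.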

Changing DataFrame values using bracket or dot notation

Many of the methods we've discussed can be used to change the values in a DataFrame, as well as select slices or ranges. In the following screenshot, we can see the GDP data that we have been working with for 2015:

Figure 5.50 – The new GDP_2015 DataFrame

Now, suppose that as part of the economic analysis, we want to increase all the GDP values by 5,000. We can do this by selecting the GDP column with bracket notation on the left-hand side of the assignment, and selecting it again and adding 5,000 on the right-hand side:

GDP_2015['GDP'] = GDP_2015['GDP'] + 5000
GDP_2015

This will produce the following output:

Figure 5.51 – The GDP_2015 DataFrame with every value in the GDP column increased by 5,000

Here, we can see the expected result – that is, all our GDP figures have been increased by 5,000. Thus, using bracket notation, we can choose where new data goes into...
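For readers who want to try this without the book's dataset, the following self-contained sketch repeats the update on a made-up GDP_2015 DataFrame (the industries and values are assumptions, not the figures shown above) and then applies the same kind of update with dot notation.

```python
import pandas as pd

# Hypothetical stand-in for GDP_2015; the real values appear in Figure 5.50.
GDP_2015 = pd.DataFrame({
    "Industry": ["Mining", "Retail", "Utilities"],
    "GDP": [1000.0, 2000.0, 3000.0],
})

# Bracket notation on the left chooses where the new values go.
GDP_2015["GDP"] = GDP_2015["GDP"] + 5000
after_bracket = GDP_2015["GDP"].tolist()

# Dot notation can also update a column, provided the column already exists.
GDP_2015.GDP = GDP_2015.GDP + 5000
after_dot = GDP_2015["GDP"].tolist()
```

Dot notation is only safe here because GDP already exists as a column; to create a new column, use bracket notation.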

Summary

In this chapter, you learned about the pandas methods for data indexing and selection using the primary pandas data structure – the DataFrame. You compared DataFrame.loc and DataFrame.iloc, which access items in DataFrames by labels and integer locations, respectively. You also looked at some pandas shortcuts, including bracket notation, dot notation, and extended indexing. Along the way, you saw how the pandas index is used behind the scenes to align data, and how that behavior can be altered by changing or resetting the index. In addition, you saw that in many cases, you can assign new values to a subset of data by placing the selection on the left-hand side of an assignment statement (using the equals operator). This creates a very compact and easy-to-read coding style. We also saw that an important pandas capability, using labels for the row or column index, produces more robust code – instead of "hardcoding" the column numbers, they...
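The point about labels producing more robust code than hardcoded positions can be illustrated with a short sketch. The DataFrame below is hypothetical; the reordering stands in for any change in column layout.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "Year": [2014, 2015],
    "Industry": ["Mining", "Retail"],
    "GDP": [110.0, 120.0],
})

# Reorder the columns, as might happen when an upstream file changes layout.
reordered = df[["GDP", "Year", "Industry"]]

by_label_before = df.loc[:, "GDP"]
by_label_after = reordered.loc[:, "GDP"]     # still the GDP column

by_position_before = df.iloc[:, 2]           # GDP in the original layout
by_position_after = reordered.iloc[:, 2]     # now the Industry column
```

The label-based selection returns the same data before and after the reordering, while the position-based selection silently returns a different column.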
