You're reading from The Pandas Workshop

Product type: Book

Published in: Jun 2022

Reading level: Beginner

Publisher: Packt

ISBN-13: 9781800208933

Edition: 1st Edition

Languages: Python

Tools: NumPy, Pandas

Concepts: Data Science

Authors (4):

Blaine Bateman

Saikat Basak

Thomas V. Joseph

William So


Chapter 5: Data Selection – DataFrames

In this chapter, you will develop an understanding of the different forms of the pandas index, and how the index is involved in slicing, which is one way to get a subset of a pandas data structure. You will learn how to manipulate the index itself, as well as the different notations pandas provides for selection.

By the end of this chapter, you will be able to select subsets of data and work with the index efficiently. You will also learn how to use pandas dot, bracket, .loc, and .iloc notation to slice and index.

This chapter covers the following topics:

Introduction to DataFrames
Data selection in pandas DataFrames
Activity 5.01 – Creating a multi-index from columns
Bracket and dot notation
Changing DataFrame values using bracket or dot notation

Introduction to DataFrames

Imagine that you're working on a dataset that contains hundreds of columns and thousands of rows, of which only a small subset – say a dozen rows and two or three columns – matter to you for a particular analysis. In such cases, it's better to isolate and focus on those rows and columns rather than working with the entire dataset. In data analysis and data science, you will constantly need to work with a subset of a larger dataset. Thankfully, pandas provides selection methods that make this process easy and efficient. You will learn about these methods in this chapter. We will start by revisiting DataFrames and then see how pandas selection methods apply to DataFrames.

So far in this book, you have learned about the basics of the pandas data structures (Chapter 2, Data Structures), how to get data in or out of pandas (Chapter 3, Data I/O), and the different data types in pandas (Chapter 4, Data Types). Now, it is time to integrate...

Data selection in pandas DataFrames

In Chapter 2, Data Structures, we studied the two core pandas data structures, DataFrames and Series. There, we did some very basic data selection without digging into the details of how it works. In this section, we will take a deeper dive and explore the index, which is fundamental to many pandas operations.

As you may recall when we introduced the idea of DataFrames, we drew analogies to spreadsheets. Let's revisit that analogy. Here is the same figure from Chapter 2, Data Structures (which is the data from Figure 5.1 but in a spreadsheet):

Figure 5.2 – The industry GDP data in a spreadsheet

Here, we can see the same three columns of data that were shown in Figure 5.1, but we have annotated the key differences. In pandas, the standard row index starts at 0, while for most spreadsheets, it starts at row 1. This "0 indexing" is standard for Python. An index in pandas is a series of numbers or strings...
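As a small illustration of the zero-based default index, the following sketch builds a DataFrame from made-up values (the column names and numbers are assumptions for illustration, not the book's GDP data) and then swaps the numeric index for a string index:

```python
import pandas as pd

# Hypothetical data for illustration only; not the book's GDP dataset
df = pd.DataFrame({
    "Industry": ["Mining", "Utilities", "Construction"],
    "GDP": [100.0, 200.0, 300.0],
})

# With no index specified, pandas creates a RangeIndex that starts at 0
print(df.index)            # RangeIndex(start=0, stop=3, step=1)
first_label = df.index[0]  # 0, whereas a spreadsheet would start at row 1

# The index can also hold strings, for example by promoting a column
by_industry = df.set_index("Industry")
index_labels = by_industry.index.tolist()

# .loc selects by label and .iloc by integer position; both reach the same cell here
gdp_by_label = by_industry.loc["Utilities", "GDP"]
gdp_by_position = by_industry.iloc[1, 0]
```

Note that the positional lookup still counts from 0 even after the index labels become strings.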

Activity 5.01 – Creating a multi-index from columns

In this activity, you will read in a DataFrame from a file and then use some of the columns to create a sorted multi-index. Suppose you have been given a .csv file containing data about mushrooms, which, as you understand it, contains a classification of edible or poisonous mushrooms, as well as many visual features to allow them to be identified. Since you are a mushroom hunting enthusiast, you are very interested in analyzing and summarizing the data. You begin by reading the data in. Let's get started:

For this activity, all you will need is the pandas library. Load it into the first cell of the notebook.
Read in the mushroom.csv data from the Datasets directory and list the first five rows using .head().
You will see the class column and many visible attributes. List all the columns to see what else there is to work with. The result should be as follows:

Figure 5.32 –...
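One possible shape for the steps above is sketched here. Because the real file is not available in this excerpt, the sketch reads a tiny made-up CSV from memory instead of mushroom.csv; the column names, the values, and the choice of which columns go into the multi-index are all assumptions, not the activity's official solution.

```python
import io
import pandas as pd

# Stand-in for Datasets/mushroom.csv: a made-up sample with assumed column names
csv_text = """class,cap-shape,cap-color,odor
p,x,n,p
e,x,y,a
e,b,w,l
p,x,w,p
"""
mushrooms = pd.read_csv(io.StringIO(csv_text))

# Inspect the first rows and list every column
print(mushrooms.head())
print(list(mushrooms.columns))

# Move two columns into the row index, then sort by the index levels
indexed = mushrooms.set_index(["class", "cap-shape"]).sort_index()
print(indexed)
```

With the real data, `pd.read_csv` would take the path to the file instead of the in-memory buffer, and the list passed to `set_index` would name whichever columns the activity asks for.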

Bracket and dot notation

In the previous section, we focused on the DataFrame.loc indexer. pandas offers two other ways to select data: using just brackets, [], and using what is called dot notation (pandas also refers to the latter as attribute access, since object.name is Python syntax for accessing the name attribute of object).
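To make the comparison concrete, here is a minimal sketch using made-up data (the column names and values are assumptions). It shows that both notations return the same column, and that dot notation is only available when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute or method.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({"Industry": ["Mining", "Utilities"], "GDP": [100.0, 200.0]})

# Bracket notation and dot notation (attribute access) return the same column
gdp_bracket = df["GDP"]
gdp_dot = df.GDP
same_result = gdp_bracket.equals(gdp_dot)

# Names with spaces, or names that clash with DataFrame methods, need brackets
df2 = pd.DataFrame({"GDP value": [1, 2], "count": [3, 4]})
spaced = df2["GDP value"]   # not reachable with dot notation
counts = df2["count"]       # df2.count refers to the DataFrame.count method
```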

Bracket notation

We have already introduced one form of bracket notation, which is using a column name inside brackets. There are several ways to apply bracket notation to a DataFrame, as follows:

Select entire columns: DataFrame['column_name'] or DataFrame[[list of column names]]. If a single column name is passed, the result is a Series; if a list of names is passed, the result is a DataFrame. If an additional selection results in only one row, the result can be a Series. Also, if the DataFrame only contains one row, selecting one column still returns a Series of length one rather than a single value.
Selecting a range of rows: DataFrame[start:end...
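The bracket forms described above can be tried on a small example. This sketch uses made-up data (the column names and values are assumptions) and only covers the column-selection and row-range forms named in the list:

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Industry": ["Mining", "Utilities", "Construction"],
    "Year": [2015, 2015, 2015],
    "GDP": [100.0, 200.0, 300.0],
})

one_column = df["GDP"]                  # single name -> Series
two_columns = df[["Industry", "GDP"]]   # list of names -> DataFrame
one_in_list = df[["GDP"]]               # list with one name -> still a DataFrame

# A slice inside brackets selects rows; the end position is excluded
row_range = df[0:2]

# A one-row DataFrame still yields a Series when one column is selected
single_row = df[0:1]
still_series = single_row["GDP"]
```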

Changing DataFrame values using bracket or dot notation

Many of the methods we've discussed can be used to change the values in a DataFrame, as well as select slices or ranges. In the following screenshot, we can see the GDP data that we have been working with for 2015:

Figure 5.50 – The new GDP_2015 DataFrame

Now, suppose that as part of the economic analysis, we want to increase all the GDP values by 5,000. We can do this by selecting the GDP column using bracket notation on the left, and then doing the same and adding 5,000 on the right:

GDP_2015['GDP'] = GDP_2015['GDP'] + 5000

GDP_2015

This will produce the following output:

Figure 5.51 – The GDP_2015 DataFrame with every value in the GDP column increased by 5,000

Here, we can see the expected result – that is, all our GDP figures have been increased by 5,000. Thus, using bracket notation, we can choose where new data goes into...
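For readers without the dataset at hand, the same assignment can be reproduced on a stand-in DataFrame. The values below are made up for illustration and are not the book's GDP figures; the second assignment shows that dot notation can also appear on the left-hand side when the column already exists.

```python
import pandas as pd

# Made-up stand-in for the GDP_2015 DataFrame
GDP_2015 = pd.DataFrame({
    "Industry": ["Mining", "Utilities"],
    "GDP": [100000.0, 200000.0],
})

# Bracket notation on both sides: add 5,000 to every GDP value
GDP_2015["GDP"] = GDP_2015["GDP"] + 5000
after_increase = GDP_2015["GDP"].tolist()

# Dot notation on the left works for an existing column; here it undoes the change
GDP_2015.GDP = GDP_2015.GDP - 5000
after_undo = GDP_2015["GDP"].tolist()
```

Brackets are the safer choice when creating a new column, since assigning to a name that is not already a column through dot notation sets a plain Python attribute instead of adding a column.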

Summary

In this chapter, you learned about the pandas methods for data indexing and selection by using the primary pandas data structure – the DataFrame. You compared the DataFrame.loc and DataFrame.iloc indexers, which access items in DataFrames by labels and integer locations, respectively. You also looked at some pandas shortcut methods, including bracket notation, dot notation, and extended indexing. Along the way, you saw how the pandas index is used behind the scenes to align data, and how that can be changed by changing or resetting the index. In addition, we showed you that in many cases, you can assign new values to a subset of data by using it on the left-hand side of an assignment statement (using the = operator). This creates a very compact and easy-to-read coding style. We saw that an important pandas capability, using labels for the row or column index, produces more robust code – instead of "hardcoding" the column numbers, they...


Authors (4)

Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries from government R&D to startups to $1B public companies. His experience focuses on analytics including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/Tensorflow, and AWS & Azure machine learning services. As a machine learning consultant, he has developed and deployed actual ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from ML to Business Intelligence, helping stakeholders derive valuable insights and achieve amazing results that benefit the business. William is a co-author of "The Applied Artificial Intelligence Workshop", published by Packt.
