You're reading from The Pandas Workshop

Product type: Book
Published in: Jun 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800208933
Edition: 1st Edition
Authors (4):
Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries from government R&D to startups to $1B public companies. His experience focuses on analytics including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/Tensorflow, and AWS & Azure machine learning services. As a machine learning consultant, he has developed and deployed actual ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a Data Scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and also a Lecturer for Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from ML to Business Intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of "The Applied Artificial Intelligence Workshop" published by Packt.


Chapter 6: Data Selection – Series

In this chapter, you'll use most of the methods you've learned about for DataFrames to select data from a pandas Series.

By the end of this chapter, you will understand the Series index, know how to apply the dot, bracket, and extended indexing methods, and be able to use .loc[] and .iloc[] to select data from a Series.

In this chapter, we will cover the following topics:

  • Introduction to pandas Series
  • The Series index
  • Data selection in pandas Series
  • Preparing Series from DataFrames and vice versa
  • Activity 6.01 – Series data selection
  • Understanding the differences between base Python and pandas data selection
  • Activity 6.02 – DataFrame data selection

Introduction to pandas Series

In Chapter 5, Data Selection – DataFrames, we introduced several ways to select data from pandas DataFrames. Although a pandas Series can be thought of as a single column of a DataFrame, it is a separate data structure. In this chapter, we will look in detail at how to select data from a Series. The key methods, .loc[] and .iloc[], still apply to a one-dimensional Series, as do more advanced methods such as Boolean indexing and extended indexing. Because you have already used these methods on DataFrames, applying them to a Series should feel familiar. Toward the end of this chapter, we will look at how data selection differs between pandas and base Python, which will reinforce the ideas and methods you have learned. Conceptually, the same ideas we used to select elements from a DataFrame can be used to select elements from a Series....
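To make the selection methods named above concrete, here is a minimal sketch of .loc[], .iloc[], and Boolean indexing on a Series. The Series name, labels, and values are made up for illustration and do not come from the chapter's datasets.

```python
import pandas as pd

# Hypothetical readings; the labels and values are made up for illustration.
temps = pd.Series([18.5, 21.0, 19.2, 23.4],
                  index=['Mon', 'Tue', 'Wed', 'Thu'],
                  name='temp_c')

by_label = temps.loc['Tue']            # label-based selection
by_position = temps.iloc[0]            # position-based selection
label_slice = temps.loc['Tue':'Wed']   # label slices include both endpoints
warm_days = temps[temps > 20]          # Boolean indexing keeps entries where the mask is True

print(by_label, by_position)
print(label_slice)
print(warm_days)
```

Note that a label slice with .loc[] includes its end label, whereas a position slice with .iloc[] excludes its stop position, just like a Python list slice.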

The Series index

Let's say we have some monthly income data from a YouTube channel. We create a Series with some values (monthly earnings in USD) in a list, and an index of month abbreviations, also in a list, using a constructor similar to what we've used for DataFrames. Note that we can add a name for the Series using the name argument:

import pandas as pd
income = pd.Series([100, 125, 105, 111, 275, 137, 
                     99, 10, 250, 100, 175, 200],
                   index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                            'Jul', '...
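Because the listing above is cut off, the following is a separate, smaller sketch of the same constructor pattern: a list of values, a list of index labels, and the name argument. The quarterly figures and the income_usd name are hypothetical and are not the chapter's monthly data.

```python
import pandas as pd

# Hypothetical quarterly figures, not the chapter's monthly data.
quarterly = pd.Series([330, 523, 359, 475],
                      index=['Q1', 'Q2', 'Q3', 'Q4'],
                      name='income_usd')

print(quarterly.name)          # the label passed via name=
print(list(quarterly.index))   # the index labels
print(quarterly['Q2'])         # bracket access by label
```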

Preparing Series from DataFrames and vice versa

In Chapter 5, Data Selection – DataFrames, we saw examples of getting a Series by selecting a column of a DataFrame. Let's review this. You have been provided with a dataset (adapted from https://archive.ics.uci.edu/ml/datasets/Water+Treatment+Plant) from a water treatment facility and have been asked to analyze the facility's performance. The data contains various chemical measurements for the input, two settling stages, and the output, plus some performance indicators. We will begin by reading the water-treatment.csv file. Missing values in the file are converted into NaN values during the read, so after reading the data, we will use the .fillna() method to replace them with the value passed to .fillna(). We will use a value of -9999 here:

water_data = pd.read_csv('Datasets\\water-treatment.csv')
water_data.fillna(-9999, inplace = True)
water_data

Note

Please change the path of the dataset...
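As a minimal sketch of moving between the two structures, the following uses a small made-up DataFrame in place of water-treatment.csv, so the column names input_ph and output_ph are assumptions, not the real column names. It shows a column selected as a Series and two ways to get back to a one-column DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the water treatment data; column names are made up.
plant = pd.DataFrame({'input_ph': [7.8, None, 7.6],
                      'output_ph': [7.5, 7.4, None]})
plant = plant.fillna(-9999)   # same sentinel value as in the text

input_ph = plant['input_ph']        # single brackets return a Series
as_frame = input_ph.to_frame()      # back to a one-column DataFrame
also_frame = plant[['input_ph']]    # double brackets keep a DataFrame

print(type(input_ph).__name__, type(as_frame).__name__, type(also_frame).__name__)
```

The Series keeps the column name as its name attribute, which is why .to_frame() can restore the original column label.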

Activity 6.01 – Series data selection

In this activity, you will read and analyze US population data for large cities for the years 2010 and 2019. The goal is to determine how population growth in the three largest cities compares with growth across the 20 largest cities from 2010 to 2019. To do this, you must compute the population of the three largest cities for 2010 and 2019, as well as the population of the 20 largest cities for both years. Using these values, you can compute the growth rates and compare them.

Follow these steps to complete this activity:

  1. For this activity, all you will need is the pandas library. Load it into the first cell of the notebook.
  2. Read in a pandas Series from the US_Census_SUB-IP-EST2019-ANNRNK_top_20_2010.csv file. This data is from the US Census Bureau (source: https://www2.census.gov/programs-surveys/popest/datasets/2010/2010-eval-estimates/). The city names are in the first column, so read them so that they are used as the indexes. List the resulting...
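One possible way to read a CSV file into a Series with the first column as the index is sketched below. It is not the activity's solution: the in-memory CSV text, city names, and population values are all made up so that the example runs without the census file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the census file: made-up city names and populations.
csv_text = """city,population
City A,900
City B,500
City C,300
City D,200
City E,100
"""

# index_col=0 uses the first column as the index; squeeze turns the
# remaining single-column DataFrame into a Series.
pop = pd.read_csv(io.StringIO(csv_text), index_col=0).squeeze('columns')

top_three_total = pop.nlargest(3).sum()   # total for the three largest values
overall_total = pop.sum()                 # total for every city in the Series

print(type(pop).__name__, top_three_total, overall_total)
```

With one such Series per year, a growth rate can be computed as the later total minus the earlier total, divided by the earlier total.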

Understanding the differences between base Python and pandas data selection

For the most part, once you have learned a bit of pandas notation for slicing and indexing, pandas objects work nearly transparently with core Python. Since the indexing of some different object types looks similar, here, we'll touch on some of the differences so that you can avoid surprises in the future.

Lists versus Series access

Python lists look superficially like Series. When you're using bracket notation to index a Series, it works much the same way as indexing a list. Here, we make a simple list using the range() function, then print out 11 values within the list:

my_list = list(range(100))
print(my_list[12:23])

This will produce the following output:

[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

Now, let's attempt the same thing, but using .iloc[]:

print(my_list...
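Since the listing above is cut off, here is a separate minimal sketch of the contrast, assuming the same my_list. The my_series, list_has_iloc, and series_slice names are introduced only for illustration. A plain Python list has no .iloc attribute, while a Series built from the list supports .iloc[] with the same slice semantics.

```python
import pandas as pd

my_list = list(range(100))
my_series = pd.Series(my_list)

# A plain list has no .iloc attribute, so my_list.iloc[...] would raise AttributeError.
list_has_iloc = hasattr(my_list, 'iloc')

# .iloc[] slices by position and, like a list slice, excludes the stop position.
series_slice = my_series.iloc[12:23]

print(list_has_iloc)
print(series_slice.tolist() == my_list[12:23])
```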

Activity 6.02 – DataFrame data selection

In this activity, you need to analyze data from this year's survey of abalone for the National Marine Fisheries Service (the source data can be found in the UCI repository: https://archive.ics.uci.edu/ml/datasets/abalone). In particular, you want to get some summary values for the dimensions of male and female samples in the data, depending on the number of rings in the abalone shells. The ring count is a measure of age, and comparing this data with previous years helps you understand the health of the population. The data contains several measurements for each sample, including sex, length, diameter, weight, shell weight, and the number of rings.

To complete this activity, follow these steps:

  1. For this activity, all you will need is the pandas library. Load it into the first cell of the notebook.
  2. Read the abalone.csv file into a DataFrame called abalone and view the first five rows.
  3. Create a...
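The remaining steps are not shown here, so the following is only a sketch of the kind of selection the activity describes: picking samples by sex and ring count and summarizing a dimension. The rows and the Sex, Length, and Rings column names are assumptions; the real abalone.csv layout may differ.

```python
import pandas as pd

# Hypothetical rows and column names; the real abalone.csv layout may differ.
abalone = pd.DataFrame({
    'Sex': ['M', 'F', 'M', 'F', 'I'],
    'Length': [0.45, 0.53, 0.44, 0.55, 0.33],
    'Rings': [10, 10, 9, 10, 7],
})

# Combine two Boolean conditions with & (each wrapped in parentheses).
female_10 = abalone.loc[(abalone['Sex'] == 'F') & (abalone['Rings'] == 10)]
mean_length = female_10['Length'].mean()

print(abalone.head())   # view the first five rows
print(round(mean_length, 2))
```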

Summary

In this chapter, we have learned about the pandas methods for indexing and selecting data in a Series. We compared the Series.loc[] and Series.iloc[] indexers for accessing items in a Series by labels and integer locations, respectively. We also used pandas shortcut methods, including bracket notation and extended indexing. We saw that most methods for DataFrames work similarly and intuitively for a pandas Series, and we highlighted a few key differences. After covering indexes and how to access them, we illustrated the differences between pandas objects and core Python data structures such as lists and dictionaries, as well as some things to keep in mind when mixing pandas and core Python.

At this point, you should be comfortable working with pandas data access as well as understand the common pitfalls and workarounds. With these tools in hand, you are ready to tackle data projects of any complexity. In the next chapter, Chapter 7, Data Transformation, you will apply some of these methods...
