The Pandas Workshop

Product type: Book
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781800208933
Pages: 744
Edition: 1st Edition
Authors (4):
  • Blaine Bateman
  • Saikat Basak
  • Thomas V. Joseph
  • William So

Table of Contents (21) Chapters

Preface
Part 1 – Introduction to pandas
  • Chapter 1: Introduction to pandas
  • Chapter 2: Working with Data Structures
  • Chapter 3: Data I/O
  • Chapter 4: Pandas Data Types
Part 2 – Working with Data
  • Chapter 5: Data Selection – DataFrames
  • Chapter 6: Data Selection – Series
  • Chapter 7: Data Exploration and Transformation
  • Chapter 8: Understanding Data Visualization
Part 3 – Data Modeling
  • Chapter 9: Data Modeling – Preprocessing
  • Chapter 10: Data Modeling – Modeling Basics
  • Chapter 11: Data Modeling – Regression Modeling
Part 4 – Additional Use Cases for pandas
  • Chapter 12: Using Time in pandas
  • Chapter 13: Exploring Time Series
  • Chapter 14: Applying pandas Data Processing for Case Studies
  • Chapter 15: Appendix
Other Books You May Enjoy

Chapter 2: Working with Data Structures

This chapter introduces you to the core pandas data structures: DataFrames and Series. You will first create both data structures from scratch and store them as CSV files, and then load the same data structures back from CSV files. You will also learn how to manipulate the row indexes and columns of pandas DataFrames and Series, including how to convert a column into a new index. By the end of this chapter, you will be adept at manipulating pandas Series and DataFrames in Python.

This chapter covers the following topics:

  • The need for data structures
  • Exploring indexes and columns
  • Working with pandas Series

Introduction to data structures

Data structures are fundamental to computer programming languages. In Python, the core data structures are lists, sets, tuples, and dictionaries. When working in a programming environment, data structures are an abstraction that helps keep track of data, manipulate it, or change it. They also help pass large collections of data as single objects, such as sending an entire Python dictionary to a function. However, organized collections of data can be much more complex, often comprising numerous rows and columns. In this chapter, you will learn about the data structures in pandas that help you deal with such collections of data more effectively. You will dive deeper into the inner workings of these structures and discover how you can use them to accomplish your goals efficiently in Python.
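As a small illustration of passing a collection of data as a single object, the sketch below sends an entire dictionary to a function. The function name, keys, and values are hypothetical and chosen only for this example.

```python
# Hypothetical example data: the names and values are illustrative only.
def summarize(record):
    """Return a one-line description built from a dictionary."""
    scores = record["scores"]
    return f"{record['name']} has {len(scores)} scores, max {max(scores)}"


student = {"name": "Ada", "scores": [88, 92, 79]}  # one object holding related data
summary = summarize(student)  # the whole dictionary is passed as a single argument
print(summary)
```

The function receives one argument yet has access to every value stored in the dictionary, which is the convenience the paragraph above describes.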

In Chapter 1, Introduction to pandas, you were introduced to the ideas that led to the creation of pandas and basic concepts, such as DataFrames and Series. There...

The need for data structures

Suppose you are working with quarterly gross domestic product (GDP) data for the US. A natural way to think about the data and work with it is to arrange it in a table. For example, you might view the data in spreadsheet software, as shown here:

Figure 2.1 – Tabular data

In Figure 2.1, you see two columns of data. The spreadsheet software has labeled the columns with letters and the rows with numbers. In addition, the column names representing the data (date, GDP) are present in the first row.

The table shown in Figure 2.1 is a data structure. Having this data in two columns makes it easier to understand and work with. However, in the spreadsheet, it's complicated to work with the data as a single object (a table). This is where pandas gives you an edge over the core Python data structures (and over spreadsheets). As you saw in Chapter 1, Introduction to pandas, in pandas you can refer to the entire dataset...
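To make the idea of a table as a single object concrete, here is a minimal sketch that builds a two-column DataFrame with the date and GDP column names used in Figure 2.1. The variable name gdp_table and all of the values are placeholders assumed for illustration; they are not real GDP figures.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT real GDP figures.
gdp_table = pd.DataFrame(
    {
        "date": ["2017-01-01", "2017-04-01", "2017-07-01", "2017-10-01"],
        "GDP": [100.0, 101.5, 103.0, 104.5],
    }
)

# The whole table is one object that can be inspected or passed around by name.
print(gdp_table.shape)          # (number of rows, number of columns)
print(list(gdp_table.columns))  # the column names
```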

Indexes and columns

We have already referred to indexes and columns without fully defining them. An index contains references to the rows of a DataFrame. The index of a pandas DataFrame is analogous to the row numbers you might see in a spreadsheet. In spreadsheets, it's common to use the so-called A1 notation, where A refers to the columns, which usually begin with A, and 1 refers to the rows, which usually begin with 1.

We will start by looking at the index, and continue with the sample_df_from_lists DataFrame created earlier. You can use the .index attribute to display information about the index, as follows:

sample_df_from_lists.index

This line of code produces the following output:

RangeIndex(start=0, stop=100, step=1)

You may recall that ranges in Python are inclusive of the start value and exclusive of the end value. You see that the index for sample_df_from_lists runs from 0 to 99, which matches the rows. As you will learn in detail in Chapter 5, Data Selection...
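The following sketch shows the same behavior on a stand-in DataFrame. The original sample_df_from_lists is not reproduced here, so stand_in_df and its column names are assumptions made for this example; only the row count of 100 is taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for sample_df_from_lists: 100 rows, two made-up columns.
stand_in_df = pd.DataFrame(
    {"col_a": list(range(100)), "col_b": [x * 2 for x in range(100)]}
)

row_index = stand_in_df.index        # a RangeIndex is created by default
print(row_index)
print(row_index[0], row_index[-1])   # first and last row labels
print(stand_in_df.columns)           # the column labels are also stored in an Index
```

Because the stop value is exclusive, the last row label is 99 even though the index reports stop=100.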

Series

The Series is the other fundamental pandas data structure. You can think of a DataFrame as an organized collection of Series, where each column is itself a Series. The food_cons column of the food_taste DataFrame shows this relationship. The following line of code calls the built-in type() function on the food_cons column of food_taste:

type(food_taste['food_cons'])

This generates the following output:

pandas.core.series.Series

So, every DataFrame column is a pandas Series, once separated and on its own. This would also be the case if you separated a single row from a DataFrame. Recall that you can use ? in Jupyter to get the help documentation. Try to do that and look at the first part of the Series documentation. You can use the following code to get the documentation:

?pd.Series

This provides the following output (truncated for brevity):

Figure 2.26 – The first portion of the help documentation for pandas...
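The relationship between DataFrames and Series described in this section can be checked with a short sketch. The food_taste data is not reproduced here, so food_taste_demo, the rating column, and all values are assumptions for illustration; only the food_cons column name comes from the text.

```python
import pandas as pd

# Hypothetical stand-in for food_taste; the rating column and all values are made up.
food_taste_demo = pd.DataFrame(
    {"food_cons": [1.2, 3.4, 2.2], "rating": [4, 5, 3]}
)

column = food_taste_demo["food_cons"]  # select one column by name
row = food_taste_demo.iloc[0]          # select the first row by position

print(type(column))
print(type(row))
print(row.index.tolist())  # a row Series is labeled by the column names
```

Both selections return a Series. For the row, the labels of the resulting Series are the DataFrame's column names.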

Activity 2.01 – Working with pandas data structures

In this activity, you will read a DataFrame from the US_GDP.csv file, which contains information about the GDP of the US, from the first financial quarter of 2017 to the last financial quarter of 2019. The data is stored in two columns, date and GDP, and the date is read in (by default) as the object type. The goal of this activity is to first convert the date column into a timestamp and then set this column as the index. Finally, you'll save the updated dataset to a new file:

Note

You can download the file from

  1. Import the pandas library.
  2. Read the US_GDP.csv file from the Datasets directory into a DataFrame named GDP_data. The data is stored as dates and values, and you wish to use the dates as the index, so that in future work you may apply pandas time series methods to this data.
  3. Display the head of GDP_data so that you can see the formats of the data in the file.
  4. Inspect the object types of...
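The overall flow of the activity (read the data, inspect it, convert the date column, set it as the index, and save the result) might look like the sketch below. It is not the book's solution. So that it runs without the data file, it reads from an in-memory string instead of US_GDP.csv; the CSV text, its date format, and the output buffer are assumptions, and the values are not real GDP data.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for US_GDP.csv; values are NOT real GDP data.
csv_text = "date,GDP\n2017-03-31,100.0\n2017-06-30,101.5\n2017-09-30,103.0\n"

# In the activity, the path to the file would be passed here instead of a buffer.
GDP_data = pd.read_csv(io.StringIO(csv_text))
print(GDP_data.head())
print(GDP_data.dtypes)  # the date column is read as text, not as timestamps

# Convert the date column to timestamps, then use it as the index.
GDP_data["date"] = pd.to_datetime(GDP_data["date"])
GDP_data = GDP_data.set_index("date")
print(GDP_data.index)

# Save the result; a buffer is used here, and a file path works the same way.
out_buffer = io.StringIO()
GDP_data.to_csv(out_buffer)
print(out_buffer.getvalue())
```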

Summary

In this chapter, you were introduced to the two fundamental pandas data structures, DataFrames and Series, along with the basic concepts of the pandas index. With the help of some basic I/O functions, such as read_csv() and to_csv(), you saw how pandas makes it easy to read data from files into DataFrames and Series and to write them back out. To illustrate the ideas, a few pandas methods were introduced in the chapter. You learned about set_index() and the use of timestamps as an index, and used resample(), a pandas time series method that can change the time interval of data, as well as concat(), which combines pandas data structures into other structures.
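As a brief recap of resample() and concat(), here is a minimal sketch. The variable names and values are placeholders assumed for illustration, and the "YS" (year start) frequency is just one possible choice of interval.

```python
import pandas as pd

# Placeholder quarterly values for illustration only; NOT real data.
quarters = pd.to_datetime(
    [
        "2017-01-01", "2017-04-01", "2017-07-01", "2017-10-01",
        "2018-01-01", "2018-04-01", "2018-07-01", "2018-10-01",
    ]
)
quarterly = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], index=quarters, name="value"
)

# resample() regroups time-indexed data into new intervals; "YS" means year start.
annual_mean = quarterly.resample("YS").mean()
print(annual_mean)

# concat() combines pandas objects; axis=1 places the Series side by side as columns.
doubled = (quarterly * 2).rename("value_doubled")
combined = pd.concat([quarterly, doubled], axis=1)
print(combined.shape)
```

Here, eight quarterly values are reduced to two annual means, and two Series that share an index are combined into one DataFrame with two columns.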

By now, you should be comfortable with the concepts of DataFrames and Series. The rest of the chapters in this book will build upon these concepts. In the next chapter, you will learn about data I/O using pandas.
