
You're reading from Hands-On Data Analysis with Pandas - Second Edition

Product type: Book
Published in: Apr 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800563452
Edition: 2nd Edition
Author (1)
Stefanie Molin

Stefanie Molin is a data scientist and software engineer at Bloomberg LP in NYC, tackling tough problems in information security, particularly revolving around anomaly detection, building tools for gathering data, and knowledge sharing. She has extensive experience in data science, designing anomaly detection solutions, and utilizing machine learning in both R and Python in the AdTech and FinTech industries. She holds a B.S. in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, with minors in economics, and entrepreneurship and innovation. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.


Chapter 2: Working with Pandas DataFrames

The time has come for us to begin our journey into the pandas universe. This chapter will get us comfortable working with some of the basic, yet powerful, operations we will be performing when conducting our data analyses with pandas.

We will begin with an introduction to the main data structures we will encounter when working with pandas. Data structures provide us with a format for organizing, managing, and storing data. Knowledge of pandas data structures will prove invaluable when it comes to troubleshooting or looking up how to perform an operation on the data. Keep in mind that these data structures are different from the standard Python data structures for a reason: they were created for specific analysis tasks. We must remember that a given method may only work on a certain data structure, so we need to be able to identify the best structure for the problem we are looking to solve.

Next, we will bring our first dataset...

Chapter materials

The files we will be working with in this chapter can be found in the GitHub repository at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_02. We will be working with earthquake data from the US Geological Survey (USGS) by using the USGS API and CSV files, which can be found in the data/ directory.

There are four CSV files and a SQLite database file, all of which will be used at different points throughout this chapter. The earthquakes.csv file contains data that's been pulled from the USGS API for September 18, 2018 through October 13, 2018. For our discussion of data structures, we will work with the example_data.csv file, which contains five rows and a subset of the columns from the earthquakes.csv file. The tsunamis.csv file is a subset of the data in the earthquakes.csv file for all earthquakes that were accompanied by tsunamis during the aforementioned date range. The quakes.db file contains a SQLite database...

Pandas data structures

Python has several data structures already, such as tuples, lists, and dictionaries. Pandas provides two main structures to facilitate working with data: Series and DataFrame. The Series and DataFrame data structures each contain another pandas data structure, Index, that we must also be aware of. However, in order to understand these data structures, we need to first take a look at NumPy (https://numpy.org/doc/stable/), which provides the n-dimensional arrays that pandas builds upon.
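To make those relationships concrete, here is a minimal sketch (with made-up values, not the book's data) showing that a Series wraps a one-dimensional NumPy array, a DataFrame is a table of Series columns, and both carry an Index:

```python
import numpy as np
import pandas as pd

arr = np.array([1.5, 2.0, 3.1])      # a one-dimensional NumPy ndarray
s = pd.Series(arr, name='mag')       # a Series wraps a 1D array with an Index
df = pd.DataFrame({'mag': arr})      # a DataFrame is a 2D table of Series columns

print(s.index)                # RangeIndex(start=0, stop=3, step=1)
print(df['mag'].to_numpy())   # the underlying NumPy array: [1.5 2.  3.1]
```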

The aforementioned data structures are implemented as Python classes; when we actually create one, they are referred to as objects or instances. This is an important distinction, since, as we will see, some actions can be performed using the object itself (a method), whereas others will require that we pass our object in as an argument to some function. Note that, in Python, class names are traditionally written in CapWords, while objects are written in snake_case. (More Python...
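For example (using a throwaway dataframe of made-up values), the same object can be used both ways:

```python
import pandas as pd

# snake_case name for an instance of the CapWords DataFrame class
df = pd.DataFrame({'mag': [1.1, 2.2, 3.3]})

first_rows = df.head(2)  # a method: called on the object itself
row_count = len(df)      # a function: the object is passed in as an argument
print(type(df))          # <class 'pandas.core.frame.DataFrame'>
```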

Creating a pandas DataFrame

Now that we understand the data structures we will be working with, we can discuss the different ways we can create them. Before we dive into the code, however, it's important to know how to get help right from Python. Should we ever find ourselves unsure of how to use something in Python, we can utilize the built-in help() function. We simply run help(), passing in the package, module, class, object, method, or function whose documentation we want to read. We can, of course, look up the documentation online; however, in most cases, the docstrings (the documentation text written in the code) that help() returns will be equivalent, since they are used to generate the documentation.

Assuming we first ran import pandas as pd, we can run help(pd) to display information about the pandas package; help(pd.DataFrame) for all the methods and attributes of DataFrame objects (note we can also pass in a DataFrame object instead); and help...
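For instance (assuming pandas is installed), each of the following prints the relevant docstring:

```python
import pandas as pd

help(pd)                 # documentation for the entire pandas package
help(pd.DataFrame)       # the DataFrame class (a DataFrame instance works too)
help(pd.DataFrame.head)  # a single method
```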

Inspecting a DataFrame object

The first thing we should do when we read in our data is inspect it; we want to make sure that our dataframe isn't empty and that the rows look as we would expect. Our main goal is to verify that it was read in properly and that all the data is there; however, this initial inspection will also give us ideas with regard to where we should direct our data wrangling efforts. In this section, we will explore ways in which we can inspect our dataframes in the 4-inspecting_dataframes.ipynb notebook.

Since this is a new notebook, we must once again handle our setup. This time, we need to import pandas and numpy, as well as read in the CSV file with the earthquake data:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('data/earthquakes.csv')

Examining the data

First, we want to make sure that we actually have data in our dataframe. We can check the empty attribute to find out:

>>> df.empty
False
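Beyond the empty attribute, a handful of attributes and methods cover most first-pass checks. Here is a sketch on a small stand-in dataframe (made-up values; the real data comes from data/earthquakes.csv):

```python
import pandas as pd

# A tiny stand-in for the earthquake data.
df = pd.DataFrame({
    'mag': [1.35, 4.20, 2.00],
    'place': ['California', 'Fiji', 'Nevada'],
    'tsunami': [0, 1, 0],
})

print(df.empty)   # False -- there is data
print(df.shape)   # (3, 3) -- (number of rows, number of columns)
print(df.dtypes)  # the data type of each column
df.head(2)        # the first two rows; df.tail() shows the last ones
df.describe()     # summary statistics for the numeric columns
```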

Grabbing subsets of the data

So far, we have learned how to work with and summarize the data as a whole; however, we will often be interested in performing operations and/or analyses on subsets of our data. There are many types of subsets we may look to isolate from our data, such as selecting only specific columns or rows as a whole or when a specific criterion is met. In order to obtain subsets of the data, we need to be familiar with selection, slicing, indexing, and filtering.

For this section, we will work in the 5-subsetting_data.ipynb notebook. Our setup is as follows:

>>> import pandas as pd
>>> df = pd.read_csv('data/earthquakes.csv')

Selecting columns

In the previous section, we saw an example of column selection when we looked at the unique values in the alert column; we accessed the column as an attribute of the dataframe. Remember that a column is a Series object, so, for example, selecting the mag column in the earthquake data gives...
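The main selection patterns can be sketched as follows (again on a small dataframe of made-up values):

```python
import pandas as pd

df = pd.DataFrame({
    'mag': [1.35, 4.20, 2.00],
    'place': ['California', 'Fiji', 'Nevada'],
    'tsunami': [0, 1, 0],
})

mag = df.mag                    # attribute access -> Series
same = df['mag']                # bracket access works for any column name
subset = df[['mag', 'place']]   # a list of names -> DataFrame
big = df[df.mag >= 2.0]         # a Boolean mask filters rows
places = df.loc[df.mag >= 2.0, 'place']  # rows by mask, columns by label
```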

Adding and removing data

In the previous sections, we frequently selected a subset of the columns, but if columns/rows aren't useful to us, we should just get rid of them. We also frequently selected data based on the value of the mag column; however, if we had made a new column holding the Boolean values for later selection, we would have only needed to calculate the mask once. Very rarely will we get data where we neither want to add nor remove something.

Before we begin adding and removing data, it's important to understand that while most methods will return a new DataFrame object, some will be in-place and change our data. If we write a function where we pass in a dataframe and change it, it will change our original dataframe as well. Should we find ourselves in a situation where we don't want to change the original data, but rather want to return a new copy of the data that has been modified, we must be sure to copy our dataframe before making any changes:

...
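A minimal sketch of that copy-first pattern (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'mag': [1.35, 4.20], 'tsunami': [0, 1]})

modified = df.copy()                           # work on a copy, not the original
modified['big_quake'] = modified.mag > 4       # add a Boolean column
modified = modified.drop(columns=['tsunami'])  # drop() returns a new DataFrame

# the original is untouched: it still has tsunami and no big_quake column
print(df.columns.tolist())  # ['mag', 'tsunami']
```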

Summary

In this chapter, we learned how to use pandas for the data collection portion of data analysis and to describe our data with statistics, which will be helpful when we get to the drawing conclusions phase. We learned the main data structures of the pandas library, along with some of the operations we can perform on them. Next, we learned how to create DataFrame objects from a variety of sources, including flat files and API requests. Using earthquake data, we discussed how to summarize our data and calculate statistics from it. Subsequently, we addressed how to take subsets of data via selection, slicing, indexing, and filtering. Finally, we practiced adding and removing both columns and rows from our dataframe.

These tasks also form the backbone of our pandas workflow and the foundation for the new topics we will cover in the next few chapters on data wrangling, aggregation, and data visualization. Be sure to complete the exercises provided in the next section before moving...

Exercises

Using the data/parsed.csv file and the material from this chapter, complete the following exercises to practice your pandas skills:

  1. Find the 95th percentile of earthquake magnitude in Japan using the mb magnitude type.
  2. Find the percentage of earthquakes in Indonesia that were coupled with tsunamis.
  3. Calculate summary statistics for earthquakes in Nevada.
  4. Add a column indicating whether the earthquake happened in a country or US state that is on the Ring of Fire. Use Alaska, Antarctica (look for Antarctic), Bolivia, California, Canada, Chile, Costa Rica, Ecuador, Fiji, Guatemala, Indonesia, Japan, Kermadec Islands, Mexico (be careful not to select New Mexico), New Zealand, Peru, Philippines, Russia, Taiwan, Tonga, and Washington.
  5. Calculate the number of earthquakes in the Ring of Fire locations and the number outside of them.
  6. Find the tsunami count along the Ring of Fire.
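As a hint for the first two exercises, filtering followed by quantile(), or taking the mean of the Boolean tsunami column, gets most of the way there. Here is a sketch on toy data (the column names parsed_place, magType, mag, and tsunami are assumptions about data/parsed.csv):

```python
import pandas as pd

# Toy stand-in for data/parsed.csv; the column names are assumed.
df = pd.DataFrame({
    'parsed_place': ['Japan', 'Japan', 'Indonesia', 'Indonesia'],
    'magType': ['mb', 'mb', 'mww', 'mb'],
    'mag': [4.9, 5.4, 6.0, 5.1],
    'tsunami': [0, 1, 1, 0],
})

# Exercise 1: 95th percentile of magnitude in Japan using the mb magnitude type
p95 = df[(df.parsed_place == 'Japan') & (df.magType == 'mb')].mag.quantile(0.95)

# Exercise 2: percentage of earthquakes in Indonesia coupled with tsunamis
pct = df[df.parsed_place == 'Indonesia'].tsunami.mean() * 100
```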

Further reading

Those with an R and/or SQL background may find it helpful to see how the pandas syntax compares:

  • Comparison with R / R Libraries: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html
  • Comparison with SQL: https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
  • SQL Queries: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html

The following are some resources on working with serialized data:

  • Pickle in Python: Object Serialization: https://www.datacamp.com/community/tutorials/pickle-python-tutorial
  • Read RData/RDS files into pandas.DataFrame objects (pyreadr): https://github.com/ofajardo/pyreadr

Additional resources for working with APIs are as follows:

  • Documentation for the requests package: https://requests.readthedocs.io/en/master/
  • HTTP Methods: https://restfulapi.net/http-methods/
  • HTTP Status Codes: https://restfulapi...
