
You're reading from Hands-On Data Analysis with Pandas - Second Edition

Product type: Book
Published in: Apr 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800563452
Edition: 2nd Edition
Author (1)
Stefanie Molin

Stefanie Molin is a data scientist and software engineer at Bloomberg LP in NYC, tackling tough problems in information security, particularly revolving around anomaly detection, building tools for gathering data, and knowledge sharing. She has extensive experience in data science, designing anomaly detection solutions, and utilizing machine learning in both R and Python in the AdTech and FinTech industries. She holds a B.S. in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, with minors in economics, and entrepreneurship and innovation. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.


Chapter 2: Working with Pandas DataFrames

The time has come for us to begin our journey into the pandas universe. This chapter will get us comfortable working with some of the basic, yet powerful, operations we will be performing when conducting our data analyses with pandas.

We will begin with an introduction to the main data structures we will encounter when working with pandas. Data structures provide us with a format for organizing, managing, and storing data. Knowledge of pandas data structures will prove invaluable when it comes to troubleshooting or looking up how to perform an operation on the data. Keep in mind that these data structures are different from the standard Python data structures for a reason: they were created for specific analysis tasks. We must remember that a given method may only work on a certain data structure, so we need to be able to identify the best structure for the problem we are looking to solve.

Next, we will bring our first dataset...

Chapter materials

The files we will be working with in this chapter can be found in the GitHub repository at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_02. We will be working with earthquake data from the US Geological Survey (USGS) by using the USGS API and CSV files, which can be found in the data/ directory.

There are four CSV files and a SQLite database file, all of which will be used at different points throughout this chapter. The earthquakes.csv file contains data that's been pulled from the USGS API for September 18, 2018 through October 13, 2018. For our discussion of data structures, we will work with the example_data.csv file, which contains five rows and a subset of the columns from the earthquakes.csv file. The tsunamis.csv file is a subset of the data in the earthquakes.csv file for all earthquakes that were accompanied by tsunamis during the aforementioned date range. The quakes.db file contains a SQLite database...

Pandas data structures

Python has several data structures already, such as tuples, lists, and dictionaries. Pandas provides two main structures to facilitate working with data: Series and DataFrame. The Series and DataFrame data structures each contain another pandas data structure, Index, that we must also be aware of. However, in order to understand these data structures, we need to first take a look at NumPy (https://numpy.org/doc/stable/), which provides the n-dimensional arrays that pandas builds upon.
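To make those relationships concrete, here is a minimal sketch (with made-up values, not the book's data) showing that a Series wraps a one-dimensional NumPy array, a DataFrame is a table of Series columns, and both carry an Index:

```python
import numpy as np
import pandas as pd

arr = np.array([1.5, 2.0, 3.1])      # a one-dimensional NumPy ndarray
s = pd.Series(arr, name='mag')       # a Series wraps a 1D array with an Index
df = pd.DataFrame({'mag': arr})      # a DataFrame is a 2D table of Series columns

print(s.index)                # RangeIndex(start=0, stop=3, step=1)
print(df['mag'].to_numpy())   # the underlying NumPy array: [1.5 2.  3.1]
```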

The aforementioned data structures are implemented as Python classes; when we actually create one, they are referred to as objects or instances. This is an important distinction, since, as we will see, some actions can be performed using the object itself (a method), whereas others will require that we pass our object in as an argument to some function. Note that, in Python, class names are traditionally written in CapWords, while objects are written in snake_case. (More Python...
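For example (using a throwaway dataframe of made-up values), the same object can be used both ways:

```python
import pandas as pd

# snake_case name for an instance of the CapWords DataFrame class
df = pd.DataFrame({'mag': [1.1, 2.2, 3.3]})

first_rows = df.head(2)  # a method: called on the object itself
row_count = len(df)      # a function: the object is passed in as an argument
print(type(df))          # <class 'pandas.core.frame.DataFrame'>
```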

Creating a pandas DataFrame

Now that we understand the data structures we will be working with, we can discuss the different ways we can create them. Before we dive into the code, however, it's important to know how to get help right from Python. Should we ever find ourselves unsure of how to use something in Python, we can utilize the built-in help() function. We simply run help(), passing in the package, module, class, object, method, or function whose documentation we want to read. We can, of course, look up the documentation online; however, in most cases, the docstrings (the documentation text written in the code) that help() returns will be equivalent, since they are used to generate the documentation.

Assuming we first ran import pandas as pd, we can run help(pd) to display information about the pandas package; help(pd.DataFrame) for all the methods and attributes of DataFrame objects (note we can also pass in a DataFrame object instead); and help...
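For instance (assuming pandas is installed), each of the following prints the relevant docstring:

```python
import pandas as pd

help(pd)                 # documentation for the entire pandas package
help(pd.DataFrame)       # the DataFrame class (a DataFrame instance works too)
help(pd.DataFrame.head)  # a single method
```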

Inspecting a DataFrame object

The first thing we should do when we read in our data is inspect it; we want to make sure that our dataframe isn't empty and that the rows look as we would expect. Our main goal is to verify that it was read in properly and that all the data is there; however, this initial inspection will also give us ideas with regard to where we should direct our data wrangling efforts. In this section, we will explore ways in which we can inspect our dataframes in the 4-inspecting_dataframes.ipynb notebook.

Since this is a new notebook, we must once again handle our setup. This time, we need to import pandas and numpy, as well as read in the CSV file with the earthquake data:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('data/earthquakes.csv')

Examining the data

First, we want to make sure that we actually have data in our dataframe. We can check the empty attribute to find out:

>>> df.empty
False
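Beyond the empty attribute, a handful of attributes and methods cover most first-pass checks. Here is a sketch on a small stand-in dataframe (made-up values; the real data comes from data/earthquakes.csv):

```python
import pandas as pd

# A tiny stand-in for the earthquake data.
df = pd.DataFrame({
    'mag': [1.35, 4.20, 2.00],
    'place': ['California', 'Fiji', 'Nevada'],
    'tsunami': [0, 1, 0],
})

print(df.empty)   # False -- there is data
print(df.shape)   # (3, 3) -- (number of rows, number of columns)
print(df.dtypes)  # the data type of each column
df.head(2)        # the first two rows; df.tail() shows the last ones
df.describe()     # summary statistics for the numeric columns
```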

Grabbing subsets of the data

So far, we have learned how to work with and summarize the data as a whole; however, we will often be interested in performing operations and/or analyses on subsets of our data. There are many types of subsets we may look to isolate from our data, such as selecting only specific columns or rows as a whole or when a specific criterion is met. In order to obtain subsets of the data, we need to be familiar with selection, slicing, indexing, and filtering.

For this section, we will work in the 5-subsetting_data.ipynb notebook. Our setup is as follows:

>>> import pandas as pd
>>> df = pd.read_csv('data/earthquakes.csv')

Selecting columns

In the previous section, we saw an example of column selection when we looked at the unique values in the alert column; we accessed the column as an attribute of the dataframe. Remember that a column is a Series object, so, for example, selecting the mag column in the earthquake data gives...
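The main selection patterns can be sketched as follows (again on a small dataframe of made-up values):

```python
import pandas as pd

df = pd.DataFrame({
    'mag': [1.35, 4.20, 2.00],
    'place': ['California', 'Fiji', 'Nevada'],
    'tsunami': [0, 1, 0],
})

mag = df.mag                    # attribute access -> Series
same = df['mag']                # bracket access works for any column name
subset = df[['mag', 'place']]   # a list of names -> DataFrame
big = df[df.mag >= 2.0]         # a Boolean mask filters rows
places = df.loc[df.mag >= 2.0, 'place']  # rows by mask, columns by label
```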

Adding and removing data

In the previous sections, we frequently selected a subset of the columns, but if columns/rows aren't useful to us, we should just get rid of them. We also frequently selected data based on the value of the mag column; however, if we had made a new column holding the Boolean values for later selection, we would have only needed to calculate the mask once. Very rarely will we get data where we neither want to add nor remove something.

Before we begin adding and removing data, it's important to understand that while most methods will return a new DataFrame object, some will be in-place and change our data. If we write a function where we pass in a dataframe and change it, it will change our original dataframe as well. Should we find ourselves in a situation where we don't want to change the original data, but rather want to return a new copy of the data that has been modified, we must be sure to copy our dataframe before making any changes:

...
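A minimal sketch of that copy-first pattern (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'mag': [1.35, 4.20], 'tsunami': [0, 1]})

modified = df.copy()                           # work on a copy, not the original
modified['big_quake'] = modified.mag > 4       # add a Boolean column
modified = modified.drop(columns=['tsunami'])  # drop() returns a new DataFrame

# the original is untouched: it still has tsunami and no big_quake column
print(df.columns.tolist())  # ['mag', 'tsunami']
```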

Summary

In this chapter, we learned how to use pandas for the data collection portion of data analysis and to describe our data with statistics, which will be helpful when we get to the drawing conclusions phase. We learned the main data structures of the pandas library, along with some of the operations we can perform on them. Next, we learned how to create DataFrame objects from a variety of sources, including flat files and API requests. Using earthquake data, we discussed how to summarize our data and calculate statistics from it. Subsequently, we addressed how to take subsets of data via selection, slicing, indexing, and filtering. Finally, we practiced adding and removing both columns and rows from our dataframe.

These tasks also form the backbone of our pandas workflow and the foundation for the new topics we will cover in the next few chapters on data wrangling, aggregation, and data visualization. Be sure to complete the exercises provided in the next section before moving...

Exercises

Using the data/parsed.csv file and the material from this chapter, complete the following exercises to practice your pandas skills:

  1. Find the 95th percentile of earthquake magnitude in Japan using the mb magnitude type.
  2. Find the percentage of earthquakes in Indonesia that were coupled with tsunamis.
  3. Calculate summary statistics for earthquakes in Nevada.
  4. Add a column indicating whether the earthquake happened in a country or US state that is on the Ring of Fire. Use Alaska, Antarctica (look for Antarctic), Bolivia, California, Canada, Chile, Costa Rica, Ecuador, Fiji, Guatemala, Indonesia, Japan, Kermadec Islands, Mexico (be careful not to select New Mexico), New Zealand, Peru, Philippines, Russia, Taiwan, Tonga, and Washington.
  5. Calculate the number of earthquakes in the Ring of Fire locations and the number outside of them.
  6. Find the tsunami count along the Ring of Fire.
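As a hint for the first two exercises, filtering followed by quantile(), or taking the mean of the Boolean tsunami column, gets most of the way there. Here is a sketch on toy data (the column names parsed_place, magType, mag, and tsunami are assumptions about data/parsed.csv):

```python
import pandas as pd

# Toy stand-in for data/parsed.csv; the column names are assumed.
df = pd.DataFrame({
    'parsed_place': ['Japan', 'Japan', 'Indonesia', 'Indonesia'],
    'magType': ['mb', 'mb', 'mww', 'mb'],
    'mag': [4.9, 5.4, 6.0, 5.1],
    'tsunami': [0, 1, 1, 0],
})

# Exercise 1: 95th percentile of magnitude in Japan using the mb magnitude type
p95 = df[(df.parsed_place == 'Japan') & (df.magType == 'mb')].mag.quantile(0.95)

# Exercise 2: percentage of earthquakes in Indonesia coupled with tsunamis
pct = df[df.parsed_place == 'Indonesia'].tsunami.mean() * 100
```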

Further reading

Those with an R and/or SQL background may find it helpful to see how the pandas syntax compares:

  • Comparison with R / R Libraries: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html
  • Comparison with SQL: https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
  • SQL Queries: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html

The following are some resources on working with serialized data:

  • Pickle in Python: Object Serialization: https://www.datacamp.com/community/tutorials/pickle-python-tutorial
  • Read RData/RDS files into pandas.DataFrame objects (pyreadr): https://github.com/ofajardo/pyreadr

Additional resources for working with APIs are as follows:

  • Documentation for the requests package: https://requests.readthedocs.io/en/master/
  • HTTP Methods: https://restfulapi.net/http-methods/
  • HTTP Status Codes: https://restfulapi...
