You're reading from Pandas 1.x Cookbook - Second Edition

Product type: Book

Published in: Feb 2020

Reading level: Beginner

Publisher: Packt

ISBN-13: 9781839213106

Edition: 2nd Edition

Languages

Python

Tools

Pandas

Concepts

Programming Language

Authors (2):

Matt Harrison

Theodore Petrou


Importing pandas

Most users of the pandas library will use an import alias so they can refer to it as pd. In general in this book, we will not show the pandas and NumPy imports, but they look like this:

>>> import pandas as pd
>>> import numpy as np

Introduction

The goal of this chapter is to introduce a foundation of pandas by thoroughly inspecting the Series and DataFrame data structures. It is important for pandas users to know the difference between a Series and a DataFrame.

The pandas library is useful for dealing with structured data. What is structured data? Data that is stored in tables, such as CSV files, Excel spreadsheets, or database tables, is all structured. Unstructured data consists of free-form text, images, sound, or video. If you find yourself dealing with structured data, pandas will be of great utility to you.

In this chapter, you will learn how to select a single column of data from a DataFrame (a two-dimensional dataset), which is returned as a Series (a one-dimensional dataset). Working with this one-dimensional object makes it easy to show how different methods and operators work. Many Series methods return another Series as output. This leads to the possibility of calling further methods in succession...

The pandas DataFrame

Before diving deep into pandas, it is worth knowing the components of the DataFrame. Visually, the displayed output of a pandas DataFrame (in a Jupyter Notebook) appears to be nothing more than an ordinary table of data consisting of rows and columns. Hiding beneath the surface are three components (the index, the columns, and the data) that you must be aware of to make full use of the DataFrame.

This recipe reads the movie dataset into a pandas DataFrame and provides a labeled diagram of all its major components.

>>> movies = pd.read_csv("data/movie.csv")
>>> movies
      color        direc/_name  ...  aspec/ratio  movie/likes
0     Color      James Cameron  ...         1.78        33000
1     Color     Gore Verbinski  ...         2.35            0
2     Color         Sam Mendes  ...         2.35        85000
3     Color  Christopher Nolan  ...         2.35       164000
4       NaN        Doug Walker  .....

DataFrame attributes

Each of the three DataFrame components (the index, columns, and data) may be accessed from a DataFrame. You might want to perform operations on the individual components and not on the DataFrame as a whole. In general, though we can pull out the data into a NumPy array, unless all the columns are numeric, we usually leave it in a DataFrame. DataFrames are ideal for managing heterogeneous columns of data; NumPy arrays are much less suited to it.

This recipe pulls out the index, columns, and the data of the DataFrame into their own variables, and then shows how the columns and index are inherited from the same object.

How to do it…

Use the DataFrame attributes .index and .columns and the .to_numpy method to assign the index, columns, and data to their own variables:

>>> movies = pd.read_csv("data/movie.csv")
>>> columns = movies.columns
>>> index = movies.index
>>> data = movies.to_numpy()
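
To make the three components concrete without the CSV file, here is a minimal sketch on a tiny hand-built DataFrame. The two columns and their values are a hypothetical stand-in for the movie dataset, not its real contents; only the .index, .columns, and .to_numpy calls mirror the recipe.

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (illustrative values only)
movies = pd.DataFrame(
    {
        "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes"],
        "imdb_score": [7.9, 7.1, 6.8],
    }
)

columns = movies.columns   # column labels
index = movies.index       # row labels
data = movies.to_numpy()   # the values as a two-dimensional NumPy array

print(type(index).__name__)    # RangeIndex (the default integer index)
print(type(columns).__name__)  # Index
print(type(data).__name__)     # ndarray
print(data.shape)              # (3, 2)

# The row and column labels are closely related: RangeIndex is a subclass of Index
print(issubclass(pd.RangeIndex, pd.Index))  # True
```

Because this frame mixes strings and floats, the extracted array cannot keep a purely numeric dtype, which is one reason to leave mixed data in a DataFrame.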

Display...

Understanding data types

In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possible values. Categorical data, on the other hand, takes one of a discrete, finite set of values, such as car color, type of poker hand, or brand of cereal.

pandas does not broadly classify data as either continuous or categorical. Instead, it has precise technical definitions for many distinct data types. The following describes common pandas data types:

float – The NumPy float type, which supports missing values
int – The NumPy integer type, which does not support missing values
'Int64' – pandas nullable integer type
object – The NumPy type for storing strings (and mixed types)
'category' – pandas categorical type, which does...
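
The differences between these types are easiest to see on small Series. The sketch below uses made-up values and variable names; it only illustrates how missing values interact with the float, int, 'Int64', object, and 'category' types listed above.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, np.nan, 3.0])              # float: NaN marks the missing value
ints = pd.Series([1, 2, 3])                         # NumPy integer: no missing values possible
upcast = pd.Series([1, None, 3])                    # integers plus a missing value become float
nullable = pd.Series([1, None, 3], dtype="Int64")   # pandas nullable integer keeps integers
mixed = pd.Series(["a", 1, None], dtype=object)     # object: strings and mixed types
colors = pd.Series(["red", "blue", "red"], dtype="category")

print(floats.dtype)    # float64
print(ints.dtype)      # a NumPy integer type, typically int64
print(upcast.dtype)    # float64
print(nullable.dtype)  # Int64
print(mixed.dtype)     # object
print(colors.dtype)    # category
print(list(colors.cat.categories))  # ['blue', 'red']
```

Note the capital "I" in 'Int64': the lowercase NumPy integer type would refuse the missing value, which is why the plain `[1, None, 3]` Series falls back to float.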

Selecting a column

Selecting a single column from a DataFrame returns a Series (that has the same index as the DataFrame). It is a single dimension of data, composed of just an index and the data. You can also create a Series by itself without a DataFrame, but it is more common to pull one out of a DataFrame.

This recipe examines two different syntaxes to select a single column of data, a Series. One syntax uses the index operator and the other uses attribute access (or dot notation).

How to do it…

Pass a column name as a string to the indexing operator to select a Series of data:

>>> movies = pd.read_csv("data/movie.csv")
>>> movies["director_name"]
0           James Cameron
1          Gore Verbinski
2              Sam Mendes
3       Christopher Nolan
4             Doug Walker
              ...        
4911          Scott Smith
4912                  NaN
4913     Benjamin Roberds
4914          Daniel...
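
The two syntaxes described above can be compared side by side. This is a small sketch on a hypothetical three-row DataFrame rather than the real movie file; the column names follow the ones used in this chapter.

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (illustrative values only)
movies = pd.DataFrame(
    {
        "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes"],
        "imdb_score": [7.9, 7.1, 6.8],
    }
)

by_index = movies["director_name"]  # index operator: works for any column name
by_attr = movies.director_name      # attribute access: needs a valid Python identifier

print(type(by_index).__name__)               # Series
print(by_index.equals(by_attr))              # True
print(by_index.name)                         # director_name
print(by_index.index.equals(movies.index))   # True: the Series keeps the DataFrame's index
```

The index operator is the safer default, since attribute access fails for column names that contain spaces or that collide with an existing DataFrame attribute or method.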

Calling Series methods

A typical workflow in pandas will have you going back and forth between executing statements on Series and DataFrames. Calling Series methods is the primary way to use the abilities that the Series offers.

Both Series and DataFrames have a tremendous amount of power. We can use the built-in dir function to uncover all the attributes and methods of a Series. In the following code, we also show the number of attributes and methods common to both Series and DataFrames. Both of these objects share the vast majority of attribute and method names:

>>> s_attr_methods = set(dir(pd.Series))
>>> len(s_attr_methods)
471
>>> df_attr_methods = set(dir(pd.DataFrame))
>>> len(df_attr_methods)
458
>>> len(s_attr_methods & df_attr_methods)
400

As you can see, there is a lot of functionality on both of these objects. Don't be overwhelmed by this. Most pandas users only use a subset of the functionality and get...
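
The exact counts shown above depend on the installed pandas version, so a version-independent way to look at the overlap may be more useful. The following sketch restricts the comparison to public names (those without a leading underscore); that filter and the variable names are choices made here for illustration, not part of the recipe.

```python
import pandas as pd

# Keep only public attribute and method names
s_public = {name for name in dir(pd.Series) if not name.startswith("_")}
df_public = {name for name in dir(pd.DataFrame) if not name.startswith("_")}

shared = s_public & df_public
share = len(shared) / len(s_public)

print(f"{share:.0%} of public Series names also exist on DataFrame")
print(sorted(s_public - df_public)[:5])  # a few Series-only names
print("head" in shared, "mean" in shared)  # True True
```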

Series operations

There exist a vast number of operators in Python for manipulating objects. For instance, when the plus operator is placed between two integers, Python will add them together:

>>> 5 + 9  # plus operator example. Adds 5 and 9
14

Series and DataFrames support many of the Python operators. Typically, a new Series or DataFrame is returned when using an operator.

In this recipe, a variety of operators will be applied to different Series objects to produce a new Series with completely different values.

How to do it…

Select the imdb_score column as a Series:

>>> movies = pd.read_csv("data/movie.csv")
>>> imdb_score = movies["imdb_score"]
>>> imdb_score
0       7.9
1       7.1
2       6.8
3       8.5
4       7.1
       ... 
4911    7.7
4912    7.5
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

Use the plus operator...
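
As a rough illustration of how operators behave on a Series, the sketch below applies arithmetic and comparison operators to a short, hand-typed Series. The four scores are copied from the first rows of the output above; everything else is an assumption made for the example.

```python
import pandas as pd

imdb_score = pd.Series([7.9, 7.1, 6.8, 8.5], name="imdb_score")

plus_one = imdb_score + 1     # arithmetic is applied element by element
scaled = imdb_score * 2.5
above_seven = imdb_score > 7  # comparisons return a Boolean Series

print(plus_one)
print(scaled)
print(above_seven.tolist())   # [True, True, False, True]

# Each operator has a method equivalent, for example + and .add
print(plus_one.equals(imdb_score.add(1)))  # True
```

In every case the original imdb_score Series is left unchanged and a new Series with the same index is returned.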

Chaining Series methods

In Python, every variable points to an object, and many attributes and methods return new objects. This allows sequential invocation of methods using attribute access. This is called method chaining or flow programming. pandas is a library that lends itself well to method chaining, as many Series and DataFrame methods return more Series and DataFrames, upon which more methods can be called.

To motivate method chaining, let's take an English sentence and translate the chain of events into a chain of methods. Consider the sentence: A person drives to the store to buy food, then drives home and prepares, cooks, serves, and eats the food before cleaning the dishes.

A Python version of this sentence might take the following form:

(person.drive('store')
       .buy('food')
       .drive('home')
       .prepare('food')
       .cook('food')
       .serve('food')
       .eat('food...
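
The same pattern applies to real pandas objects. Here is a minimal sketch of a chain on a Series, using made-up director names; each method returns a new object, so the next method can be called on the result directly.

```python
import pandas as pd

# Hypothetical director column with one missing value
director = pd.Series(
    ["James Cameron", "Sam Mendes", "James Cameron", None, "James Cameron", "Sam Mendes"],
    name="director_name",
)

# Count each director, then keep the two most frequent
top_two = (
    director
    .value_counts()   # missing values are dropped by default
    .head(2)
)
print(top_two)

# Chain a Boolean test into an aggregation to count missing values
n_missing = (
    director
    .isna()
    .sum()
)
print(n_missing)  # 1
```

Wrapping the chain in parentheses lets each method sit on its own line, which keeps longer chains readable and easy to comment out step by step.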

Renaming column names

One of the most common operations on a DataFrame is to rename the column names. I like to rename my columns so that they are also valid Python attribute names. This means that they do not start with numbers and are lowercased alphanumerics with underscores. Good column names should also be descriptive, brief, and not clash with existing DataFrame or Series attributes.

In this recipe, the column names are renamed. The motivation for renaming is to make your code easier to understand, and also let your environment assist you. Recall that Jupyter will allow you to complete Series methods if you accessed the Series using dot notation (but will not allow method completion on index access).

How to do it…

Read in the movie dataset:
```
>>> movies = pd.read_csv("data/movie.csv")
```
The DataFrame .rename method accepts dictionaries that map the old...
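
A short sketch of renaming with a dictionary follows. The DataFrame is a hypothetical two-column stand-in for the movie dataset, and the new names ("director", "score") are chosen here purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (illustrative values only)
movies = pd.DataFrame(
    {
        "director_name": ["James Cameron", "Gore Verbinski"],
        "imdb_score": [7.9, 7.1],
    }
)

# Map old column names to new ones; columns not in the mapping are left alone
col_map = {"director_name": "director", "imdb_score": "score"}
renamed = movies.rename(columns=col_map)

print(list(renamed.columns))  # ['director', 'score']
print(list(movies.columns))   # ['director_name', 'imdb_score'] (original unchanged)

# .rename also accepts a function, which helps when cleaning many names at once
messy = pd.DataFrame({"Director Name": ["Sam Mendes"], " IMDB Score": [6.8]})
clean = messy.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))
print(list(clean.columns))    # ['director_name', 'imdb_score']
```

Because .rename returns a new DataFrame, the result has to be assigned to a variable (or chained) for the new names to take effect.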

Creating and deleting columns

During data analysis, it is likely that you will need to create new columns to represent new variables. Commonly, these new columns will be created from previous columns already in the dataset. pandas has a few different ways to add new columns to a DataFrame.

In this recipe, we create new columns in the movie dataset by using the .assign method and then delete columns with the .drop method.

How to do it…

One way to create a new column is to do an index assignment. Note that this will not return a new DataFrame but mutate the existing DataFrame. If you assign the column to a scalar value, it will use that value for every cell in the column. Let's create the has_seen column in the movie dataset to indicate whether or not we have seen the movie. We will assign zero for every value. By default, new columns are appended to the end:
```
>>> movies = pd.read_csv("data/movie.csv")
>>> movies...
```
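
The three operations named in this recipe (index assignment, .assign, and .drop) can be sketched on a small DataFrame. The column names actor_1_likes, actor_2_likes, and total_likes and all values are assumptions made for this example, not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset (illustrative values only)
movies = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B"],
        "actor_1_likes": [1000, 40000],
        "actor_2_likes": [936, 5000],
    }
)

# Index assignment mutates the DataFrame in place; a scalar fills every row
movies["has_seen"] = 0
print(list(movies.columns))  # has_seen is appended at the end

# .assign returns a new DataFrame and leaves the original untouched
with_total = movies.assign(
    total_likes=movies["actor_1_likes"] + movies["actor_2_likes"]
)
print(with_total["total_likes"].tolist())  # [1936, 45000]
print("total_likes" in movies.columns)     # False

# .drop also returns a new DataFrame without the named column
trimmed = with_total.drop(columns="total_likes")
print(list(trimmed.columns))
```

Since .assign and .drop both return new DataFrames, they fit naturally into a method chain, whereas index assignment does not.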


Authors (2)

Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.

Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.