You're reading from Pandas 1.x Cookbook - Second Edition
Introduction
The goal of this chapter is to introduce a foundation of pandas by thoroughly inspecting the Series and DataFrame data structures. It is important for pandas users to know the difference between a Series and a DataFrame.
The pandas library is useful for dealing with structured data. What is structured data? Data that is stored in tables, such as CSV files, Excel spreadsheets, or database tables, is all structured. Unstructured data consists of free-form text, images, sound, or video. If you find yourself dealing with structured data, pandas will be of great utility to you.
In this chapter, you will learn how to select a single column of data from a DataFrame (a two-dimensional dataset), which is returned as a Series (a one-dimensional dataset). Working with this one-dimensional object makes it easy to show how different methods and operators work. Many Series methods return another Series as output. This leads to the possibility of calling further methods in succession...
The pandas DataFrame
Before diving deep into pandas, it is worth knowing the components of the DataFrame. Visually, the displayed output of a pandas DataFrame (in a Jupyter Notebook) appears to be nothing more than an ordinary table of data consisting of rows and columns. Hiding beneath the surface are three components (the index, the columns, and the data) that you must be aware of to get the most out of a DataFrame.
This recipe reads the movie dataset into a pandas DataFrame and provides a labeled diagram of all its major components.
>>> movies = pd.read_csv("data/movie.csv")
>>> movies
color direc/_name ... aspec/ratio movie/likes
0 Color James Cameron ... 1.78 33000
1 Color Gore Verbinski ... 2.35 0
2 Color Sam Mendes ... 2.35 85000
3 Color Christopher Nolan ... 2.35 164000
4 NaN Doug Walker .....
DataFrame attributes
Each of the three DataFrame components (the index, columns, and data) may be accessed from a DataFrame. You might want to perform operations on the individual components and not on the DataFrame as a whole. In general, though we can pull out the data into a NumPy array, unless all the columns are numeric, we usually leave it in a DataFrame. DataFrames are ideal for managing heterogeneous columns of data; NumPy arrays are not.
This recipe pulls out the index, columns, and the data of the DataFrame into their own variables, and then shows how the columns and index are inherited from the same object.
How to do it…
- Use the DataFrame attributes .index and .columns, together with the .to_numpy method, to assign the index, columns, and data to their own variables:
>>> movies = pd.read_csv("data/movie.csv")
>>> columns = movies.columns
>>> index = movies.index
>>> data = movies.to_numpy()
- Display...
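As a minimal sketch of the three components, the following builds a tiny made-up frame (the values are illustrative and are not taken from the movie dataset) and pulls out each piece. It also checks the class relationship between the row labels and the column labels.

```python
import pandas as pd

# Made-up three-row frame standing in for the movie dataset (illustrative values only).
df = pd.DataFrame(
    {
        "director_name": ["A. Director", "B. Director", None],
        "imdb_score": [7.5, 6.0, 8.5],
    }
)

index = df.index        # row labels
columns = df.columns    # column labels
data = df.to_numpy()    # the values as a NumPy array

print(type(index).__name__)
print(type(columns).__name__)
print(type(data).__name__)

# The default row labels are a RangeIndex, which is a subclass of Index,
# the same class used for the column labels.
print(issubclass(pd.RangeIndex, pd.Index))

# Mixing string and float columns forces a generic object dtype for the array.
print(data.dtype)
```

Because the array has to hold both strings and floats, NumPy falls back to the object dtype, which is one reason heterogeneous data is usually left in the DataFrame.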
Understanding data types
In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possible values. Categorical data, on the other hand, takes on a discrete, finite number of values, such as car color, type of poker hand, or brand of cereal.
pandas does not broadly classify data as either continuous or categorical. Instead, it has precise technical definitions for many distinct data types. The following describes common pandas data types:
- float – The NumPy float type, which supports missing values
- int – The NumPy integer type, which does not support missing values
- 'Int64' – pandas nullable integer type
- object – The NumPy type for storing strings (and mixed types)
- 'category' – pandas categorical type, which does...
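The types listed above can be seen side by side in a short sketch. The Series below hold made-up values, and the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.5, np.nan, 3.0])               # NumPy float; NaN marks the missing value
ints = pd.Series([1, 2, 3])                          # NumPy integer; cannot hold a missing value
nullable = pd.Series([1, None, 3], dtype="Int64")    # pandas nullable integer
strings = pd.Series(["a", "b", None], dtype=object)  # object: strings and mixed types
cats = pd.Series(["red", "blue", "red"], dtype="category")

for s in (floats, ints, nullable, strings, cats):
    print(s.dtype)

# Introducing a missing value into a NumPy integer Series upcasts it to float,
# because the label 5 does not exist and its value becomes NaN.
upcast = ints.reindex([0, 1, 5])
print(upcast.dtype)
```

The last two lines show the practical consequence of the int type not supporting missing values: pandas silently converts the column to float, whereas the 'Int64' Series keeps its integer type.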
Selecting a column
Selecting a single column from a DataFrame returns a Series (that has the same index as the DataFrame). It is a single dimension of data, composed of just an index and the data. You can also create a Series by itself without a DataFrame, but it is more common to pull one off of a DataFrame.
This recipe examines two different syntaxes to select a single column of data, a Series. One syntax uses the index operator and the other uses attribute access (or dot notation).
How to do it…
- Pass a column name as a string to the indexing operator to select a Series of data:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies["director_name"]
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel...
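To compare the two syntaxes without the CSV file, here is a small sketch on a made-up frame (the rows are illustrative). It checks that both forms return the same Series and that the Series keeps the DataFrame's index.

```python
import pandas as pd

# Made-up stand-in for the movie dataset.
movies = pd.DataFrame(
    {
        "director_name": ["A. Director", "B. Director", "C. Director"],
        "imdb_score": [7.5, 6.0, 8.5],
    }
)

by_index = movies["director_name"]  # index operator
by_attr = movies.director_name      # attribute access (dot notation)

print(type(by_index).__name__)
print(by_index.equals(by_attr))             # same data either way
print(by_index.index.equals(movies.index))  # the Series shares the DataFrame's index
print(by_index.name)                        # the column label becomes the Series name
```

The index operator works for any column label, while attribute access only works when the label is a valid Python identifier that does not clash with an existing DataFrame attribute.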
Calling Series methods
A typical workflow in pandas will have you going back and forth between executing statements on Series and DataFrames. Calling Series methods is the primary way to use the abilities that the Series offers.
Both Series and DataFrames have a tremendous amount of power. We can use the built-in dir function to uncover all the attributes and methods of a Series. In the following code, we also show the number of attributes and methods common to both Series and DataFrames. Both of these objects share the vast majority of attribute and method names:
>>> s_attr_methods = set(dir(pd.Series))
>>> len(s_attr_methods)
471
>>> df_attr_methods = set(dir(pd.DataFrame))
>>> len(df_attr_methods)
458
>>> len(s_attr_methods & df_attr_methods)
400
As you can see, there is a lot of functionality on both of these objects. Don't be overwhelmed by this. Most pandas users only use a subset of the functionality and get...
Series operations
There exist a vast number of operators in Python for manipulating objects. For instance, when the plus operator is placed between two integers, Python will add them together:
>>> 5 + 9 # plus operator example. Adds 5 and 9
14
Series and DataFrames support many of the Python operators. Typically, a new Series or DataFrame is returned when using an operator.
In this recipe, a variety of operators will be applied to different Series objects to produce a new Series with completely different values.
How to do it…
- Select the imdb_score column as a Series:
>>> movies = pd.read_csv("data/movie.csv")
>>> imdb_score = movies["imdb_score"]
>>> imdb_score
0 7.9
1 7.1
2 6.8
3 8.5
4 7.1
...
4911 7.7
4912 7.5
4913 6.3
4914 6.3
4915 6.6
Name: imdb_score, Length: 4916, dtype: float64
- Use the plus operator...
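A short sketch of operators applied to a Series follows. The scores are made up, chosen so that the arithmetic is exact, and the comparison threshold is arbitrary.

```python
import pandas as pd

# Made-up scores standing in for the imdb_score column.
imdb_score = pd.Series([7.5, 6.0, 8.5], name="imdb_score")

plus_one = imdb_score + 1      # arithmetic operator: applied to every element
above_seven = imdb_score > 7   # comparison operator: produces a boolean Series
via_method = imdb_score.add(1) # method equivalent of the plus operator

print(plus_one.tolist())
print(above_seven.tolist())
print(plus_one.equals(via_method))

# The original Series is unchanged; each operator returned a new object.
print(imdb_score.tolist())
```

Each operator has a method counterpart (such as .add for the plus operator), which is convenient when operations are chained one after another.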
Chaining Series methods
In Python, every variable points to an object, and many attributes and methods return new objects. This allows sequential invocation of methods using attribute access. This is called method chaining or flow programming. pandas is a library that lends itself well to method chaining, as many Series and DataFrame methods return more Series and DataFrames, upon which more methods can be called.
To motivate method chaining, let's take an English sentence and translate the chain of events into a chain of methods. Consider the sentence: A person drives to the store to buy food, then drives home and prepares, cooks, serves, and eats the food before cleaning the dishes.
A Python version of this sentence might take the following form:
(person.drive('store')
.buy('food')
.drive('home')
.prepare('food')
.cook('food')
.serve('food')
.eat('food...
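The same chaining style applies directly to pandas objects. The sketch below uses a made-up Series of director names and the placeholder label "Unknown", both of which are assumptions for illustration. Each method returns a new Series or a scalar, so the next call can follow immediately.

```python
import pandas as pd

# Made-up director names, including one missing value.
directors = pd.Series(
    ["Nolan", "Mendes", None, "Nolan", "Cameron"], name="director_name"
)

# Chain 1: flag missing values, then count them.
missing = directors.isna().sum()

# Chain 2: fill missing values, count occurrences, keep the most frequent entry.
top = (
    directors
    .fillna("Unknown")
    .value_counts()
    .head(1)
)

print(missing)
print(top)
```

Wrapping the chain in parentheses lets each method sit on its own line, which makes it easier to read and to comment out a single step while debugging.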
Renaming column names
One of the most common operations on a DataFrame is to rename the columns. I like to rename my columns so that they are also valid Python attribute names. This means that they do not start with numbers and are lowercased alphanumerics with underscores. Good column names should also be descriptive, brief, and not clash with existing DataFrame or Series attributes.
In this recipe, the column names are renamed. The motivation for renaming is to make your code easier to understand, and also to let your environment assist you. Recall that Jupyter will allow you to complete Series methods if you access the Series using dot notation (but will not allow method completion on index access).
How to do it…
- Read in the movie dataset, and make the index meaningful by setting it as the movie title:
>>> movies = pd.read_csv("data/movie.csv")
- The .rename DataFrame method accepts dictionaries that map the old...
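As a hedged sketch of renaming, the following uses made-up frames and a hypothetical mapping (the new names "director" and "score" are chosen only for illustration). It shows a dictionary passed to the columns parameter of .rename, and a callable that turns arbitrary labels into valid attribute names.

```python
import pandas as pd

# Made-up stand-in for the movie dataset.
movies = pd.DataFrame(
    {"director_name": ["A. Director"], "imdb_score": [7.5]}
)

# Hypothetical old-name -> new-name mapping.
col_map = {"director_name": "director", "imdb_score": "score"}
renamed = movies.rename(columns=col_map)

print(list(renamed.columns))
print(list(movies.columns))  # the original frame keeps its names

# A callable also works: here, labels with spaces and capitals are
# normalized to lowercase names joined by underscores.
messy = pd.DataFrame({"Director Name": ["A. Director"], "IMDB Score": [7.5]})
clean = messy.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

print(list(clean.columns))
```

After the second rename, a column such as clean.director_name can be reached with dot notation, which is what enables method completion in Jupyter.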
Creating and deleting columns
During data analysis, it is likely that you will need to create new columns to represent new variables. Commonly, these new columns will be created from previous columns already in the dataset. pandas has a few different ways to add new columns to a DataFrame.
In this recipe, we create new columns in the movie dataset by using the .assign method and then delete columns with the .drop method.
How to do it…
- One way to create a new column is to do an index assignment. Note that this will not return a new DataFrame but mutate the existing DataFrame. If you assign the column to a scalar value, it will use that value for every cell in the column. Let's create the has_seen column in the movie dataset to indicate whether or not we have seen the movie. We will assign zero for every value. By default, new columns are appended to the end:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies...
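The difference between index assignment, .assign, and .drop can be sketched on a small made-up frame. The is_good column and its threshold are hypothetical and are used only to illustrate a derived column.

```python
import pandas as pd

# Made-up stand-in for the movie dataset.
movies = pd.DataFrame(
    {"title": ["Film A", "Film B", "Film C"], "imdb_score": [7.5, 6.0, 8.5]}
)

# Index assignment mutates movies in place; the scalar fills every row.
movies["has_seen"] = 0

# .assign returns a new DataFrame and leaves movies untouched.
with_flag = movies.assign(is_good=movies["imdb_score"] > 7)

# .drop also returns a new DataFrame, without the named column.
trimmed = with_flag.drop(columns="has_seen")

print(list(movies.columns))
print(list(with_flag.columns))
print(list(trimmed.columns))
```

Because .assign and .drop both return new DataFrames, they fit into a method chain, while index assignment has to stand as its own statement.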