Pandas 1.x Cookbook - Second Edition

By Matt Harrison, Theodore Petrou
    Pandas Foundations
About this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter.

This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.

Publication date: February 2020
Publisher: Packt
Pages: 626
ISBN: 9781839213106

 

Importing pandas

Most users of the pandas library will use an import alias so they can refer to it as pd. In general in this book, we will not show the pandas and NumPy imports, but they look like this:

>>> import pandas as pd
>>> import numpy as np
 

Introduction

The goal of this chapter is to introduce a foundation of pandas by thoroughly inspecting the Series and DataFrame data structures. It is important for pandas users to know the difference between a Series and a DataFrame.

The pandas library is useful for dealing with structured data. What is structured data? Data that is stored in tables, such as CSV files, Excel spreadsheets, or database tables, is all structured. Unstructured data consists of free form text, images, sound, or video. If you find yourself dealing with structured data, pandas will be of great utility to you.

In this chapter, you will learn how to select a single column of data from a DataFrame (a two-dimensional dataset), which is returned as a Series (a one-dimensional dataset). Working with this one-dimensional object makes it easy to show how different methods and operators work. Many Series methods return another Series as output. This leads to the possibility of calling further methods in succession, which is known as method chaining.
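For example, here is a minimal sketch of such a chain, using a small made-up Series rather than a real dataset:

>>> s = pd.Series([1.0, None, 3.0])
>>> s.fillna(0).astype(int).max()  # each call returns a new object, so the calls chain
3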

The Index component of the Series and DataFrame is what separates pandas from most other data analysis libraries and is the key to understanding how many operations work. We will get a glimpse of this powerful object when we use it as a meaningful label for Series values. The final two recipes contain tasks that frequently occur during a data analysis.

 

The pandas DataFrame

Before diving deep into pandas, it is worth knowing the components of the DataFrame. Visually, the displayed output of a pandas DataFrame (in a Jupyter Notebook) appears to be nothing more than an ordinary table of data consisting of rows and columns. Hiding beneath the surface are three components that you must be aware of to maximize the DataFrame's full potential: the index, the columns, and the data.

This recipe reads the movie dataset into a pandas DataFrame and provides a labeled diagram of all its major components.

>>> movies = pd.read_csv("data/movie.csv")
>>> movies
      color        direc/_name  ...  aspec/ratio  movie/likes
0     Color      James Cameron  ...         1.78        33000
1     Color     Gore Verbinski  ...         2.35            0
2     Color         Sam Mendes  ...         2.35        85000
3     Color  Christopher Nolan  ...         2.35       164000
4       NaN        Doug Walker  ...          NaN            0
...     ...                ...  ...          ...          ...
4911  Color        Scott Smith  ...          NaN           84
4912  Color                NaN  ...        16.00        32000
4913  Color   Benjamin Roberds  ...          NaN           16
4914  Color        Daniel Hsia  ...         2.35          660
4915  Color           Jon Gunn  ...         1.85          456
DataFrame anatomy

How it works…

pandas first reads the data from disk into memory and into a DataFrame using the read_csv function. By convention, the terms index label and column name refer to the individual members of the index and columns, respectively. The term index refers to all the index labels as a whole, just as the term columns refers to all the column names as a whole.

The index labels and column names allow you to pull out data by label; we will show how later. The index is also used for alignment. When multiple Series or DataFrames are combined, the indexes align first before any calculation occurs. A later recipe will show this as well.

Collectively, the columns and the index are known as the axes. More specifically, the index is axis 0, and the columns are axis 1.
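For example, the .axes attribute returns both axis objects, and their lengths make up the DataFrame's shape:

>>> movies.shape       # (length of axis 0, length of axis 1) = (rows, columns)
(4916, 28)
>>> len(movies.axes)   # a list holding the index and the columns
2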

pandas uses NaN (not a number) to represent missing values. Notice that even though the color column has string values, it uses NaN to represent a missing value.

The three consecutive dots, ..., in the middle of the columns indicate that there is at least one column that exists but is not displayed due to the number of columns exceeding the predefined display limits. By default, pandas shows 60 rows and 20 columns, but we have limited that in the book, so the data fits in a page.

The .head method accepts an optional parameter, n, which controls the number of rows displayed. The default value for n is 5. Similarly, the .tail method returns the last n rows.
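For example (a minimal sketch; the output of each call is a truncated DataFrame like the one shown above, so it is omitted here):

>>> movies.head(2)   # the first two rows
>>> movies.tail(2)   # the last two rows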

 

DataFrame attributes

Each of the three DataFrame components–the index, columns, and data–may be accessed from a DataFrame. You might want to perform operations on the individual components and not on the DataFrame as a whole. In general, though we can pull out the data into a NumPy array, unless all the columns are numeric, we usually leave it in a DataFrame. DataFrames are ideal for managing heterogeneous columns of data; NumPy arrays, not so much.

This recipe pulls out the index, columns, and the data of the DataFrame into their own variables, and then shows how the columns and the index derive from the same type of object.

How to do it…

  1. Use the DataFrame .index and .columns attributes and the .to_numpy method to assign the index, columns, and data to their own variables:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> columns = movies.columns
    >>> index = movies.index
    >>> data = movies.to_numpy()
    
  2. Display each component's values:
    >>> columns
    Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
           'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
           'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
           'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
           'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
           'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
           'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
           'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
          dtype='object')
    >>> index
    RangeIndex(start=0, stop=4916, step=1)
    >>> data
    array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
           ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
           ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
           ...,
           ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
           ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
           ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
    
  3. Output the Python type of each DataFrame component (the word following the last dot of the output):
    >>> type(index)
    <class 'pandas.core.indexes.range.RangeIndex'>
    >>> type(columns)
    <class 'pandas.core.indexes.base.Index'>
    >>> type(data)
    <class 'numpy.ndarray'>
    
  4. The index and the columns are closely related. Both of them are subclasses of Index. This allows you to perform similar operations on both the index and the columns:
    >>> issubclass(pd.RangeIndex, pd.Index)
    True
    >>> issubclass(columns.__class__, pd.Index)
    True
    

How it works…

The index and the columns represent the same thing but along different axes. They are occasionally referred to as the row index and column index.

There are many types of index objects in pandas. If you do not specify the index, pandas will use a RangeIndex. A RangeIndex is a subclass of an Index that is analogous to Python's range object. Its entire sequence of values is not loaded into memory until it is necessary to do so, thereby saving memory. It is completely defined by its start, stop, and step values.
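You can construct one directly to see this. A small sketch; note that slicing a RangeIndex returns another RangeIndex rather than materializing the values:

>>> rng = pd.RangeIndex(start=0, stop=4916, step=1)
>>> rng[:3]
RangeIndex(start=0, stop=3, step=1)
>>> len(rng)
4916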

There's more...

When possible, Index objects are implemented using hash tables that allow for very fast selection and data alignment. They are similar to Python sets in that they support operations such as intersection and union, but are dissimilar because they are ordered and can have duplicate entries.
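For instance (a small sketch with throwaway indexes; the exact repr, Int64Index versus Index, depends on your pandas version):

>>> a = pd.Index([1, 2, 3])
>>> b = pd.Index([2, 3, 4])
>>> a.intersection(b)
Int64Index([2, 3], dtype='int64')
>>> a.union(b)
Int64Index([1, 2, 3, 4], dtype='int64')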

Notice how the .to_numpy method returned a NumPy n-dimensional array, or ndarray. Most of pandas relies heavily on the ndarray. Beneath the index, columns, and data are NumPy ndarrays. They could be considered the base object for pandas that many other objects are built upon. To see this, we can look at the values of the index and columns:

>>> index.to_numpy()
array([   0,    1,    2, ..., 4913, 4914, 4915], dtype=int64)
>>> columns.to_numpy()
array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes',
       'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users',
       'cast_total_facebook_likes', 'actor_3_name',
       'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
       'num_user_for_reviews', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes'], dtype=object)
       

Having said all of that, we usually do not access the underlying NumPy objects. We tend to leave the objects as pandas objects and use pandas operations. However, we regularly apply NumPy functions to pandas objects.

 

Understanding data types

In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possibilities. Categorical data, on the other hand, represents discrete, finite sets of values, such as car color, type of poker hand, or brand of cereal.

pandas does not broadly classify data as either continuous or categorical. Instead, it has precise technical definitions for many distinct data types. The following describes common pandas data types:

  • float – The NumPy float type, which supports missing values
  • int – The NumPy integer type, which does not support missing values
  • 'Int64' – pandas nullable integer type
  • object – The NumPy type for storing strings (and mixed types)
  • 'category' – pandas categorical type, which does support missing values
  • bool – The NumPy Boolean type, which does not support missing values (None becomes False, np.nan becomes True)
  • 'boolean' – pandas nullable Boolean type
  • datetime64[ns] – The NumPy date type, which does support missing values (NaT)
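To illustrate the difference between the NumPy integer type and the nullable 'Int64' type from the list above (a small sketch; the dtype reprs may vary slightly between pandas versions):

>>> pd.Series([1, 2, None]).dtype                   # NumPy ints cannot hold NaN, so pandas upcasts to float
dtype('float64')
>>> pd.Series([1, 2, None], dtype="Int64").dtype    # the nullable integer type keeps integers and uses <NA>
Int64Dtype()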

In this recipe, we display the data type of each column in a DataFrame. After you ingest data, it is crucial to know the type of data held in each column as it fundamentally changes the kind of operations that are possible with it.

How to do it…

  1. Use the .dtypes attribute to display each column name along with its data type:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movies.dtypes
    color                       object
    director_name               object
    num_critic_for_reviews     float64
    duration                   float64
    director_facebook_likes    float64
                                ...   
    title_year                 float64
    actor_2_facebook_likes     float64
    imdb_score                 float64
    aspect_ratio               float64
    movie_facebook_likes         int64
    Length: 28, dtype: object
    
  2. Use the .value_counts method to return the counts of each data type:
    >>> movies.dtypes.value_counts()
    float64    13
    int64       3
    object     12
    dtype: int64
    
  3. Look at the .info method:
    >>> movies.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4916 entries, 0 to 4915
    Data columns (total 28 columns):
    color                        4897 non-null object
    director_name                4814 non-null object
    num_critic_for_reviews       4867 non-null float64
    duration                     4901 non-null float64
    director_facebook_likes      4814 non-null float64
    actor_3_facebook_likes       4893 non-null float64
    actor_2_name                 4903 non-null object
    actor_1_facebook_likes       4909 non-null float64
    gross                        4054 non-null float64
    genres                       4916 non-null object
    actor_1_name                 4909 non-null object
    movie_title                  4916 non-null object
    num_voted_users              4916 non-null int64
    cast_total_facebook_likes    4916 non-null int64
    actor_3_name                 4893 non-null object
    facenumber_in_poster         4903 non-null float64
    plot_keywords                4764 non-null object
    movie_imdb_link              4916 non-null object
    num_user_for_reviews         4895 non-null float64
    language                     4904 non-null object
    country                      4911 non-null object
    content_rating               4616 non-null object
    budget                       4432 non-null float64
    title_year                   4810 non-null float64
    actor_2_facebook_likes       4903 non-null float64
    imdb_score                   4916 non-null float64
    aspect_ratio                 4590 non-null float64
    movie_facebook_likes         4916 non-null int64
    dtypes: float64(13), int64(3), object(12)
    memory usage: 1.1+ MB
    

How it works…

Each DataFrame column has exactly one type. For instance, every value in the column aspect_ratio is a 64-bit float, and every value in movie_facebook_likes is a 64-bit integer. pandas defaults its core numeric types, integers and floats, to 64 bits, regardless of whether a smaller size would be enough to hold the data. Even if a column consists entirely of the integer value 0, the data type will still be int64.
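If memory is a concern, you can downcast explicitly with .astype; a quick sketch:

>>> small = pd.Series([0, 1, 2])
>>> small.dtype
dtype('int64')
>>> small.astype("int8").dtype
dtype('int8')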

The .value_counts method returns the count of all the data types in the DataFrame when called on the .dtypes attribute.

The object data type is the one data type that is unlike the others. A column that is of the object data type may contain values that are of any valid Python object. Typically, when a column is of the object data type, it signals that the entire column is strings. When you load CSV files and string columns are missing values, pandas will stick in a NaN (float) for that cell. So the column might have both object and float (missing) values in it. The .dtypes attribute will show the column as an object (or O on the series). It will not show it as a mixed type column (that contains both strings and floats):

>>> pd.Series(["Paul", np.nan, "George"]).dtype
dtype('O')

The .info method prints the data type information in addition to the count of non-null values. It also lists the amount of memory used by the DataFrame. This is useful information, but it is only printed to the screen. If you need to work with the data types programmatically, the .dtypes attribute returns them as a pandas Series.
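For example, because .dtypes is a Series, you can filter and aggregate it like any other Series (a small sketch):

>>> dtypes = movies.dtypes
>>> (dtypes == "float64").sum()   # how many columns are 64-bit floats
13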

There's more…

Almost all pandas data types are built from NumPy. This tight integration makes it easier for users to integrate pandas and NumPy operations. As pandas grew larger and more popular, the object data type proved to be too generic for all columns with string values. pandas created its own categorical data type to handle columns of strings (or numbers) with a fixed number of possible values.
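For example, converting a low-cardinality string column to 'category' typically shrinks it considerably. A sketch; the exact byte counts depend on your data and pandas version, so they are left as comments:

>>> color = movies["color"]
>>> color.memory_usage(deep=True)                       # object strings: every value stored separately
>>> color.astype("category").memory_usage(deep=True)    # categories stored once, values as small integer codes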

 

Selecting a column

Selecting a single column from a DataFrame returns a Series (that has the same index as the DataFrame). It is a single dimension of data, composed of just an index and the data. You can also create a Series by itself without a DataFrame, but it is more common to pull them off of a DataFrame.

This recipe examines two different syntaxes to select a single column of data, a Series. One syntax uses the index operator and the other uses attribute access (or dot notation).

How to do it…

  1. Pass a column name as a string to the indexing operator to select a Series of data:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movies["director_name"]
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    
  2. Alternatively, you may use attribute access to accomplish the same task:
    >>> movies.director_name
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    
  3. We can also index off of the .loc and .iloc attributes to pull out a Series. The former allows us to pull out by column name, while the latter pulls out by position. These are referred to as label-based and position-based in the pandas documentation.

    The usage of .loc specifies a selector for both rows and columns separated by a comma. The row selector is a slice with no start or end name (:) which means select all of the rows. The column selector will just pull out the column named director_name.

    The .iloc index operation also specifies both row and column selectors. The row selector is the slice with no start or end index (:) that selects all of the rows. The column selector, 1, pulls off the second column (remember that Python is zero-based):

    >>> movies.loc[:, "director_name"]
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    >>> movies.iloc[:, 1]
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    
  4. Jupyter shows the series in a monospace font, and shows the index, type, length, and name of the series. It will also truncate data according to the pandas configuration settings. See the image for a description of these.
    Series anatomy

    You can also view the index, type, length, and name of the series with the appropriate attributes:

    >>> movies["director_name"].index
    RangeIndex(start=0, stop=4916, step=1)
    >>> movies["director_name"].dtype
    dtype('O')
    >>> movies["director_name"].size
    4916
    >>> movies["director_name"].name
    'director_name'
    
  5. Verify that the output is a Series:
    >>> type(movies["director_name"])
    <class 'pandas.core.series.Series'>
    
  6. Note that even though the type is reported as object, because there are missing values, the Series has both floats and strings in it. We can use the .apply method with the type function to get back a Series that has the type of every member. Rather than looking at the whole Series result, we will chain the .unique method onto the result, to look at just the unique types that are found in the director_name column:
    >>> movies["director_name"].apply(type).unique()
    array([<class 'str'>, <class 'float'>], dtype=object)
    

How it works…

A pandas DataFrame typically has multiple columns (though it may also have only one column). Each of these columns can be pulled out and treated as a Series.

There are many mechanisms to pull out a column from a DataFrame. Typically the easiest is to try to access it as an attribute. Attribute access is done with the dot operator (dot notation). There are good things about this:

  • Least amount of typing
  • Jupyter will provide completion on the name
  • Jupyter will provide completion on the Series attributes

There are some downsides as well:

  • Only works with columns that have names that are valid Python attributes and do not conflict with existing DataFrame attributes
  • Cannot create a new column, can only update existing ones

What is a valid Python attribute name? A sequence of letters, digits, and underscores that starts with a letter or underscore. Typically these are in lowercase to follow standard Python naming conventions. This means that column names with spaces or special characters will not work with attribute access.
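As a quick illustration with a hypothetical DataFrame (the column name here is made up):

>>> df = pd.DataFrame({"movie title": ["Avatar", "Spectre"]})
>>> df["movie title"]       # the index operator works with any column name
0     Avatar
1    Spectre
Name: movie title, dtype: object
>>> # df.movie title        # attribute access fails here; the space makes it a SyntaxError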

Selecting column names using the index operator ([) will work with any column name. You can also create and update columns with this operator. Jupyter will provide completion on the column name when you use the index operator, but sadly, will not complete on subsequent Series attributes.

I often find myself using attribute access because getting completion on the Series attribute is very handy. But, I also make sure that the column names are valid Python attribute names that don't conflict with existing DataFrame attributes. I also try not to update using either attribute or index assignment, but rather using the .assign method. You will see many examples of using .assign in this book.

There's more…

To get completion in Jupyter, press the Tab key following a dot, or after starting a string in an index access. Jupyter will pop up a list of completions, and you can use the up and down arrow keys to highlight one, and hit Enter to complete it.

 

Calling Series methods

A typical workflow in pandas will have you going back and forth between executing statements on Series and DataFrames. Calling Series methods is the primary way to use the abilities that the Series offers.

Both Series and DataFrames have a tremendous amount of power. We can use the built-in dir function to uncover all the attributes and methods of a Series. In the following code, we also show the number of attributes and methods common to both Series and DataFrames. Both of these objects share the vast majority of attribute and method names:

>>> s_attr_methods = set(dir(pd.Series))
>>> len(s_attr_methods)
471
>>> df_attr_methods = set(dir(pd.DataFrame))
>>> len(df_attr_methods)
458
>>> len(s_attr_methods & df_attr_methods)
400

As you can see, there is a lot of functionality on both of these objects. Don't be overwhelmed by this. Most pandas users only use a subset of the functionality and get along just fine.

This recipe covers the most common and powerful Series methods and attributes. Many of the methods are nearly equivalent for DataFrames.

How to do it…

  1. After reading in the movies dataset, select two Series with different data types. The director_name column contains strings (pandas calls this an object or O data type), and the column actor_1_facebook_likes contains numerical data (formally float64):
    >>> movies = pd.read_csv("data/movie.csv")
    >>> director = movies["director_name"]
    >>> fb_likes = movies["actor_1_facebook_likes"]
    >>> director.dtype
    dtype('O')
    >>> fb_likes.dtype
    dtype('float64')
    
  2. The .head method lists the first five entries of a Series. You may provide an optional argument to change the number of entries returned. Another option is to use the .sample method to view some of the data. Depending on your dataset, this might provide better insight into your data as the first rows might be very different from subsequent rows:
    >>> director.head()
    0        James Cameron
    1       Gore Verbinski
    2           Sam Mendes
    3    Christopher Nolan
    4          Doug Walker
    Name: director_name, dtype: object
    >>> director.sample(n=5, random_state=42)
    2347      Brian Percival
    4687         Lucio Fulci
    691        Phillip Noyce
    3911       Sam Peckinpah
    2488    Rowdy Herrington
    Name: director_name, dtype: object
    >>> fb_likes.head()
    0     1000.0
    1    40000.0
    2    11000.0
    3    27000.0
    4      131.0
    Name: actor_1_facebook_likes, dtype: float64
    
  3. The data type of the Series usually determines which of the methods will be the most useful. For instance, one of the most useful methods for the object data type Series is .value_counts, which calculates the frequencies:
    >>> director.value_counts()
    Steven Spielberg    26
    Woody Allen         22
    Clint Eastwood      20
    Martin Scorsese     20
    Ridley Scott        16
                        ..
    Eric England         1
    Moustapha Akkad      1
    Jay Oliva            1
    Scott Speer          1
    Leon Ford            1
    Name: director_name, Length: 2397, dtype: int64
    
  4. The .value_counts method is typically more useful for Series with object data types but can occasionally provide insight into numeric Series as well. Used with fb_likes, it appears that higher numbers have been rounded to the nearest thousand as it is unlikely that so many movies received exactly 1,000 likes:
    >>> fb_likes.value_counts()
    1000.0     436
    11000.0    206
    2000.0     189
    3000.0     150
    12000.0    131
              ... 
    362.0        1
    216.0        1
    859.0        1
    225.0        1
    334.0        1
    Name: actor_1_facebook_likes, Length: 877, dtype: int64
    
  5. Counting the number of elements in the Series may be done with the .size or .shape attribute or the built-in len function. The .unique method will return a NumPy array with the unique values:
    >>> director.size
    4916
    >>> director.shape
    (4916,)
    >>> len(director)
    4916
    >>> director.unique()
    array(['James Cameron', 'Gore Verbinski', 'Sam Mendes', ...,
           'Scott Smith', 'Benjamin Roberds', 'Daniel Hsia'], dtype=object)
    
  6. Additionally, there is the .count method, which doesn't return the count of items, but the number of non-missing values:
    >>> director.count()
    4814
    >>> fb_likes.count()
    4909
    
  7. Basic summary statistics are provided with .min, .max, .mean, .median, and .std:
    >>> fb_likes.min()
    0.0
    >>> fb_likes.max()
    640000.0
    >>> fb_likes.mean()
    6494.488490527602
    >>> fb_likes.median()
    982.0
    >>> fb_likes.std()
    15106.986883848309
    
  8. To simplify step 7, you may use the .describe method to return both the summary statistics and a few of the quantiles at once. When .describe is used with an object data type column, a completely different output is returned:
    >>> fb_likes.describe()
    count      4909.000000
    mean       6494.488491
    std       15106.986884
    min           0.000000
    25%         607.000000
    50%         982.000000
    75%       11000.000000
    max      640000.000000
    Name: actor_1_facebook_likes, dtype: float64
    >>> director.describe()
    count                 4814
    unique                2397
    top       Steven Spielberg
    freq                    26
    Name: director_name, dtype: object
    
  9. The .quantile method calculates the quantile of numeric data. Note that if you pass in a scalar, you will get scalar output, but if you pass in a list, the output is a pandas Series:
    >>> fb_likes.quantile(0.2)
    510.0
    >>> fb_likes.quantile(
    ...     [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    ... )
    0.1      240.0
    0.2      510.0
    0.3      694.0
    0.4      854.0
    0.5      982.0
    0.6     1000.0
    0.7     8000.0
    0.8    13000.0
    0.9    18000.0
    Name: actor_1_facebook_likes, dtype: float64
    
  10. Since the .count method in step 6 returned a value less than the total number of Series elements found in step 5, we know that there are missing values in each Series. The .isna method can be used to determine whether each individual value is missing or not. The result is a Series. You may see this referred to as a Boolean array (a Series with Boolean values that has the same index and length as the original Series):
    >>> director.isna()
    0       False
    1       False
    2       False
    3       False
    4       False
            ...  
    4911    False
    4912     True
    4913    False
    4914    False
    4915    False
    Name: director_name, Length: 4916, dtype: bool
    
  11. It is possible to replace all missing values within a Series with the .fillna method:
    >>> fb_likes_filled = fb_likes.fillna(0)
    >>> fb_likes_filled.count()
    4916
    
  12. To remove the elements of a Series that have missing values, use the .dropna method:
    >>> fb_likes_dropped = fb_likes.dropna()
    >>> fb_likes_dropped.size
    4909
    

How it works…

The methods used in this recipe were chosen because of how frequently they are used in data analysis.

The steps in this recipe return different types of objects.

The result from the .head method in step 2 is another Series. The .value_counts method also produces a Series but has the unique values from the original Series as the index and the count as its values. In step 5, the .size property and .count method return scalar values, but the .shape property returns a one-item tuple. This is a convention borrowed from NumPy, which allows for arrays of arbitrary dimensions.

In step 7, each individual method returns a scalar value.

In step 8, the .describe method returns a Series with all the summary statistic names as the index and the statistic as the values.

In step 9, the .quantile method is flexible and returns a scalar value when passed a single value but returns a Series when given a list.

In steps 10, 11, and 12, .isna, .fillna, and .dropna all return a Series.

There's more…

The .value_counts method is one of the most informative Series methods and heavily used during exploratory analysis, especially with categorical columns. It defaults to returning the counts, but by setting the normalize parameter to True, the relative frequencies are returned instead, which provides another view of the distribution:

>>> director.value_counts(normalize=True)
Steven Spielberg    0.005401
Woody Allen         0.004570
Clint Eastwood      0.004155
Martin Scorsese     0.004155
Ridley Scott        0.003324
                      ...
Eric England        0.000208
Moustapha Akkad     0.000208
Jay Oliva           0.000208
Scott Speer         0.000208
Leon Ford           0.000208
Name: director_name, Length: 2397, dtype: float64

In this recipe, we determined that there were missing values in the Series by observing that the result from the .count method did not match the .size attribute. A more direct approach is to inspect the .hasnans attribute:

>>> director.hasnans
True

There exists a complement of .isna; the .notna method, which returns True for all the non-missing values:

>>> director.notna()
0        True
1        True
2        True
3        True
4        True
        ...  
4911     True
4912    False
4913     True
4914     True
4915     True
Name: director_name, Length: 4916, dtype: bool

There is also a .isnull method, which is an alias for .isna. I'm lazy so if I can type less while still being explicit about my intentions, then I'm all for it. Because pandas uses NaN all over the place, I prefer the spelling of .isna to .isnull. We don't ever see NULL anywhere in the pandas or Python world.

 

Series operations

There exist a vast number of operators in Python for manipulating objects. For instance, when the plus operator is placed between two integers, Python will add them together:

>>> 5 + 9  # plus operator example. Adds 5 and 9
14

Series and DataFrames support many of the Python operators. Typically, a new Series or DataFrame is returned when using an operator.

In this recipe, a variety of operators will be applied to different Series objects to produce a new Series with completely different values.

How to do it…

  1. Select the imdb_score column as a Series:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> imdb_score = movies["imdb_score"]
    >>> imdb_score
    0       7.9
    1       7.1
    2       6.8
    3       8.5
    4       7.1
           ... 
    4911    7.7
    4912    7.5
    4913    6.3
    4914    6.3
    4915    6.6
    Name: imdb_score, Length: 4916, dtype: float64
    
  2. Use the plus operator to add one to each Series element:
    >>> imdb_score + 1
    0       8.9
    1       8.1
    2       7.8
    3       9.5
    4       8.1
           ... 
    4911    8.7
    4912    8.5
    4913    7.3
    4914    7.3
    4915    7.6
    Name: imdb_score, Length: 4916, dtype: float64
    
  3. The other basic arithmetic operators, minus (-), multiplication (*), division (/), and exponentiation (**) work similarly with scalar values. In this step, we will multiply the series by 2.5:
    >>> imdb_score * 2.5
    0       19.75
    1       17.75
    2       17.00
    3       21.25
    4       17.75
            ...  
    4911    19.25
    4912    18.75
    4913    15.75
    4914    15.75
    4915    16.50
    Name: imdb_score, Length: 4916, dtype: float64
    
  4. Python uses a double slash (//) for floor division. The floor division operator truncates the result of the division. The percent sign (%) is the modulus operator, which returns the remainder after a division. The Series instances also support these operations:
    >>> imdb_score // 7
    0       1.0
    1       1.0
    2       0.0
    3       1.0
    4       1.0
           ... 
    4911    1.0
    4912    1.0
    4913    0.0
    4914    0.0
    4915    0.0
    Name: imdb_score, Length: 4916, dtype: float64
    
  5. There exist six comparison operators, greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), and not equal to (!=). Each comparison operator turns each value in the Series to True or False based on the outcome of the condition. The result is a Boolean array, which we will see is very useful for filtering in later recipes:
    >>> imdb_score > 7
    0        True
    1        True
    2       False
    3        True
    4        True
            ...  
    4911     True
    4912     True
    4913    False
    4914    False
    4915    False
    Name: imdb_score, Length: 4916, dtype: bool
    >>> director = movies["director_name"]
    >>> director == "James Cameron"
    0        True
    1       False
    2       False
    3       False
    4       False
            ...  
    4911    False
    4912    False
    4913    False
    4914    False
    4915    False
    Name: director_name, Length: 4916, dtype: bool
    

How it works…

All the operators used in this recipe apply the same operation to each element in the Series. In native Python, this would require a for loop to iterate through each of the items in the sequence before applying the operation. pandas relies heavily on the NumPy library, which allows for vectorized computations, or the ability to operate on entire sequences of data without the explicit writing of for loops. Each operation returns a new Series with the same index, but with the new values.
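To see that the vectorized form produces the same values as an explicit loop, here is a small sketch:

>>> looped = pd.Series([v + 1 for v in imdb_score])    # explicit Python loop
>>> vectorized = imdb_score + 1                        # vectorized; no loop written by us
>>> (looped.to_numpy() == vectorized.to_numpy()).all()
True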

There's more…

All of the operators used in this recipe have method equivalents that produce the exact same result. For instance, in step 2, imdb_score + 1 can be reproduced with the .add method.

Using the method rather than the operator can be useful when we chain methods together.

Here are a few examples:

>>> imdb_score.add(1)  # imdb_score + 1
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64
>>> imdb_score.gt(7)  # imdb_score > 7
0        True
1        True
2       False
3        True
4        True
        ...  
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

Why does pandas offer a method equivalent to these operators? By its nature, an operator only operates in exactly one manner. Methods, on the other hand, can have parameters that allow you to alter their default functionality.

Other recipes will dive into this further, but here is a small example. The .sub method performs subtraction on a Series. When you do subtraction with the - operator, any missing value produces a missing value (NaN) in the result. The .sub method, however, allows you to specify a fill_value parameter to use in place of missing values:

>>> money = pd.Series([100, 20, None])
>>> money - 15
0    85.0
1     5.0
2     NaN
dtype: float64
>>> money.sub(15, fill_value=0)
0    85.0
1     5.0
2   -15.0
dtype: float64

Following is a table of operators and the corresponding methods:

Operator group   Operators                Series method names
Arithmetic       +, -, *, /, //, %, **    .add, .sub, .mul, .div, .floordiv, .mod, .pow
Comparison       <, >, <=, >=, ==, !=     .lt, .gt, .le, .ge, .eq, .ne

You may be curious as to how a pandas Series, or any Python object for that matter, knows what to do when it encounters an operator. For example, how does the expression imdb_score * 2.5 know to multiply each element in the Series by 2.5? Python has a built-in, standardized way for objects to communicate with operators using special methods.

Special methods are what objects call internally whenever they encounter an operator. Special methods always begin and end with two underscores. Because of this, they are also called dunder methods as the method that implements the operator is surrounded by double underscores (dunder being a lazy geeky programmer way of saying "double underscores"). For instance, the special method .__mul__ is called whenever the multiplication operator is used. Python interprets the imdb_score * 2.5 expression as imdb_score.__mul__(2.5).

There is no difference between using the special method and using an operator as they are doing the exact same thing. The operator is just syntactic sugar for the special method. However, calling the .mul method is different from calling the .__mul__ method.
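You can verify the equivalence yourself (a quick sketch):

>>> (imdb_score * 2.5).equals(imdb_score.__mul__(2.5))
True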

 

Chaining Series methods

In Python, every variable points to an object, and many attributes and methods return new objects. This allows sequential invocation of methods using attribute access. This is called method chaining or flow programming. pandas is a library that lends itself well to method chaining, as many Series and DataFrame methods return more Series and DataFrames, upon which more methods can be called.

To motivate method chaining, let's take an English sentence and translate the chain of events into a chain of methods. Consider the sentence: A person drives to the store to buy food, then drives home and prepares, cooks, serves, and eats the food before cleaning the dishes.

A Python version of this sentence might take the following form:

(person.drive('store')
       .buy('food')
       .drive('home')
       .prepare('food')
       .cook('food')
       .serve('food')
       .eat('food')
       .cleanup('dishes')
)

In the preceding code, the person is the object (or instance of a class) that calls a method. Each method returns another instance that allows the chain of calls to happen. The parameter passed to each of the methods specifies how the method operates.

Although it is possible to write the entire method chain in a single unbroken line, it is far more palatable to write a single method per line. Since Python does not normally allow a single expression to be written on multiple lines, we have a couple of options. My preferred style is to wrap everything in parentheses. Alternatively, you may end each line with a backslash (\) to indicate that the line continues on the next line. To improve readability even more, you can align the method calls vertically.

This recipe shows a similar method chaining using a pandas Series.

How to do it…

  1. Load in the movie dataset, and pull two columns out of it:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> fb_likes = movies["actor_1_facebook_likes"]
    >>> director = movies["director_name"]
    
  2. Two of the most common methods to append to the end of a chain are the .head and .sample methods. They suppress long output. If the resultant DataFrame is very wide, I like to transpose the results using the .T property. (For shorter chains, there isn't as great a need to place each method on a different line):
    >>> director.value_counts().head(3)
    Steven Spielberg    26
    Woody Allen         22
    Clint Eastwood      20
    Name: director_name, dtype: int64
    
  3. A common way to count the number of missing values is to chain the .sum method after a call to .isna:
    >>> fb_likes.isna().sum()
    7
    
  4. All the non-missing values of fb_likes should be integers as it is impossible to have a partial Facebook like. In most pandas versions, any numeric columns with missing values must have their data type as float (pandas 0.24 introduced the Int64 type, which supports missing values but is not used by default). If we fill missing values from fb_likes with zeros, we can then convert it to an integer with the .astype method:
    >>> fb_likes.dtype
    dtype('float64')
    >>> (fb_likes.fillna(0).astype(int).head())
    0     1000
    1    40000
    2    11000
    3    27000
    4      131
    Name: actor_1_facebook_likes, dtype: int64
    

How it works…

Step 2 first uses the .value_counts method to return a Series and then chains the .head method to select the first three elements. The final returned object is a Series, which could also have had more methods chained on it.

In step 3, the .isna method creates a Boolean array. pandas treats False and True as 0 and 1, so the .sum method returns the number of missing values.
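The same trick with .mean gives the fraction of missing values rather than the count (a small sketch, rounded here for display):

>>> fb_likes.isna().sum()                 # number of missing values
7
>>> round(fb_likes.isna().mean(), 4)      # fraction missing; True counts as 1, False as 0
0.0014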

Each of the three chained methods in step 4 returns a Series. It may not seem intuitive, but the .astype method returns an entirely new Series with a different data type.

There's more…

One potential downside of chaining is that debugging becomes difficult. Because none of the intermediate objects created during the method calls is stored in a variable, it can be hard to trace the exact location in the chain where an error occurred.

One of the nice aspects of putting each call on its own line is that it enables debugging of more complicated commands. I typically build up these chains one method at a time, but occasionally I need to come back to previous code or tweak it slightly.

To debug this code, I start by commenting out all of the calls except the first. Then I uncomment the next call, make sure it works, and move on to the one after.

If I were debugging the previous code, I would comment out the last two method calls and make sure I knew what .fillna was doing:

>>> (
...     fb_likes.fillna(0)
...     # .astype(int)
...     # .head()
... )
0        1000.0
1       40000.0
2       11000.0
3       27000.0
4         131.0
         ...   
4911      637.0
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

Then I would uncomment the next method and ensure that it was working correctly:

>>> (
...     fb_likes.fillna(0).astype(int)
...     # .head()
... )
0        1000
1       40000
2       11000
3       27000
4         131
        ...  
4911      637
4912      841
4913        0
4914      946
4915       86
Name: actor_1_facebook_likes, Length: 4916, dtype: int64

Another option for debugging chains is to call the .pipe method to show an intermediate value. The .pipe method on a Series needs to be passed a function that accepts a Series as input and can return anything (but we want to return a Series if we want to use it in a method chain).

This function, debug_ser, will print out the value of the intermediate result:

>>> def debug_ser(ser):
...     print("BEFORE")
...     print(ser)
...     print("AFTER")
...     return ser
>>> (fb_likes.fillna(0).pipe(debug_ser).astype(int).head())
BEFORE
0        1000.0
1       40000.0
2       11000.0
3       27000.0
4         131.0
         ...   
4911      637.0
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64
AFTER
0     1000
1    40000
2    11000
3    27000
4      131
Name: actor_1_facebook_likes, dtype: int64

If you want to create a global variable to store an intermediate value, you can also use .pipe:

>>> intermediate = None
>>> def get_intermediate(ser):
...     global intermediate
...     intermediate = ser
...     return ser
>>> res = (
...     fb_likes.fillna(0)
...     .pipe(get_intermediate)
...     .astype(int)
...     .head()
... )
>>> intermediate
0        1000.0
1       40000.0
2       11000.0
3       27000.0
4         131.0
         ...   
4911      637.0
4912      841.0
4913        0.0
4914      946.0
4915       86.0
Name: actor_1_facebook_likes, Length: 4916, dtype: float64

As was mentioned at the beginning of the recipe, it is possible to use backslashes for multi-line code. Step 4 may be rewritten this way:

>>> fb_likes.fillna(0)\
...    .astype(int)\
...    .head()
0     1000
1    40000
2    11000
3    27000
4      131
Name: actor_1_facebook_likes, dtype: int64

I prefer wrapping the chain with parentheses. Having to continually add trailing backslashes when you add a method to the chain is annoying.

 

Renaming column names

One of the most common operations on a DataFrame is to rename the column names. I like to rename my columns so that they are also valid Python attribute names. This means that they do not start with numbers and are lowercased alphanumerics with underscores. Good column names should also be descriptive, brief, and not clash with existing DataFrame or Series attributes.

In this recipe, the column names are renamed. The motivation for renaming is to make your code easier to understand, and also to let your environment assist you. Recall that Jupyter will allow you to complete Series methods if you access the Series using dot notation (but will not allow method completion on index access).

How to do it…

  1. Read in the movie dataset:
    >>> movies = pd.read_csv("data/movie.csv")
    
  2. The .rename DataFrame method accepts dictionaries that map old values to new values. Let's create one for the columns:
    >>> col_map = {
    ...     "director_name": "director",
    ...     "num_critic_for_reviews": "critic_reviews",
    ... }
    
  3. Pass the dictionary to the columns parameter of the .rename method:
    >>> movies.rename(columns=col_map).head()
       color           director  ...  aspec/ratio  movie/likes
    0  Color      James Cameron  ...         1.78        33000
    1  Color     Gore Verbinski  ...         2.35            0
    2  Color         Sam Mendes  ...         2.35        85000
    3  Color  Christopher Nolan  ...         2.35       164000
    4    NaN        Doug Walker  ...          NaN            0
    

How it works…

The .rename method on a DataFrame allows for column labels to be renamed. We can also rename the columns by assigning a new list to the .columns attribute, but we cannot chain on an assignment. As I keep saying, I prefer chaining because it makes our code easier to read. The There's more section shows an example of renaming via assignment to the .columns attribute.

There's more…

In this recipe, we changed the names of the columns. You can also rename the index using the .rename method if you want to. This makes more sense if the index labels are strings rather than the default integers. So we will set the index to the movie_title column and then map those values to new ones:

>>> idx_map = {
...     "Avatar": "Ratava",
...     "Spectre": "Ertceps",
...     "Pirates of the Caribbean: At World's End": "POC",
... }
>>> col_map = {
...     "aspect_ratio": "aspect",
...     "movie_facebook_likes": "fblikes",
... }
>>> (
...     movies.set_index("movie_title")
...     .rename(index=idx_map, columns=col_map)
...     .head(3)
... )
             color   director_name  ...  aspect  fblikes
movie_title                         ...                 
Ratava       Color   James Cameron  ...    1.78    33000
POC          Color  Gore Verbinski  ...    2.35        0
Ertceps      Color      Sam Mendes  ...    2.35    85000

There are multiple ways to rename row and column labels. It is possible to reassign the index and column attributes to a Python list. This assignment works when the list has the same number of elements as the row and column labels.

The following code shows an example. We will read the data from the CSV file, and use the index_col parameter to tell pandas to use the movie_title column as the index. Then we use the .to_list method on each Index object to create a Python list of labels. We then modify three values in each of the lists and reassign them to the .index and .columns attributes:

>>> movies = pd.read_csv(
...     "data/movie.csv", index_col="movie_title"
... )
>>> ids = movies.index.to_list()
>>> columns = movies.columns.to_list()
# rename the row and column labels with list assignments
>>> ids[0] = "Ratava"
>>> ids[1] = "POC"
>>> ids[2] = "Ertceps"
>>> columns[1] = "director"
>>> columns[-2] = "aspect"
>>> columns[-1] = "fblikes"
>>> movies.index = ids
>>> movies.columns = columns
>>> movies.head(3)
         color        director  ...  aspect  fblikes
Ratava   Color   James Cameron  ...    1.78    33000
POC      Color  Gore Verbinski  ...    2.35        0
Ertceps  Color      Sam Mendes  ...    2.35    85000

Another option is to pass a function into the .rename method. The function takes a column name and returns a new name. Assuming there are spaces and uppercase characters in the column names, this code will clean them up:

>>> def to_clean(val):
...     return val.strip().lower().replace(" ", "_")
>>> movies.rename(columns=to_clean).head(3)
         color        director  ...  aspect  fblikes
Ratava   Color   James Cameron  ...    1.78    33000
POC      Color  Gore Verbinski  ...    2.35        0
Ertceps  Color      Sam Mendes  ...    2.35    85000

In pandas code in the wild, you will also see list comprehensions used to clean up the column names. With the newly cleaned up list, you can reassign the result back to the .columns attribute. Assuming there are spaces and uppercase characters in the column names, this code will clean them up:

>>> cols = [
...     col.strip().lower().replace(" ", "_")
...     for col in movies.columns
... ]
>>> movies.columns = cols
>>> movies.head(3)
         color        director  ...  aspect  fblikes
Ratava   Color   James Cameron  ...    1.78    33000
POC      Color  Gore Verbinski  ...    2.35        0
Ertceps  Color      Sam Mendes  ...    2.35    85000

Because this code mutates the original DataFrame, consider using the .rename method.

 

Creating and deleting columns

During data analysis, it is likely that you will need to create new columns to represent new variables. Commonly, these new columns will be created from previous columns already in the dataset. pandas has a few different ways to add new columns to a DataFrame.

In this recipe, we create new columns in the movie dataset by using the .assign method and then delete columns with the .drop method.

How to do it…

  1. One way to create a new column is to do an index assignment. Note that this will not return a new DataFrame but mutate the existing DataFrame. If you assign the column to a scalar value, it will use that value for every cell in the column. Let's create the has_seen column in the movie dataset to indicate whether or not we have seen the movie. We will assign zero for every value. By default, new columns are appended to the end:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movies["has_seen"] = 0
    
  2. While this method works and is common, as I find myself chaining methods very often, I prefer to use the .assign method instead. This will return a new DataFrame with the new column. Because it uses the parameter name as the column name, the column name must be a valid parameter name:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> idx_map = {
    ...     "Avatar": "Ratava",
    ...     "Spectre": "Ertceps",
    ...     "Pirates of the Caribbean: At World's End": "POC",
    ... }
    >>> col_map = {
    ...     "aspect_ratio": "aspect",
    ...     "movie_facebook_likes": "fblikes",
    ... }
    >>> (
    ...     movies.rename(
    ...         index=idx_map, columns=col_map
    ...     ).assign(has_seen=0)
    ... )
          color      director_name  ...  fblikes  has_seen
    0     Color      James Cameron  ...    33000         0
    1     Color     Gore Verbinski  ...        0         0
    2     Color         Sam Mendes  ...    85000         0
    3     Color  Christopher Nolan  ...   164000         0
    4       NaN        Doug Walker  ...        0         0
    ...     ...                ...  ...      ...       ...
    4911  Color        Scott Smith  ...       84         0
    4912  Color                NaN  ...    32000         0
    4913  Color   Benjamin Roberds  ...       16         0
    4914  Color        Daniel Hsia  ...      660         0
    4915  Color           Jon Gunn  ...      456         0
    
  3. There are several columns that contain data on the number of Facebook likes. Let's add up all actor and director Facebook like columns and assign them to the total_likes column. We can do this in a couple of ways.

    We can add each of the columns:

    >>> total = (
    ...     movies["actor_1_facebook_likes"]
    ...     + movies["actor_2_facebook_likes"]
    ...     + movies["actor_3_facebook_likes"]
    ...     + movies["director_facebook_likes"]
    ... )
    >>> total.head(5)
    0     2791.0
    1    46563.0
    2    11554.0
    3    95000.0
    4        NaN
    dtype: float64
    

    My preference is to use methods that we can chain, so I prefer calling .sum here. I will pass in a list of columns to select to .loc to pull out just those columns that I want to sum:

    >>> cols = [
    ...     "actor_1_facebook_likes",
    ...     "actor_2_facebook_likes",
    ...     "actor_3_facebook_likes",
    ...     "director_facebook_likes",
    ... ]
    >>> sum_col = movies.loc[:, cols].sum(axis="columns")
    >>> sum_col.head(5)
    0     2791.0
    1    46563.0
    2    11554.0
    3    95000.0
    4      274.0
    dtype: float64
    

    Then we can assign this Series to the new column. Note that when we called the + operator, the result had missing numbers (NaN), but the .sum method ignores missing numbers by default, so we get a different result:

    >>> movies.assign(total_likes=sum_col).head(5)
       color        direc/_name  ...  movie/likes  total/likes
    0  Color      James Cameron  ...        33000       2791.0
    1  Color     Gore Verbinski  ...            0      46563.0
    2  Color         Sam Mendes  ...        85000      11554.0
    3  Color  Christopher Nolan  ...       164000      95000.0
    4    NaN        Doug Walker  ...            0        274.0
    

    Another option is to pass in a function as the value of the parameter in the call to the .assign method. This function accepts a DataFrame as input and should return a Series:

    >>> def sum_likes(df):
    ...     return df[
    ...         [
    ...             c
    ...             for c in df.columns
    ...             if "like" in c
    ...             and ("actor" in c or "director" in c)
    ...         ]
    ...     ].sum(axis=1)
    >>> movies.assign(total_likes=sum_likes).head(5)
       color        direc/_name  ...  movie/likes  total/likes
    0  Color      James Cameron  ...        33000       2791.0
    1  Color     Gore Verbinski  ...            0      46563.0
    2  Color         Sam Mendes  ...        85000      11554.0
    3  Color  Christopher Nolan  ...       164000      95000.0
    4    NaN        Doug Walker  ...            0        274.0
    
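    If you prefer not to define a named function, a lambda works just as well here. This is only a small variant of the call above, and it reuses the cols list defined earlier in this step:

    >>> movies.assign(
    ...     total_likes=lambda df: df[cols].sum(axis="columns")
    ... ).head(5)

    The result is the same as the sum_likes version shown above.
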
  4. From the Calling Series methods recipe in this chapter, we know that this dataset contains missing values. When numeric columns are added to one another with the plus operator, as in the preceding step, the result is NaN wherever any value is missing. The .sum method, by contrast, skips missing values by default, effectively treating them as zero (a tiny illustration appears at the end of this step).

    Let's check if there are missing values in our new column using both methods:

    >>> (
    ...     movies.assign(total_likes=sum_col)["total_likes"]
    ...     .isna()
    ...     .sum()
    ... )
    0
    >>> (
    ...     movies.assign(total_likes=total)["total_likes"]
    ...     .isna()
    ...     .sum()
    ... )
    122
    

    We could fill in the missing values with zero as well:

    >>> (
    ...     movies.assign(total_likes=total.fillna(0))[
    ...         "total_likes"
    ...     ]
    ...     .isna()
    ...     .sum()
    ... )
    0
    
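    To see the two behaviors in isolation, here is a tiny, made-up frame (not part of the movie dataset). The + operator propagates the missing value, while .sum skips it:

    >>> tiny = pd.DataFrame(
    ...     {"a": [1.0, np.nan], "b": [10.0, 20.0]}
    ... )
    >>> tiny["a"] + tiny["b"]
    0    11.0
    1     NaN
    dtype: float64
    >>> tiny.sum(axis="columns")
    0    11.0
    1    20.0
    dtype: float64
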
  5. There is another column in the dataset named cast_total_facebook_likes. It would be interesting to see what percentage of this column comes from our newly created column, total_likes. Before we create our percentage column, let's do some basic data validation. We will ensure that cast_total_facebook_likes is greater than or equal to total_likes:
    >>> def cast_like_gt_actor(df):
    ...     return (
    ...         df["cast_total_facebook_likes"]
    ...         >= df["total_likes"]
    ...     )
    >>> df2 = movies.assign(
    ...     total_likes=total,
    ...     is_cast_likes_more=cast_like_gt_actor,
    ... )
    
  6. is_cast_likes_more is now a Boolean column. We can check whether every value in it is True using the .all method (a way to count the failures is shown after this step):
    >>> df2["is_cast_likes_more"].all()
    False
    
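    If you would rather know how many rows fail the check than just whether any do, one option (not part of the original recipe) is to negate the Boolean Series and sum it, since each True counts as 1:

    >>> (~df2["is_cast_likes_more"]).sum()

    This returns the number of movies whose total_likes exceeds cast_total_facebook_likes.
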
  7. It turns out that there is at least one movie with more total_likes than cast_total_facebook_likes. It could be that director Facebook likes are not part of the cast total likes. Let's backtrack and delete the total_likes column. We can use the .drop method with the columns parameter to do that:
    >>> df2 = df2.drop(columns="total_likes")
    
  8. Let's recreate a Series of just the total actor likes (an alternative column selection using .filter appears after this step):
    >>> actor_sum = movies[
    ...     [
    ...         c
    ...         for c in movies.columns
    ...         if "actor_" in c and "_likes" in c
    ...     ]
    ... ].sum(axis="columns")
    >>> actor_sum.head(5)
    0     2791.0
    1    46000.0
    2    11554.0
    3    73000.0
    4      143.0
    dtype: float64
    
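    If you prefer pattern matching over a list comprehension, the .filter method with a regular expression selects the same columns, assuming the actor like columns follow the actor_<n>_facebook_likes naming used in this dataset:

    >>> movies.filter(
    ...     regex=r"actor_\d_facebook_likes"
    ... ).sum(axis="columns").head(5)

    The values match actor_sum above.
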
  9. Check again whether all the values in cast_total_facebook_likes are greater than or equal to actor_sum. We can do this with the >= operator or the .ge method:
    >>> movies["cast_total_facebook_likes"] >= actor_sum
    0       True
    1       True
    2       True
    3       True
    4       True
            ... 
    4911    True
    4912    True
    4913    True
    4914    True
    4915    True
    Length: 4916, dtype: bool
    >>> movies["cast_total_facebook_likes"].ge(actor_sum)
    0       True
    1       True
    2       True
    3       True
    4       True
            ... 
    4911    True
    4912    True
    4913    True
    4914    True
    4915    True
    Length: 4916, dtype: bool
    >>> movies["cast_total_facebook_likes"].ge(actor_sum).all()
    True
    
  10. Finally, let's calculate the percentage of cast_total_facebook_likes that comes from actor_sum (an equivalent operator-only version follows this step):
    >>> pct_like = actor_sum.div(
    ...     movies["cast_total_facebook_likes"]
    ... ).mul(100)
    
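    The .div and .mul methods are used above because they chain cleanly; the plain operator form is equivalent (pct_like_alt is just a throwaway name for this illustration):

    >>> pct_like_alt = (
    ...     actor_sum / movies["cast_total_facebook_likes"] * 100
    ... )
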
  11. Let's validate that the minimum and maximum of this Series fall between 0 and 100 (a programmatic check follows this step):
    >>> pct_like.describe()
    count    4883.000000
    mean       83.327889
    std        14.056578
    min        30.076696
    25%        73.528368
    50%        86.928884
    75%        95.477440
    max       100.000000
    dtype: float64
    
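    If you would rather assert this programmatically than eyeball the .describe output, the .min and .max methods (which skip missing values) make the check explicit:

    >>> pct_like.min() >= 0 and pct_like.max() <= 100
    True
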
  12. We can then create a Series using the movie_title column as the index. The Series constructor lets us pass in both the values and an index:
    >>> pd.Series(
    ...     pct_like.to_numpy(), index=movies["movie_title"]
    ... ).head()
    movie_title
    Avatar                                         57.736864
    Pirates of the Caribbean: At World's End       95.139607
    Spectre                                        98.752137
    The Dark Knight Rises                          68.378310
    Star Wars: Episode VII - The Force Awakens    100.000000
    dtype: float64
    

How it works…

Many pandas operations are flexible, and column creation is one of them. This recipe assigns both a scalar value, as seen in step 1, and a Series, as seen in step 2, to create a new column.

Step 3 adds four different Series together with the plus operator and the .sum method. Step 4 uses method chaining to find and fill missing values. Step 5 uses the greater than or equal comparison operator to return a Boolean Series, which is then evaluated with the .all method in step 6 to check whether every single value is True or not.

The .drop method accepts the names of the rows or columns to delete. By default it drops rows by their index labels (axis defaults to 0 or 'index'). To drop columns, either set the axis parameter to 1 or 'columns', or pass the names to the columns parameter as we did in step 7.
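
For example, both of the following return a new DataFrame without the gross column (the column is chosen purely for illustration, and the output is omitted); neither modifies movies in place:

>>> movies.drop("gross", axis="columns")
>>> movies.drop(columns="gross")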

Steps 8 and 9 redo the work of steps 3 through 6, this time summing only the actor like columns and leaving out the director likes. Step 10 finally calculates the column we set out to create, and step 11 validates that the percentages fall between 0 and 100.

There's more…

It is possible to insert a new column into a specific location in a DataFrame with the .insert method. The .insert method takes the integer position of the new column as its first argument, the name of the new column as its second, and the values as its third. You will need to use the .get_loc Index method to find the integer location of the column name.

The .insert method modifies the calling DataFrame in place and returns None, so there is no assignment statement. For this reason, I prefer the .assign method to create new columns; if I need the columns in a particular order, I can pass an ordered list of column names to the index operator (or to .loc), as in the sketch below.
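
A quick sketch of that ordering approach, using columns we know exist in this dataset and keeping only a handful of them for brevity (ordered is just a throwaway name):

>>> ordered = (
...     movies.assign(profit=movies["gross"] - movies["budget"])
...     .loc[:, ["movie_title", "gross", "profit", "budget"]]
... )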

With .insert, the profit of each movie is calculated by subtracting budget from gross and placed immediately after the gross column:

>>> profit_index = movies.columns.get_loc("gross") + 1
>>> profit_index
9
>>> movies.insert(
...     loc=profit_index,
...     column="profit",
...     value=movies["gross"] - movies["budget"],
... )

An alternative to deleting columns with the .drop method is the del statement. Like .insert, it modifies the DataFrame in place and does not return a new DataFrame, so favor .drop over it:

>>> del movies["director_name"]
About the Authors
  • Matt Harrison

    Matt Harrison has been using Python since 2000. He runs MetaSnake, which provides corporate training for Python and Data Science. He is the author of Machine Learning Pocket Reference, the bestselling Illustrated Guide to Python 3, and Learning the Pandas Library, among other books.

  • Theodore Petrou

    Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
