You're reading from  Pandas 1.x Cookbook - Second Edition

Product type: Book
Published in: Feb 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781839213106
Edition: 2nd Edition
Authors (2):
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.

Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.

Ordering column names

One of the first tasks after importing a dataset as a DataFrame is to review the order of its columns. Readers of left-to-right languages tend to scan a table from left to right, so column order affects how the data is interpreted. Information is far easier to find and interpret when the columns are arranged deliberately.

There is no standardized set of rules that dictates how columns should be organized within a dataset. However, it is good practice to develop a set of guidelines and follow it consistently. This is especially true if you work with a group of analysts who share many datasets.

The following is one guideline for ordering columns:

  • Classify each column as either categorical or continuous
  • Group related columns together within the categorical columns and within the continuous columns
  • Place the most important groups of columns first, with categorical columns before continuous ones

This recipe shows you how to order the columns with this guideline. There are many possible orderings that are sensible.
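
As a first pass at the classification step, the column dtypes can hint at which columns are categorical and which are continuous. The sketch below uses a small invented DataFrame (not the movie dataset) and assumes that numeric columns are continuous and everything else is categorical. That is only a heuristic: a numeric column such as title_year may still be better treated as categorical, as this recipe does.

```python
import pandas as pd

# Toy data invented for illustration; not the movie dataset
df = pd.DataFrame(
    {
        "gross": [10.5, 20.0],
        "movie_title": ["A", "B"],
        "duration": [100, 120],
        "director_name": ["X", "Y"],
    }
)

# Heuristic (an assumption): numeric columns are continuous,
# all remaining columns are categorical
cont_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

print(cat_cols)
print(cont_cols)
```

The resulting lists are only a starting point; the final grouping still needs a manual review.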

How to do it...

  1. Read in the movie dataset and shorten the long column names:
    >>> import pandas as pd
    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     # Abbreviate verbose column-name fragments
    ...     return col.replace("facebook_likes", "fb").replace(
    ...         "_for_reviews", ""
    ...     )
    >>> movies = movies.rename(columns=shorten)
    
  2. Output all the column names and scan for similar categorical and continuous columns:
    >>> movies.columns
    Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
           'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
           'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
           'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
           'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
           'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
           'movie_fb'],
          dtype='object')
    
  3. The columns don't appear to follow any logical order. Organize the names into lists so that the guideline given at the start of this recipe is followed:
    >>> cat_core = [
    ...     "movie_title",
    ...     "title_year",
    ...     "content_rating",
    ...     "genres",
    ... ]
    >>> cat_people = [
    ...     "director_name",
    ...     "actor_1_name",
    ...     "actor_2_name",
    ...     "actor_3_name",
    ... ]
    >>> cat_other = [
    ...     "color",
    ...     "country",
    ...     "language",
    ...     "plot_keywords",
    ...     "movie_imdb_link",
    ... ]
    >>> cont_fb = [
    ...     "director_fb",
    ...     "actor_1_fb",
    ...     "actor_2_fb",
    ...     "actor_3_fb",
    ...     "cast_total_fb",
    ...     "movie_fb",
    ... ]
    >>> cont_finance = ["budget", "gross"]
    >>> cont_num_reviews = [
    ...     "num_voted_users",
    ...     "num_user",
    ...     "num_critic",
    ... ]
    >>> cont_other = [
    ...     "imdb_score",
    ...     "duration",
    ...     "aspect_ratio",
    ...     "facenumber_in_poster",
    ... ]
    
  4. Concatenate all the lists together to get the final column order. Also, ensure that this list contains all the columns from the original:
    >>> new_col_order = (
    ...     cat_core
    ...     + cat_people
    ...     + cat_other
    ...     + cont_fb
    ...     + cont_finance
    ...     + cont_num_reviews
    ...     + cont_other
    ... )
    >>> set(movies.columns) == set(new_col_order)
    True
    
  5. Pass the list with the new column order to the indexing operator of the DataFrame to reorder the columns:
    >>> movies[new_col_order].head()
       movie_title  title_year  ... aspect_ratio facenumber_in_poster
    0       Avatar      2009.0  ...         1.78                  0.0
    1  Pirates ...      2007.0  ...         2.35                  0.0
    2      Spectre      2015.0  ...         2.35                  1.0
    3  The Dark...      2012.0  ...         2.35                  0.0
    4  Star War...         NaN  ...          NaN                  0.0
    

How it works...

You can select a subset of columns from a DataFrame by passing a list of column names to the indexing operator. For instance, movies[['movie_title', 'director_name']] creates a new DataFrame with only the movie_title and director_name columns, in that order. Selecting columns by name is the default behavior of the indexing operator for a pandas DataFrame.
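
A minimal sketch of this behavior, using a tiny invented DataFrame rather than the movie dataset: a list of names returns a new DataFrame whose columns follow the order of the list, while a single name returns a Series.

```python
import pandas as pd

# Toy data invented for illustration
df = pd.DataFrame(
    {
        "gross": [1.0, 2.0],
        "movie_title": ["A", "B"],
        "director_name": ["X", "Y"],
    }
)

# A list of names selects columns in the order the list gives them
subset = df[["movie_title", "director_name"]]
print(subset.columns.tolist())

# A single name (not wrapped in a list) returns a Series instead
title = df["movie_title"]
print(type(title).__name__)

# The original DataFrame keeps its own column order
print(df.columns.tolist())
```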

Step 3 neatly organizes all of the column names into separate lists based on their type (categorical or continuous) and by how similar their data is. The most important columns, such as the title of the movie, are placed first.

Step 4 concatenates all of the lists of column names and validates that the new list contains exactly the same values as the original column names. Python sets are unordered, so the equality comparison checks only whether each member of one set is a member of the other. Manually ordering columns as in this recipe is susceptible to human error, as it's easy to forget a column in the new column list.
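
Set equality reports only True or False, and because sets discard duplicates it cannot tell whether a name was listed twice. The following sketch shows one possible, more informative check. The helper name check_column_order and the sample lists are hypothetical and not part of the recipe.

```python
def check_column_order(original, proposed):
    """Return (missing, extra, duplicated) column names."""
    # Names in the original that the proposed order forgot
    missing = sorted(set(original) - set(proposed))
    # Names in the proposed order that do not exist in the original
    extra = sorted(set(proposed) - set(original))
    # Names that appear more than once in the proposed order
    seen = set()
    duplicated = []
    for name in proposed:
        if name in seen and name not in duplicated:
            duplicated.append(name)
        seen.add(name)
    return missing, extra, duplicated


original = ["color", "gross", "budget", "movie_title"]

# A duplicate slips past the set comparison from step 4
with_dup = ["movie_title", "color", "color", "gross", "budget"]
sets_equal = set(original) == set(with_dup)
dup_result = check_column_order(original, with_dup)

# A misspelled name shows up as both missing and extra
with_typo = ["movie_title", "color", "gross", "bugdet"]
typo_result = check_column_order(original, with_typo)

print(sets_equal, dup_result)
print(typo_result)
```

A clean reordering returns three empty lists.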

Step 5 completes the reordering by passing the new column order as a list to the indexing operator. This new order is now much more sensible than the original.
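
Note that the indexing operator returns a new DataFrame, so the result must be assigned to a name to keep the new order. The sketch below, on invented toy data, also contrasts the indexing operator with DataFrame.reindex, which is an alternative not used in this recipe: a misspelled name raises a KeyError with the indexing operator, whereas reindex silently creates a column of missing values.

```python
import pandas as pd

# Toy data invented for illustration
df = pd.DataFrame({"gross": [1.0, 2.0], "movie_title": ["A", "B"]})

# Assign the result; df itself is left unchanged
reordered = df[["movie_title", "gross"]]

typo_order = ["movie_title", "gros"]  # deliberate misspelling

# The indexing operator refuses unknown column names
try:
    df[typo_order]
    raised = False
except KeyError:
    raised = True

# reindex accepts them and fills the new column with NaN
silent = df.reindex(columns=typo_order)

print(reordered.columns.tolist())
print(raised)
print(silent["gros"].isna().all())
```

For manual reordering, the stricter behavior of the indexing operator is arguably the safer choice.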

There's more...

There are alternative guidelines for ordering columns besides the one suggested earlier. Hadley Wickham's seminal paper on Tidy Data suggests placing the fixed variables first, followed by measured variables. As this data does not come from a controlled experiment, there is some flexibility in determining which variables are fixed and which ones are measured. Good candidates for measured variables are those that we would like to predict, such as gross, budget, or imdb_score. In this ordering, categorical and continuous variables can be mixed. For instance, it might make more sense to place the column for the number of Facebook likes directly after the name of that actor. You can, of course, come up with your own guidelines for column order, since the order does not affect any computation.
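
The idea of placing each person's Facebook likes directly after that person's name can be sketched by building the column list programmatically. The column names are the shortened ones from this recipe; the people list and the loop are illustrative assumptions, not part of the original recipe.

```python
# Prefixes shared by the name and Facebook-like columns
people = ["director", "actor_1", "actor_2", "actor_3"]

# Interleave each name column with its matching likes column
people_cols = []
for person in people:
    people_cols += [f"{person}_name", f"{person}_fb"]

print(people_cols)
```

The resulting list could replace cat_people and the matching entries of cont_fb when concatenating the final order.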
