Introduction

It is important to consider the steps that you, as an analyst, take when you first encounter a dataset after importing it into your workspace as a DataFrame. Is there a set of tasks that you usually undertake to examine the data? Are you aware of all the possible data types? This chapter begins by covering the tasks you might want to undertake when first encountering a new dataset. It then answers common questions about operations that are not straightforward to accomplish in pandas.

Developing a data analysis routine

Although there is no standard approach when beginning a data analysis, it is typically a good idea to develop a routine for yourself when first examining a dataset. Similar to everyday routines that we have for waking up, showering, going to work, eating, and so on, a data analysis routine helps you to quickly get acquainted with a new dataset. This routine can manifest itself as a dynamic checklist of tasks that evolves as your familiarity with pandas and data analysis expands.

Exploratory Data Analysis (EDA) is a term used to describe the process of analyzing datasets. Typically, it does not involve model creation; rather, it focuses on summarizing the characteristics of the data and visualizing them. The approach is not new: John Tukey promoted it in his 1977 book Exploratory Data Analysis.

Many of these same processes are still applicable and useful for understanding a dataset. Indeed, they can also help with creating machine learning models later on.
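
A first pass over a new dataset might look something like the following sketch. It loads the college dataset used later in this chapter; exactly which checks belong on your personal checklist is a matter of taste:

>>> import pandas as pd
>>> college = pd.read_csv("data/college.csv")
>>> college.shape         # how many rows and columns?
>>> college.dtypes        # which type did pandas infer for each column?
>>> college.isna().sum()  # how many values are missing, per column?
>>> college.describe()    # summary statistics for the numeric columns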

Data dictionaries

A crucial part of data analysis involves creating and maintaining a data dictionary. A data dictionary is a table of metadata and notes on each column of data. One of the primary purposes of a data dictionary is to explain the meaning of the column names. The college dataset uses a lot of abbreviations that are likely to be unfamiliar to an analyst who is inspecting it for the first time.

A data dictionary for the college dataset is provided in the following college_data_dictionary.csv file:

>>> pd.read_csv("data/college_data_dictionary.csv")
    column_name  description
0        INSTNM  Institut...
1          CITY  City Loc...
2        STABBR  State Ab...
3          HBCU  Historic...
4       MENONLY  0/1 Men ...
..          ...          ...
22      PCTPELL  Percent ...
23     PCTFLOAN  Percent ...
24      UG25ABV  Percent ...
25  MD_EARN_...  Median E...
26  GRAD_DEB...  Median d...

As you can see, it is immensely helpful in deciphering the abbreviated column names.

Reducing memory by changing data types

pandas has precise technical definitions for many data types. However, when you load data from type-less formats such as CSV, pandas has to infer the type.

This recipe changes the data type of one of the object columns from the college dataset to the special pandas categorical data type to drastically reduce its memory usage.

How to do it…

  1. After reading in our college dataset, we select a few columns of different data types that will clearly show how much memory may be saved:
    >>> college = pd.read_csv("data/college.csv")
    >>> different_cols = [
    ...     "RELAFFIL",
    ...     "SATMTMID",
    ...     "CURROPER",
    ...     "INSTNM",
    ...     "STABBR",
    ... ]
    >>> col2 = college.loc[:, different_cols]
    >>> col2.head()
       RELAFFIL  SATMTMID  ...       INSTNM STABBR
    0         0     420.0  ...  Alabama ...     AL
    1         0     565.0...
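
The rest of the recipe converts object columns to the categorical type. As a rough sketch of the payoff (exact byte counts depend on your pandas version and data), compare per-column memory before and after converting STABBR, a low-cardinality string column:

>>> col2.memory_usage(deep=True)  # bytes per column, counting string contents
>>> stabbr_cat = col2["STABBR"].astype("category")
>>> stabbr_cat.memory_usage(deep=True)  # integer codes plus one copy of each unique state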

Selecting the smallest of the largest

This recipe can be used to create catchy news headlines such as Out of the Top 100 Universities, These 5 Have the Lowest Tuition, or From the Top 50 Cities to Live, These 10 Are the Most Affordable.

During analysis, it is possible that you will first need to find a grouping of data that contains the top n values in a single column and, from this subset, find the bottom m values based on a different column.

In this recipe, we find the five lowest-budget movies from the top 100 scoring movies by taking advantage of the convenience methods .nlargest and .nsmallest (a sketch of the full chain follows the first step below).

How to do it…

  1. Read in the movie dataset, and select the columns: movie_title, imdb_score, and budget:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> movie2 = movie[["movie_title", "imdb_score", "budget"]]
    >>> movie2.head()
       movie_title  imdb_score       budget
    0       Avatar    ...
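
The remaining steps chain the two convenience methods together; the full chain also appears in the Replicating nlargest with sort_values recipe that follows:

>>> (
...     movie2
...     .nlargest(100, "imdb_score")  # top 100 movies by score
...     .nsmallest(5, "budget")       # of those, the five with the smallest budgets
... )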

Selecting the largest of each group by sorting

One of the most basic and common operations to perform during data analysis is to select rows containing the largest value of some column within a group. For instance, this would be like finding the highest-rated film of each year or the highest-grossing film by content rating. To accomplish this task, we need to sort the groups as well as the column used to rank each member of the group, and then extract the highest member of each group.

In this recipe, we will find the highest-rated film of each year.

How to do it…

  1. Read in the movie dataset and slim it down to just the three columns we care about: movie_title, title_year, and imdb_score:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> movie[["movie_title", "title_year", "imdb_score"]]
                                         movie_title  ...
    0                                         Avatar  ...
    1 ...
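
The approach outlined above, sorting and then keeping the first row per group, can be sketched with .sort_values and .drop_duplicates. This is one common way to do it, not necessarily the book's exact steps:

>>> (
...     movie[["movie_title", "title_year", "imdb_score"]]
...     .sort_values(["title_year", "imdb_score"], ascending=False)  # best-rated film first within each year
...     .drop_duplicates(subset="title_year")  # keep only that first (highest-rated) row per year
... )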

Replicating nlargest with sort_values

The previous two recipes achieve similar results by sorting values in slightly different ways. Finding the top n values of a column of data is equivalent to sorting the entire column in descending order and taking the first n values. pandas has many operations that are capable of doing this in a variety of ways.

In this recipe, we will replicate the Selecting the smallest of the largest recipe with the .sort_values method and explore the differences between the two.

How to do it…

  1. Let's recreate the result from the final step of the Selecting the smallest of the largest recipe:
    >>> movie = pd.read_csv("data/movie.csv")
    >>> (
    ...     movie[["movie_title", "imdb_score", "budget"]]
    ...     .nlargest(100, "imdb_score")
    ...     .nsmallest(5, "budget")
    ... )
                   movie_title  imdb_score    budget
    4804        Butterfly Girl   ...
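
A sketch of the equivalent operation built from .sort_values and .head; note that ties in imdb_score may be broken differently here than by .nlargest, which is one of the differences this recipe goes on to explore:

>>> (
...     movie[["movie_title", "imdb_score", "budget"]]
...     .sort_values("imdb_score", ascending=False)
...     .head(100)              # top 100 by score
...     .sort_values("budget")
...     .head(5)                # cheapest five of those
... )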

Calculating a trailing stop order price

There are many strategies for trading stocks. One basic type of trade that many investors employ is the stop order. A stop order is an order placed by an investor to buy or sell a stock that executes whenever the market price reaches a certain point. Stop orders are useful both to prevent huge losses and to protect gains.

For this recipe, we will only be examining stop orders used to sell currently owned stocks. In a typical stop order, the price does not change throughout the lifetime of the order. For instance, if you purchased a stock for $100 per share, you might want to set a stop order at $90 per share to limit your downside to 10%.

A more advanced strategy is to continually modify the sale price of the stop order to track the value of the stock as it increases. This is called a trailing stop order. Concretely, if the same $100 stock increases to $120, then a trailing stop order 10% below the current market value would move the sale price up to $108.
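
A trailing stop price is just a running maximum of the price scaled by the trailing percentage, which pandas expresses neatly with .cummax. A minimal sketch on a made-up price series (the numbers are illustrative only; the recipe itself works with real stock quotes):

>>> import pandas as pd
>>> close = pd.Series([100, 105, 120, 115, 110])  # hypothetical closing prices
>>> stop = close.cummax() * 0.9  # trailing stop 10% below the running high
>>> stop
0     90.0
1     94.5
2    108.0
3    108.0
4    108.0
dtype: float64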
