Reader small image

You're reading from  Pandas 1.x Cookbook - Second Edition

Product typeBook
Published inFeb 2020
Reading LevelBeginner
PublisherPackt
ISBN-139781839213106
Edition2nd Edition
Languages
Tools
Right arrow
Authors (2):
Matt Harrison
Matt Harrison
author image
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

Theodore Petrou
Theodore Petrou
author image
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou

View More author details
Right arrow

Introduction

The roots of pandas lay in analyzing financial time series data. Time series are points of data gathered over time. Generally, the time is evenly spaced between each data point. However, there may be gaps in the observations. pandas includes functionality to manipulate dates, aggregate over different time periods, sample different periods of time, and more.

Understanding the difference between Python and pandas date tools

Before we get to pandas, it can help to be aware of and understand core Python's date and time functionality. The datetime module provides three data types: date, time, and datetime. Formally, a date is a moment in time consisting of just the year, month, and day. For instance, June 7, 2013 would be a date. A time consists of hours, minutes, seconds, and microseconds (one-millionth of a second) and is unattached to any date. An example of time would be 12 hours and 30 minutes. A datetime consists of both the elements of a date and time together.

On the other hand, pandas has a single object to encapsulate date and time called a Timestamp. It has nanosecond (one-billionth of a second) precision and is derived from NumPy's datetime64 data type. Both Python and pandas each have a timedelta object that is useful when doing date addition and subtraction.

In this recipe, we will first explore...

Slicing time series intelligently

DataFrame selection and slicing was covered previously. When the DataFrame has a DatetimeIndex, even more opportunities arise for selection and slicing.

In this recipe, we will use partial date matching to select and slice a DataFrame with a DatetimeIndex.

How to do it…

  1. Read in the Denver crimes dataset from the hdf5 file crimes.h5, and output the column data types and the first few rows. The hdf5 file format allows efficient storage of large amounts of data and is different from a CSV text file:
    >>> crime = pd.read_hdf('data/crime.h5', 'crime')
    >>> crime.dtypes
    OFFENSE_TYPE_ID              category
    OFFENSE_CATEGORY_ID          category
    REPORTED_DATE          datetime64[ns]
    GEO_LON                       float64
    GEO_LAT                       float64
    NEIGHBORHOOD_ID              category
    IS_CRIME                        int64
    IS_TRAFFIC                      int64
    dtype...

Filtering columns with time data

The last section showed how to filter data that has a DatetimeIndex. Often, you will have columns with dates in them, and it does not make sense to have that column be the index. In this section, we will reproduce the slicing of the preceding section with columns. Sadly, the slicing constructs do not work on columns, so we will have to take a different tack.

How to do it…

  1. Read in the Denver crimes dataset from the hdf5 file crimes.h5 and inspect the column types:
    >>> crime = pd.read_hdf('data/crime.h5', 'crime')
    >>> crime.dtypes
    OFFENSE_TYPE_ID              category
    OFFENSE_CATEGORY_ID          category
    REPORTED_DATE          datetime64[ns]
    GEO_LON                       float64
    GEO_LAT                       float64
    NEIGHBORHOOD_ID              category
    IS_CRIME                        int64
    IS_TRAFFIC                      int64
    dtype: object
    
  2. Select all the rows where...

Using methods that only work with a DatetimeIndex

There are a number of DataFrame and Series methods that only work with a DatetimeIndex. If the index is of any other type, these methods will fail.

In this recipe, we will first use methods to select rows of data by their time component. We will then learn about the powerful DateOffset objects and their aliases.

How to do it…

  1. Read in the crime hdf5 dataset, set the index as REPORTED_DATE, and ensure that we have a DatetimeIndex:
    >>> crime = (pd.read_hdf('data/crime.h5', 'crime') 
    ...     .set_index('REPORTED_DATE')
    ... )
    >>> type(crime.index)
    <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
    
  2. Use the .between_time method to select all crimes that occurred between 2 A.M. and 5 A.M., regardless of the date:
    >>> crime.between_time('2:00', '5:00', include_end=False)
                  ...

Counting the number of weekly crimes

The Denver crime dataset is huge, with over 460,000 rows each marked with a reported date. Counting the number of weekly crimes is one of many queries that can be answered by grouping according to some period of time. The .resample method provides an easy interface to grouping by any possible span of time.

In this recipe, we will use both the .resample and .groupby methods to count the number of weekly crimes.

How to do it…

  1. Read in the crime hdf5 dataset, set the index as the REPORTED_DATE, and then sort it to increase performance for the rest of the recipe:
    >>> crime_sort = (pd.read_hdf('data/crime.h5', 'crime') 
    ...     .set_index('REPORTED_DATE') 
    ...     .sort_index()
    ... )
    
  2. To count the number of crimes per week, we need to form a group for each week. The .resample method takes a DateOffset object or alias and returns an object ready to perform an action...

Aggregating weekly crime and traffic accidents separately

The Denver crime dataset has all crime and traffic accidents together in one table, and separates them through the binary columns: IS_CRIME and IS_TRAFFIC. The .resample method allows you to group by a period of time and aggregate specific columns separately.

In this recipe, we will use the .resample method to group by each quarter of the year and then sum up the number of crimes and traffic accidents separately.

How to do it…

  1. Read in the crime hdf5 dataset, set the index as REPORTED_DATE, and then sort it to increase performance for the rest of the recipe:
    >>> crime = (pd.read_hdf('data/crime.h5', 'crime') 
    ...     .set_index('REPORTED_DATE') 
    ...     .sort_index()
    ... )
    
  2. Use the .resample method to group by each quarter of the year and then sum the IS_CRIME and IS_TRAFFIC columns for each group:
    >>> (crime
    ...     .resample...

Measuring crime by weekday and year

Measuring crimes by weekday and by year simultaneously requires the functionality to pull this information from a Timestamp. Thankfully, this functionality is built into any Timestamp column with the .dt attribute.

In this recipe, we will use the .dt attribute to provide us with both the weekday name and year of each crime as a Series. We count all of the crimes by forming groups using both of these Series. Finally, we adjust the data to consider partial years and population before creating a heatmap of the total amount of crime.

How to do it…

  1. Read in the Denver crime hdf5 dataset leaving the REPORTED_DATE as a column:
    >>> crime = pd.read_hdf('data/crime.h5', 'crime')
    >>> crime
                             OFFEN/PE_ID  ... IS_TRAFFIC
    0          traffic-accident-dui-duid  ...          1
    1         vehicular-eluding-no-chase  ...          0
    2               disturbing-the-peace  .....

Grouping with anonymous functions with a DatetimeIndex

Using DataFrames with a DatetimeIndex opens the door to many new and different operations as seen with several recipes in this chapter.

In this recipe, we will show the versatility of using the .groupby method for DataFrames that have a DatetimeIndex.

How to do it…

  1. Read in the Denver crime hdf5 file, place the REPORTED_DATE column in the index, and sort it:
    >>> crime = (pd.read_hdf('data/crime.h5', 'crime') 
    ...    .set_index('REPORTED_DATE') 
    ...    .sort_index()
    ... )
    
  2. The DatetimeIndex has many of the same attributes and methods as a pandas Timestamp. Let's take a look at some that they have in common:
    >>> common_attrs = (set(dir(crime.index)) & 
    ...     set(dir(pd.Timestamp)))
    >>> [attr for attr in common_attrs if attr[0] != '_']
    ['tz_convert', 'is_month_start', 'nanosecond...

Grouping by a Timestamp and another column

The .resample method is unable to group by anything other than periods of time. The .groupby method, however, has the ability to group by both periods of time and other columns.

In this recipe, we will show two very similar but different approaches to group by Timestamps and another column.

How to do it…

  1. Read in the employee dataset, and create a DatetimeIndex with the HIRE_DATE column:
    >>> employee = pd.read_csv('data/employee.csv',
    ...     parse_dates=['JOB_DATE', 'HIRE_DATE'],
    ...     index_col='HIRE_DATE')
    >>> employee
                UNIQUE_ID  ...   JOB_DATE
    HIRE_DATE              ...
    2006-06-12          0  ... 2012-10-13
    2000-07-19          1  ... 2010-09-18
    2015-02-03          2  ... 2015-02-03
    1982-02-08          3  ... 1991-05-25
    1989-06-19          4  ... 1994-10-22
    ...               ...  ...        ...
    2014-06-09       1995  ... 2015-06-09...
lock icon
The rest of the chapter is locked
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020Publisher: PacktISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Authors (2)

author image
Matt Harrison

Matt Harrison is an author, speaker, corporate trainer, and consultant. He authored the popular Learning the Pandas Library and Illustrated Guide to Python 3. He runs MetaSnake, which provides corporate and online training on Python and Data Science. In addition, he offers consulting services. He has worked on search engines, configuration management, storage, BI, predictive modeling, and in a variety of domains.
Read more about Matt Harrison

author image
Theodore Petrou

Theodore Petrou is the founder of Dunder Data, a training company dedicated to helping teach the Python data science ecosystem effectively to individuals and corporations. Read his tutorials and attempt his data science challenges at the Dunder Data website.
Read more about Theodore Petrou