Acquiring and Processing Time Series Data

In the previous chapter, we learned what a time series is and established a few standard notations and terminologies. Now, let's switch tracks from theory to practice. Although we said time series data is everywhere, we have yet to work with a real time series dataset. In this chapter, we are going to get our hands dirty, start working on the dataset we will use throughout this book, process it the right way, and learn a few techniques for dealing with missing values.

In this chapter, we will cover the following topics:

  • Understanding the time series dataset
  • pandas datetime operations, indexing, and slicing – a refresher
  • Handling missing data
  • Mapping additional information
  • Saving and loading files to disk
  • Handling longer periods of missing data

Technical requirements

You will need to set up the Anaconda environment following the instructions in the Preface of the book to get a working environment with all the packages and datasets required for the code in this book.

The code for this chapter can be found at https://github.com/PacktPublishing/Modern-Time-Series-Forecasting-with-Python-/tree/main/notebooks/Chapter02.

Handling time series data is much like handling any other tabular dataset, but with an added focus on the temporal dimension. As with any tabular dataset, pandas is perfectly equipped to handle time series data.

Let's dive in and work through a dataset from the beginning. We are going to use the London Smart Meters dataset throughout this book. If you have not already downloaded the data as part of the environment setup, go to the Preface and do so now.

Understanding the time series dataset

This is the key first step with any new dataset you come across, even before Exploratory Data Analysis (EDA), which we will cover in Chapter 3, Analyzing and Visualizing Time Series Data. Understanding where the data comes from, the data generating process behind it, and the source domain is essential to working with the dataset effectively.

The London Data Store, a free and open data-sharing portal, provided this dataset; it was collected and enriched by Jean-Michel D and uploaded to Kaggle.

The dataset contains energy consumption readings for a sample of 5,567 London households that took part in the UK Power Networks-led Low Carbon London project between November 2011 and February 2014. Readings were taken at half-hourly intervals. Some metadata about the households is also available. Let's look at what it includes:

  • CACI UK segmented the UK’s population...

pandas datetime operations, indexing, and slicing – a refresher

Instead of using our dataset, which is slightly complex, let’s pick an easy, well-formatted stock exchange price dataset from the UCI Machine Learning Repository and look at the functionality of pandas:

import pandas as pd

df = pd.read_excel(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00247/data_akbilgic.xlsx",
    skiprows=1,
)

The DataFrame that we read looks as follows:

Figure 2.2 – The DataFrame with stock exchange prices

Now that we have read the DataFrame, let’s start manipulating it.

Converting the date columns into pd.Timestamp/DatetimeIndex

First, we must convert the date column (which may not always be parsed as dates automatically by pandas) into pandas datetime. For that, pandas has a handy function called pd.to_datetime. It infers the datetime format automatically and converts the input into a pd.Timestamp if the input is a string, or into a DatetimeIndex if the input is a list or Series of strings.
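As a quick, minimal sketch of this in action on the DataFrame we just read (assuming the date column in the downloaded file is named date), we can convert the column, set it as the index, and start slicing by dates:

import pandas as pd

# The column name "date" is assumed from the downloaded file
df["date"] = pd.to_datetime(df["date"])

# Setting the datetime column as the index gives us a DatetimeIndex...
df = df.set_index("date")

# ...which enables date-based slicing with partial strings
jan_2010 = df.loc["2010-01"]  # all rows from January 2010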

Handling missing data

While dealing with large datasets in the wild, you are bound to encounter missing data. If it is not in the time series itself, it may be in the additional information you collect and map. Before we jump the gun and fill it with a mean value or drop those rows, let's think about a few aspects:

  • The first consideration should be whether the data we are worried about is really missing. For that, we need to think about the Data Generating Process (DGP) – the process that is generating the time series. As an example, let's look at sales at a local supermarket. You have been given the point-of-sale (POS) transactions for the last 2 years and you are processing the data into a time series. While analyzing the data, you find that there are a few products with no transactions at all for a few days. Now, what you need to think about is whether that data is truly missing or whether there is some information in this missingness...
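Once we are satisfied that the values are genuinely missing, pandas gives us a few simple imputation techniques out of the box (we will lean on these again later in the chapter). Here is a minimal sketch on a small, hypothetical half-hourly series:

import numpy as np
import pandas as pd

# A hypothetical half-hourly series with a few missing readings
idx = pd.date_range("2012-01-01", periods=6, freq="30min")
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0], index=idx)

s.ffill()        # forward fill: propagate the last valid reading
s.bfill()        # backward fill: use the next valid reading
s.interpolate()  # linear interpolation between valid readings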

Mapping additional information

From the data model that we prepared earlier, we know that there are three key files that we have to map: Household Information, Weather, and Bank Holidays.

The informations_households.csv file contains metadata about the households. These are static features that do not depend on time. For this, we just need to left merge informations_households.csv into the compact form based on LCLid, which is the time series identifier.

Best practice

While doing a pandas merge, one of the most common and unexpected outcomes is that the number of rows before and after the operation is not the same (even for a left merge). This typically happens because there are duplicates in the keys on which you are merging. As a best practice, you can use the validate parameter of the pandas merge, which takes inputs such as one_to_one and many_to_one, so that this check is done while merging and an error is thrown if the assumption is not met. For more...
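A minimal sketch of such a guarded merge (the DataFrame contents here are assumptions for illustration; LCLid is the real key column from the dataset):

import pandas as pd

# Hypothetical stand-ins for the compact time series frame and the
# metadata read from informations_households.csv
compact_df = pd.DataFrame({"LCLid": ["MAC000002", "MAC000003"], "series_length": [100, 100]})
household_info = pd.DataFrame({"LCLid": ["MAC000002", "MAC000003"], "Acorn": ["ACORN-A", "ACORN-A"]})

# validate="many_to_one" raises a MergeError if LCLid is duplicated
# on the right-hand side, instead of silently multiplying rows
merged_df = compact_df.merge(household_info, on="LCLid", how="left", validate="many_to_one")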

Saving and loading files to disk

The fully merged DataFrame in its compact form takes up only ~10 MB. But saving this file requires a little bit of engineering. If we try to save the file in CSV format, it will not work because of the way we have stored arrays in pandas columns (since the data is in its compact form). We can save it in pickle or parquet format, or any other binary form of file storage. This can work, depending on the amount of RAM available on our machine. Although the fully merged DataFrame is just ~10 MB, saving it in pickle format will make the size explode to ~15 GB.
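As a quick illustration, here is a minimal parquet round trip with an array-valued column (the toy frame and file path are assumptions; parquet support requires pyarrow or fastparquet):

import numpy as np
import pandas as pd

# A tiny compact-form frame: one row per series, readings stored as an array
df = pd.DataFrame({
    "LCLid": ["MAC000002"],
    "energy_consumption": [np.arange(5, dtype=float)],
})

# Parquet can serialize the array-valued column, which a plain
# CSV writer would flatten into an unusable string
df.to_parquet("compact_sample.parquet")
restored = pd.read_parquet("compact_sample.parquet")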

What we can do is save it as a text file, with a few tweaks to accommodate the column names, column types, and other metadata required to read the file back into memory. The resulting file size on disk still comes out to ~15 GB, but since we write it out as an I/O operation, we are not keeping all of that data in memory at once. We call this the time series (.ts) format. The functions...

Handling longer periods of missing data

We saw some techniques for handling missing data earlier – forward and backward filling, interpolation, and so on. These techniques usually work when only one or two consecutive data points are missing, but if a large section of data is missing, such simple techniques fall short.
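As one illustration of the kind of technique that copes better with long gaps, here is a hedged sketch of a seasonal-profile imputation – filling each missing reading with the average of the same half-hour slot across the rest of the series. This is a representative idea, not necessarily the exact method used in the notebook:

import numpy as np
import pandas as pd

# A hypothetical half-hourly series with a full day missing
idx = pd.date_range("2012-01-01", periods=48 * 7, freq="30min")
rng = np.random.default_rng(42)
s = pd.Series(rng.random(len(idx)), index=idx)
s.iloc[96:144] = np.nan  # one whole day of missing readings

# Fill each gap with the mean of the same half-hour slot
# computed over the rest of the series
profile = s.groupby(s.index.time).transform("mean")
s_filled = s.fillna(profile)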

Notebook alert

To follow along with the complete code for missing data imputation, use the 03 - Handling Missing Data (Long Gaps).ipynb notebook in the Chapter02 folder.

Let's read the blocks 0-7 parquet file from disk:

block_df = pd.read_parquet("data/london_smart_meters/preprocessed/london_smart_meters_merged_block_0-7.parquet")

The data that we have saved is in compact form. We need to convert it into expanded form because time series data is easier to work with in that form. Since we only need a subset of the time series (for faster demonstration purposes), we are just extracting a single block from this file. To convert compact form...
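The helper functions used in the book live in the accompanying notebook; conceptually, the conversion expands each array-valued row against a reconstructed DatetimeIndex. A rough sketch, assuming the compact frame stores the readings and a start timestamp per series (the column names here are assumptions):

import numpy as np
import pandas as pd

def compact_to_expanded(row, freq="30min"):
    # A hypothetical sketch, not the book's helper function
    values = np.asarray(row["energy_consumption"])
    index = pd.date_range(row["start_timestamp"], periods=len(values), freq=freq)
    return pd.DataFrame(
        {"LCLid": row["LCLid"], "energy_consumption": values}, index=index
    )

# Expand a single series from the block for faster demonstration
expanded_df = compact_to_expanded(block_df.iloc[0])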

Summary

After a short refresher on pandas DataFrames – especially datetime manipulation and simple techniques for handling missing data – we learned about the two forms of storing and working with time series data: compact and expanded. With all this knowledge, we took our raw dataset and built a pipeline to convert it into compact form. If you have run the accompanying notebook, you should have the preprocessed dataset saved to disk. We also took an in-depth look at some techniques for handling long gaps of missing data.

Now that we have the processed datasets, in the next chapter, we will learn how to visualize and analyze a time series dataset.
