You're reading from Time Series Analysis with Python Cookbook

Product type: Book
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781801075541
Edition: 1st
Author: Tarek A. Atwan

Tarek A. Atwan is a data analytics expert with over 16 years of international consulting experience, providing subject matter expertise in data science, machine learning operations, data engineering, and business intelligence. He has taught multiple hands-on coding boot camps, courses, and workshops on various topics, including data science, data visualization, Python programming, time series forecasting, and blockchain at various universities in the United States. He is regarded as a data science mentor and advisor, working with executive leaders in numerous industries to solve complex problems using a data-driven approach.

Chapter 7: Handling Missing Data

As a data scientist, data analyst, or business analyst, you have probably discovered that obtaining a perfect clean dataset is too optimistic. What is more common, though, is that the data you are working with suffers from flaws such as missing values, erroneous data, duplicate records, insufficient data, or the presence of outliers in the data.

Time series data is no different, and before plugging the data into any analysis or modeling workflow, you must investigate the data first. It is vital to understand the business context around the time series data to detect and identify these problems successfully. For example, if you work with stock data, the context is very different from COVID data or sensor data.

Having that intuition or domain knowledge will allow you to anticipate what to expect and what is considered acceptable when analyzing the data. Always try to understand the business context around the data. For example, why is the data...

Technical requirements

You can download the Jupyter notebooks and the requisite datasets from the book's GitHub repository to follow along.

In this chapter and beyond, you will extensively use pandas 1.4.2 (released April 2, 2022). There are five additional libraries that you will be using:

  • NumPy (≥ 1.20.3)
  • Matplotlib (≥ 3.5.0)
  • statsmodels (≥ 0.11.0)
  • scikit-learn (≥ 1.0.1)
  • SciPy (≥ 1.7.1)

If you are using pip, then you can install these packages from your terminal with the following command:

pip install matplotlib numpy statsmodels scikit-learn scipy

If you are using conda, then you can install these packages with the following command:

conda install matplotlib...

Understanding missing data

Data can be missing for a variety of reasons, such as unexpected power outages, a device that got accidentally unplugged, a sensor that just became defective, a survey respondent declined to answer a question, or the data was intentionally removed for privacy and compliance reasons. In other words, missing data is inevitable.

Generally, missing data is very common, yet sometimes it is not given the proper level of attention in terms of formulating a strategy on how to handle the situation. One approach for handling rows with missing data is to drop those observations (delete the rows). However, this may not be a good strategy if you have limited data in the first place, for example, if collecting the data is a complex and expensive process. Additionally, the drawback of deleting records, if done prematurely, is that you will not know if the missing data was due to censoring (an observation is only partially collected) or due to bias (for example, high...
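To make the cost of dropping observations concrete, here is a minimal sketch using a small, hypothetical pandas DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two missing observations
df = pd.DataFrame(
    {"sales": [100.0, np.nan, 98.0, np.nan, 105.0]},
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

dropped = df.dropna()  # delete any row containing a missing value
print(len(df), len(dropped))  # 5 3 -- 40% of the observations are lost
```

With only five observations, dropping the two incomplete rows discards 40% of the data, which illustrates why deletion is risky when data is limited.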

Performing data quality checks

Missing data refers to values that were not captured or observed in the dataset. Values can be missing for a particular feature (column) or for an entire observation (row). When ingesting data using pandas, missing values will show up as NaN, NaT, or NA.

Sometimes, missing observations are replaced with other values in the source system; for example, a numeric filler such as 99999 or 0, or a string such as missing or N/A. When missing values are represented by 0, you need to be cautious and investigate further to determine whether those zero values are legitimate or whether they indicate missing data.
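As a quick illustration of handling such filler values, here is a sketch with hypothetical data, where 99999 is assumed to be the source system's filler:

```python
import numpy as np
import pandas as pd

# Hypothetical column where the source system used 99999 as a filler
s = pd.Series([12.5, 99999.0, 13.1, 0.0, 14.0])

print(s.isna().sum())  # 0 -- the filler is not counted as missing

# Convert the filler to NaN so pandas treats it as missing;
# whether the 0.0 is legitimate requires domain knowledge
s_clean = s.replace(99999.0, np.nan)
print(s_clean.isna().sum())  # 1
```

Note that pandas will not flag the filler until it is explicitly converted to NaN, which is why an initial `isna()` check alone can be misleading.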

In this recipe, you will explore how to identify the presence of missing data.

Getting ready

You can download the Jupyter notebooks and requisite datasets from the GitHub repository. Please refer to the Technical requirements section of this chapter.

You will be using two datasets from the Ch7 folder: clicks_missing_multiple.csv and...

Handling missing data with univariate imputation using pandas

Generally, there are two approaches to imputing missing data: univariate imputation and multivariate imputation. This recipe will explore univariate imputation techniques available in pandas.

In univariate imputation, you use non-missing values in a single variable (think a column or feature) to impute the missing values for that variable. For example, if you have a sales column in the dataset with some missing values, you can use a univariate imputation method to impute missing sales observations using average sales. Here, a single column (sales) was used to calculate the mean (from non-missing values) for imputation.

Some basic univariate imputation techniques include the following:

  • Imputing using the mean.
  • Imputing using the last observation forward (forward fill). This can be referred to as Last Observation Carried Forward (LOCF).
  • Imputing using the next observation backward (backward fill). This...
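The techniques above can be sketched in pandas as follows (using hypothetical sales data):

```python
import numpy as np
import pandas as pd

# Hypothetical sales column with two missing observations
sales = pd.Series([10.0, np.nan, 14.0, np.nan, 16.0])

mean_filled = sales.fillna(sales.mean())  # mean of the non-missing values
ffilled = sales.ffill()  # LOCF: carry the last observation forward
bfilled = sales.bfill()  # backward fill: take the next observation

print(ffilled.tolist())  # [10.0, 10.0, 14.0, 14.0, 16.0]
print(bfilled.tolist())  # [10.0, 14.0, 14.0, 16.0, 16.0]
```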

Handling missing data with univariate imputation using scikit-learn

scikit-learn is a very popular machine learning library in Python. The scikit-learn library offers a plethora of options for everyday machine learning tasks and algorithms such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

Additionally, the library offers multiple options for univariate and multivariate data imputation.
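For example, a minimal sketch of univariate imputation with scikit-learn's SimpleImputer (using hypothetical data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with missing values
X = np.array([[10.0], [np.nan], [14.0], [np.nan], [16.0]])

# strategy can also be "median", "most_frequent", or "constant"
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
```

Here each NaN is replaced by the mean of the non-missing values in that column; swapping the `strategy` argument changes the statistic used.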

Getting ready

You can download the Jupyter notebooks and requisite datasets from the GitHub repository. Please refer to the Technical requirements section of this chapter.

This recipe will utilize the three functions prepared earlier (read_dataset, rmse_score, and plot_dfs). You will be using four datasets from the Ch7 folder: clicks_original.csv, clicks_missing.csv, co2_original.csv, and co2_missing_only.csv. The datasets are available from the GitHub repository.

How to do it…

You will start by importing the libraries...

Handling missing data with multivariate imputation

Earlier, we discussed the fact that there are two approaches to imputing missing data: univariate imputation and multivariate imputation.

As you have seen in the previous recipes, univariate imputation involves using one variable (column) to substitute for the missing data, disregarding other variables in the dataset. Univariate imputation techniques are usually faster and simpler to implement, but a multivariate approach often produces better results.

Instead of using a single variable (column), in a multivariate imputation, the method uses multiple variables within the dataset to impute missing values. The idea is simple: Have more variables within the dataset chime in to improve the predictability of missing values.

In other words, univariate imputation methods handle missing values for a particular variable in isolation of the entire dataset and just focus on that variable to derive the estimates. In...
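A minimal sketch of multivariate imputation with scikit-learn's IterativeImputer (an experimental API), using hypothetical, deliberately correlated data:

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated hypothetical columns; the second has a missing value
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
# The estimate for the gap is driven by the first column (close to 6.0)
```

Because the imputer models each column as a function of the others, the estimate reflects the relationship between the two columns rather than a statistic of the incomplete column alone.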

Handling missing data with interpolation

Another commonly used technique for imputing missing values is interpolation. The pandas library provides the DataFrame.interpolate() method for more complex univariate imputation strategies.

For example, one of the available methods is linear interpolation. Linear interpolation imputes missing data by drawing a straight line between the two valid points surrounding the missing value (in a time series, this means using the last valid observation before the gap and the next valid observation after it). Polynomial interpolation, on the other hand, attempts to draw a curved line between the two points. Each method uses a different mathematical operation to determine how to fill in the missing data.

The interpolation capabilities in pandas can be extended further through the SciPy library, which offers additional univariate and multivariate interpolations.
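A minimal sketch of interpolation in pandas (with hypothetical data; the polynomial method relies on SciPy under the hood):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 9.0])

linear = s.interpolate(method="linear")
print(linear.tolist())  # [1.0, 2.0, 3.0, 6.0, 9.0]

# Fits a curve through the valid points instead of straight segments
poly = s.interpolate(method="polynomial", order=2)
```

Notice that the linear method fills each gap with the midpoint of its straight-line segment, while the polynomial method can produce different values because it bends the line through the surrounding observations.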

In this recipe...

