You're reading from Time Series Analysis with Python Cookbook

Product type: Book
Published in: Jun 2022
Publisher: Packt
ISBN-13: 9781801075541
Edition: 1st
Author: Tarek A. Atwan

Tarek A. Atwan is a data analytics expert with over 16 years of international consulting experience, providing subject matter expertise in data science, machine learning operations, data engineering, and business intelligence. He has taught multiple hands-on coding boot camps, courses, and workshops on various topics, including data science, data visualization, Python programming, time series forecasting, and blockchain at various universities in the United States. He is regarded as a data science mentor and advisor, working with executive leaders in numerous industries to solve complex problems using a data-driven approach.

Chapter 7: Handling Missing Data

As a data scientist, data analyst, or business analyst, you have probably discovered that obtaining a perfect clean dataset is too optimistic. What is more common, though, is that the data you are working with suffers from flaws such as missing values, erroneous data, duplicate records, insufficient data, or the presence of outliers in the data.

Time series data is no different, and before plugging the data into any analysis or modeling workflow, you must investigate the data first. It is vital to understand the business context around the time series data to detect and identify these problems successfully. For example, if you work with stock data, the context is very different from COVID data or sensor data.

Having that intuition or domain knowledge will allow you to anticipate what to expect and what is considered acceptable when analyzing the data. Always try to understand the business context around the data. For example, why is the data...

Technical requirements

You can download the Jupyter notebooks and the requisite datasets from the book's GitHub repository to follow along.

In this chapter and beyond, you will extensively use pandas 1.4.2 (released April 2, 2022). There are five additional libraries that you will be using:

  • NumPy (≥ 1.20.3)
  • Matplotlib (≥ 3.5.0)
  • statsmodels (≥ 0.11.0)
  • scikit-learn (≥ 1.0.1)
  • SciPy (≥ 1.7.1)

If you are using pip, then you can install these packages from your terminal with the following command:

pip install matplotlib numpy statsmodels scikit-learn scipy

If you are using conda, then you can install these packages with the following command:

conda install matplotlib...

Understanding missing data

Data can be missing for a variety of reasons, such as unexpected power outages, a device that got accidentally unplugged, a sensor that just became defective, a survey respondent declined to answer a question, or the data was intentionally removed for privacy and compliance reasons. In other words, missing data is inevitable.

Generally, missing data is very common, yet sometimes it is not given the proper level of attention in terms of formulating a strategy on how to handle the situation. One approach for handling rows with missing data is to drop those observations (delete the rows). However, this may not be a good strategy if you have limited data in the first place, for example, if collecting the data is a complex and expensive process. Additionally, the drawback of deleting records, if done prematurely, is that you will not know if the missing data was due to censoring (an observation is only partially collected) or due to bias (for example, high...
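To make the cost of dropping observations concrete, here is a minimal sketch using a small, hypothetical pandas DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two missing observations
df = pd.DataFrame(
    {"sales": [100.0, np.nan, 98.0, np.nan, 105.0]},
    index=pd.date_range("2022-01-01", periods=5, freq="D"),
)

dropped = df.dropna()  # delete any row containing a missing value
print(len(df), len(dropped))  # 5 3 -- 40% of the observations are lost
```

With only five observations, dropping the two incomplete rows discards 40% of the data, which illustrates why deletion is risky when data is limited.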

Performing data quality checks

Missing data refers to values that were not captured or observed in the dataset. Values can be missing for a particular feature (column) or for an entire observation (row). When ingesting data using pandas, missing values will show up as NaN, NaT, or NA.

Sometimes, missing observations are replaced with other values in the source system; for example, a numeric filler such as 99999 or 0, or a string such as missing or N/A. When missing values are represented by 0, you need to be cautious and investigate further to determine whether those zero values are legitimate or whether they indicate missing data.
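As a quick illustration of handling such filler values, here is a sketch with hypothetical data, where 99999 is assumed to be the source system's filler:

```python
import numpy as np
import pandas as pd

# Hypothetical column where the source system used 99999 as a filler
s = pd.Series([12.5, 99999.0, 13.1, 0.0, 14.0])

print(s.isna().sum())  # 0 -- the filler is not counted as missing

# Convert the filler to NaN so pandas treats it as missing;
# whether the 0.0 is legitimate requires domain knowledge
s_clean = s.replace(99999.0, np.nan)
print(s_clean.isna().sum())  # 1
```

Note that pandas will not flag the filler until it is explicitly converted to NaN, which is why an initial `isna()` check alone can be misleading.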

In this recipe, you will explore how to identify the presence of missing data.

Getting ready

You can download the Jupyter notebooks and requisite datasets from the GitHub repository. Please refer to the Technical requirements section of this chapter.

You will be using two datasets from the Ch7 folder: clicks_missing_multiple.csv and...

Handling missing data with univariate imputation using pandas

Generally, there are two approaches to imputing missing data: univariate imputation and multivariate imputation. This recipe will explore univariate imputation techniques available in pandas.

In univariate imputation, you use non-missing values in a single variable (think a column or feature) to impute the missing values for that variable. For example, if you have a sales column in the dataset with some missing values, you can use a univariate imputation method to impute missing sales observations using average sales. Here, a single column (sales) was used to calculate the mean (from non-missing values) for imputation.

Some basic univariate imputation techniques include the following:

  • Imputing using the mean.
  • Imputing using the last observation forward (forward fill). This can be referred to as Last Observation Carried Forward (LOCF).
  • Imputing using the next observation backward (backward fill). This...
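The techniques above can be sketched in pandas as follows (using hypothetical sales data):

```python
import numpy as np
import pandas as pd

# Hypothetical sales column with two missing observations
sales = pd.Series([10.0, np.nan, 14.0, np.nan, 16.0])

mean_filled = sales.fillna(sales.mean())  # mean of the non-missing values
ffilled = sales.ffill()  # LOCF: carry the last observation forward
bfilled = sales.bfill()  # backward fill: take the next observation

print(ffilled.tolist())  # [10.0, 10.0, 14.0, 14.0, 16.0]
print(bfilled.tolist())  # [10.0, 14.0, 14.0, 16.0, 16.0]
```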

Handling missing data with univariate imputation using scikit-learn

scikit-learn is a very popular machine learning library in Python. The scikit-learn library offers a plethora of options for everyday machine learning tasks and algorithms such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

Additionally, the library offers multiple options for univariate and multivariate data imputation.
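For example, a minimal sketch of univariate imputation with scikit-learn's SimpleImputer (using hypothetical data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with missing values
X = np.array([[10.0], [np.nan], [14.0], [np.nan], [16.0]])

# strategy can also be "median", "most_frequent", or "constant"
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
```

Here each NaN is replaced by the mean of the non-missing values in that column; swapping the `strategy` argument changes the statistic used.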

Getting ready

You can download the Jupyter notebooks and requisite datasets from the GitHub repository. Please refer to the Technical requirements section of this chapter.

This recipe will utilize the three functions prepared earlier (read_dataset, rmse_score, and plot_dfs). You will be using four datasets from the Ch7 folder: clicks_original.csv, clicks_missing.csv, co2_original.csv, and co2_missing_only.csv. The datasets are available from the GitHub repository.

How to do it…

You will start by importing the libraries...

Handling missing data with multivariate imputation

Earlier, we discussed the fact that there are two approaches to imputing missing data: univariate imputation and multivariate imputation.

As you have seen in the previous recipes, univariate imputation involves using one variable (column) to substitute for the missing data, disregarding other variables in the dataset. Univariate imputation techniques are usually faster and simpler to implement, but a multivariate approach often produces better results.

Instead of using a single variable (column), in a multivariate imputation, the method uses multiple variables within the dataset to impute missing values. The idea is simple: Have more variables within the dataset chime in to improve the predictability of missing values.

In other words, univariate imputation methods handle missing values for a particular variable in isolation of the entire dataset and just focus on that variable to derive the estimates. In...
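A minimal sketch of multivariate imputation with scikit-learn's IterativeImputer (an experimental API), using hypothetical, deliberately correlated data:

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated hypothetical columns; the second has a missing value
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
# The estimate for the gap is driven by the first column (close to 6.0)
```

Because the imputer models each column as a function of the others, the estimate reflects the relationship between the two columns rather than a statistic of the incomplete column alone.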

Handling missing data with interpolation

Another commonly used technique for imputing missing values is interpolation. The pandas library provides the DataFrame.interpolate() method for more complex univariate imputation strategies.

For example, one of the available methods is linear interpolation. Linear interpolation imputes missing data by drawing a straight line between the two valid points surrounding the missing value (in a time series, this means using the last valid observation before the gap and the next valid observation after it). Polynomial interpolation, on the other hand, attempts to draw a curved line between the two points. Each method uses a different mathematical operation to determine how to fill in the missing data.

The interpolation capabilities in pandas can be extended further through the SciPy library, which offers additional univariate and multivariate interpolations.
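A minimal sketch of interpolation in pandas (with hypothetical data; the polynomial method relies on SciPy under the hood):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 9.0])

linear = s.interpolate(method="linear")
print(linear.tolist())  # [1.0, 2.0, 3.0, 6.0, 9.0]

# Fits a curve through the valid points instead of straight segments
poly = s.interpolate(method="polynomial", order=2)
```

Notice that the linear method fills each gap with the midpoint of its straight-line segment, while the polynomial method can produce different values because it bends the line through the surrounding observations.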

In this recipe...

