Imputing Missing Data
Missing data, that is, the absence of values for certain observations, is an unavoidable problem in most data sources. Scikit-learn, the most widely used Python library for machine learning, does not accept missing values as input to most of its models. Thus, we must either remove the observations with missing data or transform them into permitted values.
The act of replacing missing data with statistical estimates of missing values is called imputation. The goal of any imputation technique is to produce a complete dataset. There are multiple imputation methods that we can use, depending on whether the data is missing at random, the proportion of missing values, and the machine learning model we intend to use. In this chapter, we will discuss several imputation methods.
This chapter will cover the following recipes:
- Removing observations with missing data
- Performing mean or median imputation
- Imputing categorical variables
- Replacing missing...
Technical requirements
In this chapter, we will use the Matplotlib, pandas, NumPy, scikit-learn, and feature-engine Python libraries. If you need to install Python, the free Anaconda Python distribution (https://www.anaconda.com/) comes with most numerical computing libraries out of the box.
Feature-engine can be installed with pip:
pip install feature-engine
If you use Anaconda, you can install feature-engine with conda:
conda install -c conda-forge feature_engine
We will use the Credit Approval dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/). To prepare the dataset, follow these steps:
- Visit https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/.
- Click on crx.data to download the data.
Figure 1.1 – Screenshot of the dataset download page
- Save crx.data to the folder where you will run the following commands.
Open a Jupyter notebook and run the following...
Removing observations with missing data
Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations with missing data. CCA can be applied to both categorical and numerical variables. With CCA, we preserve the distribution of the variables in the remaining data, provided the data is missing at random and only in a small proportion of observations. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.
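In pandas, CCA boils down to a single `dropna()` call. A minimal sketch on toy data (the DataFrame below is made up for illustration, not the Credit Approval dataset):

```python
import numpy as np
import pandas as pd

# Toy data with missing values in two columns
df = pd.DataFrame({
    "A2": [22.1, np.nan, 35.0, 41.5],
    "A3": [1.5, 2.0, np.nan, 3.2],
    "A4": ["u", "y", "u", "u"],
})

# Complete Case Analysis: keep only rows with no missing values
complete_cases = df.dropna()

# Only the first and last rows are fully observed
print(complete_cases.shape)
```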
How to do it...
Let’s begin by making some imports and loading the dataset:
- Let’s import the pandas and matplotlib libraries:

import matplotlib.pyplot as plt
import pandas as pd
- Load the dataset that we prepared in the Technical requirements section:
data = pd.read_csv("credit_approval_uci.csv")
- Find the proportion of missing values per variable, sort them in ascending order, and then make a bar plot, rotating the ticks on the x-axis and adding...
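The missing-value step above might look like the following sketch. Toy data stands in for the Credit Approval dataset, and the non-interactive Agg backend is used so the plot renders without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the Credit Approval data
data = pd.DataFrame({
    "A1": ["a", np.nan, "b", "a"],
    "A2": [22.1, np.nan, np.nan, 41.5],
    "A3": [1.5, 2.0, 2.5, 3.2],
})

# Proportion of missing values per variable, sorted ascending
na_prop = data.isnull().mean().sort_values(ascending=True)

ax = na_prop.plot.bar(rot=45)  # rotate the x-axis ticks
ax.set_ylabel("proportion of missing data")
plt.tight_layout()
```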
Performing mean or median imputation
Mean or median imputation consists of replacing missing values with the mean or median of the variable. The mean or median is calculated on the train set, and these values are then used to impute missing data in the train and test sets, as well as in all future data we intend to use with the machine learning model. Scikit-learn and feature-engine transformers learn the mean or median from the train set and store these parameters for future use out of the box. In this recipe, we will perform mean and median imputation using pandas, scikit-learn, and feature-engine.
Tip
Use mean imputation if variables are normally distributed and median imputation otherwise. Mean and median imputation may distort the distribution of the original variables if there is a high percentage of missing data.
How to do it...
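The recipe's steps are truncated here; as a minimal sketch, scikit-learn's SimpleImputer learns the statistic on the train set and reuses it everywhere else (the toy data below is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = pd.DataFrame({"A2": [20.0, 30.0, np.nan, 40.0]})
X_test = pd.DataFrame({"A2": [np.nan, 25.0]})

# Learn the median on the train set only
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)

# The stored statistic is applied to train, test, and any future data
X_train_t = imputer.transform(X_train)
X_test_t = imputer.transform(X_test)

# imputer.statistics_ holds the learned median of A2: 30.0
```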
Imputing categorical variables
Categorical variables usually contain strings as values, instead of numbers. We replace missing data in categorical variables with the most frequent category, or with a different string. Frequent categories are estimated using the train set and then used to impute values in the train, test, and future datasets. Thus, we need to learn and store these values, which we can do using scikit-learn and feature-engine’s out-of-the-box transformers. In this recipe, we will replace missing data in categorical variables with the most frequent category, or with an arbitrary string.
How to do it...
To begin, let’s make a few imports and prepare the data:
- Let’s import pandas and the required functions and classes from scikit-learn and feature-engine:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from feature_engine.imputation...
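As a sketch of what these imports enable, SimpleImputer with the most_frequent strategy replaces missing categories with the mode learned from the train set (toy data, made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical variable with one missing value
X_train = pd.DataFrame({"A4": ["u", "u", "y", np.nan]})

# most_frequent learns the mode on the train set
imputer = SimpleImputer(strategy="most_frequent")
X_t = imputer.fit_transform(X_train)

# The missing entry is filled with the most frequent category, "u"
```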
Replacing missing values with an arbitrary number
Arbitrary number imputation consists of replacing missing data with an arbitrary value. Commonly used values include 999, 9999, or -1 for positive distributions. This method is suitable for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.
When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.
Tip
Arbitrary number imputation can be used when data is not missing at random, when we are building non-linear models, and when the percentage of missing data is high. This imputation technique distorts the original variable distribution.
In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.
How to do it...
Let’...
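The steps are truncated above; a minimal sketch with scikit-learn's constant strategy, using 999 as the arbitrary value mentioned in the text (toy data, made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical variable with a missing value
X = pd.DataFrame({"A2": [22.0, np.nan, 35.0]})

# Replace missing values with the arbitrary number 999
imputer = SimpleImputer(strategy="constant", fill_value=999)
X_t = imputer.fit_transform(X)

# The missing entry now holds 999, far from the variable's common values
```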
Finding extreme values for imputation
Replacing missing values with a value at the end of the variable’s distribution (an extreme value) is equivalent to replacing them with an arbitrary value, but instead of identifying the arbitrary value manually, it is automatically selected from the very end of the variable’s distribution. Missing data can be replaced with a value that is greater or smaller than the remaining values of the variable. To select a greater value, we can use the mean plus a factor of the standard deviation, or the 75th quantile + (IQR * 1.5), where the inter-quartile range (IQR) is the 75th quantile minus the 25th quantile. To replace missing data with a value smaller than the remaining values, we can use the mean minus a factor of the standard deviation, or the 25th quantile – (IQR * 1.5).
Note
End-of-tail imputation may distort the distribution of the original variables, so it may not be suitable for linear models.
In this recipe, we...
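The IQR rule above can be sketched directly in pandas (toy data; feature-engine provides a transformer that automates this, but the arithmetic is the point here):

```python
import numpy as np
import pandas as pd

# Toy variable with a missing value
s = pd.Series([2.0, 4.0, 6.0, 8.0, np.nan])

# 75th quantile + (IQR * 1.5), ignoring missing values
q25, q75 = s.quantile(0.25), s.quantile(0.75)
iqr = q75 - q25
upper_tail = q75 + 1.5 * iqr

# Replace missing data with the value at the upper end of the distribution
s_imputed = s.fillna(upper_tail)
```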
Marking imputed values
A missing indicator is a binary variable that takes the value 1 or True to indicate whether a value was missing, or 0 or False otherwise. It is common practice to replace missing observations with the mean, median, or most frequent category while simultaneously marking those missing observations with missing indicators. In this recipe, we will learn how to add missing indicators using pandas, scikit-learn, and feature-engine.
How to do it...
Let’s begin by making some imports and loading the data:
- Let’s import the required libraries, functions, and classes:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from feature_engine.imputation import (
    AddMissingIndicator,
    CategoricalImputer,
    MeanMedianImputer
...
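As a minimal sketch of the idea, SimpleImputer's add_indicator flag imputes and appends binary missing indicators in one step (toy data, made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy variable with one missing value
X = pd.DataFrame({"A2": [20.0, np.nan, 40.0]})

# Impute with the median and append a binary missing indicator column
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_t = imputer.fit_transform(X)

# X_t has two columns: the imputed A2, and the indicator (1.0 where missing)
```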
Performing multivariate imputation by chained equations
Multivariate imputation methods, as opposed to univariate imputation, use multiple variables to estimate the missing values. In other words, the missing values of a variable are modeled based on the other variables in the dataset. Multivariate Imputation by Chained Equations (MICE) models each variable with missing values as a function of the remaining variables and uses that estimate for imputation.
The following steps are required to perform MICE:
- A simple univariate imputation is performed for every variable with missing data, for example, median imputation.
- One specific variable is selected, say, var_1, and its imputed values are set back to missing.
- A model is trained to predict var_1 using the remaining variables as input features.
- The missing values of var_1 are replaced with the new estimates.
- Steps 2 to 4 are repeated for each of the remaining variables.
Once all the variables have been...
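The steps above can be sketched with scikit-learn's IterativeImputer, which is still experimental and must be enabled explicitly (toy data, made up for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated toy variables, each with one missing value
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [np.nan, 8.0],
])

# Each variable with missing data is modeled from the others, iteratively
imputer = IterativeImputer(max_iter=10, random_state=0)
X_t = imputer.fit_transform(X)

# No missing values remain after imputation
```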
Estimating missing data with nearest neighbors
In imputation with K-Nearest Neighbors (KNN), missing values are replaced with values estimated from their k closest neighbors. The neighbors of each observation are found using distance measures such as the Euclidean distance, and the replacement value is estimated as the mean or weighted mean of the neighbors’ values, where farther neighbors have less influence on the replacement value. In this recipe, we will perform KNN imputation using scikit-learn.
How to do it...
To proceed with the recipe, let’s import the required libraries and prepare the data:
- Let’s import the required libraries, classes, and functions:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
- Let’s load the dataset that we prepared in the Technical requirements section only with some numerical variables:
variables = ["A2", "...
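A minimal sketch of KNNImputer on toy data (the recipe's variable list is truncated above, so the values here are made up):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the third row has a missing value in the second column
X = np.array([
    [1.0, 2.0],
    [1.0, 4.0],
    [1.0, np.nan],
    [10.0, 20.0],
])

# Replace each missing value with the mean of its 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
X_t = imputer.fit_transform(X)

# The missing entry is filled with the mean of its neighbors' values, 2 and 4
```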