A missing indicator is a binary variable that specifies whether a value was missing for an observation (1) or not (0). It is common practice to replace missing observations with the mean, median, or mode while flagging them with a missing indicator. This covers two angles: if the data was missing at random, the mean, median, or mode imputation accounts for it; if it wasn't, the missing indicator captures that information. In this recipe, we will learn how to add missing indicators using NumPy, scikit-learn, and Feature-engine.
Adding a missing value indicator variable
Getting ready
For an example of missing indicators used together with mean imputation, check out the Winning the KDD Cup Orange Challenge with Ensemble Selection article, which describes the winning solution of the KDD Cup 2009: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf.
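To make the idea of combining imputation with a missing indicator concrete, here is a minimal sketch on a toy dataframe. The income column and its values are made up for illustration and are not part of the Credit Approval Data Set. The indicator is created before the imputation, and the mean is learned from the train set only:

```python
import numpy as np
import pandas as pd

# Toy data; the 'income' column is a hypothetical example variable.
train = pd.DataFrame({"income": [30.0, np.nan, 50.0, 40.0]})
test = pd.DataFrame({"income": [np.nan, 60.0]})

# Learn the imputation value from the train set only.
train_mean = train["income"].mean()

for df in (train, test):
    # Flag the missing values first, before they are overwritten.
    df["income_NA"] = df["income"].isnull().astype(int)
    # Then replace the missing values with the train set mean.
    df["income"] = df["income"].fillna(train_mean)

print(train)
print(test)
```

Creating the flag before calling fillna() matters: once the values are imputed, there is no longer a way to tell which observations were originally missing.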
How to do it...
Let's begin by importing the required packages and preparing the data:
- Let's import the required libraries, functions and classes:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import MissingIndicator
from feature_engine.missing_data_imputers import AddNaNBinaryImputer
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- Let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
- Using NumPy, we'll add a missing indicator to the numerical and categorical variables in a loop:
for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:
    X_train[var + '_NA'] = np.where(X_train[var].isnull(), 1, 0)
    X_test[var + '_NA'] = np.where(X_test[var].isnull(), 1, 0)
- Let's inspect the result of the preceding code block:
X_train.head()
The newly added variables, each with the _NA suffix, appear at the end of the dataframe.
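Since the output depends on the dataset file, the following small sketch reproduces the same loop on a toy dataframe so that the shape of the result is easy to check. The values are invented for illustration; only the column naming follows the recipe. It also confirms that isnull().astype(int) is an equivalent pandas-only way of building the indicator:

```python
import numpy as np
import pandas as pd

# Toy dataframe with one categorical and one numerical variable.
toy = pd.DataFrame({
    "A1": ["a", None, "b"],
    "A2": [1.5, 2.0, np.nan],
})

# Same pattern as in the recipe: 1 where the value is missing, 0 otherwise.
for var in ["A1", "A2"]:
    toy[var + "_NA"] = np.where(toy[var].isnull(), 1, 0)

# Equivalent pandas-only expression for a single variable.
alternative = toy["A1"].isnull().astype(int)

print(toy)
```

The indicator columns are appended after the original ones, in the order in which the loop visits the variables.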

Now, let's add missing indicators using Feature-engine instead. First, we need to load and divide the data, just like we did in step 2 and step 3 of this recipe.
- Next, let's set up a transformer that will add binary indicators to all the variables in the dataset using AddNaNBinaryImputer() from Feature-engine:
imputer = AddNaNBinaryImputer()
- Let's fit AddNaNBinaryImputer() to the train set:
imputer.fit(X_train)
- Finally, let's add the missing indicators:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
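To illustrate the fit and transform pattern used in the preceding steps, here is a rough pandas-only sketch. MissingFlagger is a hypothetical helper written for this example; it is not Feature-engine's implementation, and the _NA suffix is an assumption carried over from the NumPy steps:

```python
import numpy as np
import pandas as pd


class MissingFlagger:
    """Hypothetical helper that mimics the fit/transform pattern."""

    def __init__(self, suffix="_NA"):
        self.suffix = suffix

    def fit(self, X):
        # Remember which variables were present in the train set.
        self.variables_ = list(X.columns)
        return self

    def transform(self, X):
        # Work on a copy so the input dataframe is left untouched.
        X = X.copy()
        for var in self.variables_:
            X[var + self.suffix] = X[var].isnull().astype(int)
        return X


train = pd.DataFrame({"A1": ["a", None, "b"], "A2": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"A1": ["b", "a"], "A2": [np.nan, 4.0]})

flagger = MissingFlagger().fit(train)
train_t = flagger.transform(train)
test_t = flagger.transform(test)
```

Because the list of variables is fixed during fit(), the train and test sets end up with the same columns, even when a variable such as A2 has missing values in only one of them.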
We can also add missing indicators using scikit-learn's MissingIndicator() class. To do this, we need to load and divide the dataset, just like we did in step 2 and step 3.
- Next, we'll set up a MissingIndicator(). Here, we will add indicators only to variables with missing data:
indicator = MissingIndicator(features='missing-only')
- Let's fit the transformer so that it finds the variables with missing data in the train set:
indicator.fit(X_train)
Now, we can concatenate the missing indicators that were created by MissingIndicator() to the train set.
- First, let's create a column name for each of the new missing indicators with a list comprehension:
indicator_cols = [c+'_NA' for c in X_train.columns[indicator.features_]]
- Next, let's concatenate the original train set with the missing indicators, which we obtain using the transform method:
X_train = pd.concat([
    # drop=True discards the old index instead of adding it as an 'index' column
    X_train.reset_index(drop=True),
    pd.DataFrame(indicator.transform(X_train),
                 columns=indicator_cols)], axis=1)
The result of the preceding code block should contain the original variables, plus the indicators.
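The steps above only transform the train set. As a minimal sketch of how the fitted MissingIndicator() could also be applied to a test set, consider the following toy example. The dataframes and the add_indicators() helper are hypothetical and only serve the illustration; here the original index is reused for the indicator dataframe so that the rows align without resetting it:

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Toy data: 'b' has no missing values in the train set.
train = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0], "c": [np.nan, 8.0, 9.0]},
    index=[10, 11, 12],
)
test = pd.DataFrame(
    {"a": [np.nan, 2.0], "b": [1.0, 2.0], "c": [3.0, np.nan]},
    index=[20, 21],
)

# Find the variables with missing data in the train set.
indicator = MissingIndicator(features="missing-only")
indicator.fit(train)
indicator_cols = [c + "_NA" for c in train.columns[indicator.features_]]


def add_indicators(df):
    # Hypothetical helper: append the indicators as 0/1 columns,
    # keeping the index of the input dataframe.
    flags = pd.DataFrame(
        indicator.transform(df).astype(int),
        columns=indicator_cols,
        index=df.index,
    )
    return pd.concat([df, flags], axis=1)


train_t = add_indicators(train)
test_t = add_indicators(test)
```

Note that with features='missing-only' and the default error_on_new=True, transform() raises an error if the test set contains missing values in a variable that had none in the train set, which is why b is complete in both toy dataframes.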
How it works...
In this recipe, we added missing value indicators to categorical and numerical variables in the Credit Approval Data Set using NumPy, scikit-learn, and Feature-engine. To add missing indicators using NumPy, we used the where() function, which scans all the observations in a variable and returns a new vector with the value 1 where the observation is missing and 0 otherwise. We captured the indicators in columns named after the original variable, plus the _NA suffix.
To add a missing indicator with Feature-engine, we created an instance of AddNaNBinaryImputer() and fitted it to the train set. Then, we used the transform() method to add missing indicators to the train and test sets. Finally, to add missing indicators with scikit-learn, we created an instance of MissingIndicator() so that we only added indicators to variables with missing data. With the fit() method, the transformer identified the variables with missing values. With transform(), it returned a NumPy array of Boolean indicators, which we captured in a dataframe and then concatenated to the original dataframe.
There's more...
We can also add missing indicators using scikit-learn's SimpleImputer() by setting the add_indicator argument to True. For example, after calling the fit() and transform() methods, imputer = SimpleImputer(strategy='mean', add_indicator=True) returns a NumPy array in which the missing values in the original variables are replaced by the mean, followed by the missing indicators.
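As a small illustration of the add_indicator argument, the following sketch runs SimpleImputer() on a toy array. The values are made up for the example; the indicator columns are appended after the imputed variables, and only for variables that had missing data during fit():

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array: only the first column contains a missing value.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputer.fit(X)
X_t = imputer.transform(X)

# Two imputed columns, followed by one indicator column for the first variable.
print(X_t)
```

The returned array has no column names, so when working with dataframes the names need to be reattached, much like we did with MissingIndicator().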
See also
To learn more about the transformers that were discussed in this recipe, take a look at the following links:
- Scikit-learn's MissingIndicator(): https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html
- Scikit-learn's SimpleImputer(): https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
- Feature-engine's AddNaNBinaryImputer(): https://feature-engine.readthedocs.io/en/latest/imputers/AddNaNBinaryImputer.html