You're reading from Python Feature Engineering Cookbook Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback

Published in Jan 2020

Publisher Packt

ISBN-13 9781789806311

Length 372 pages

Edition 1st Edition

Languages

Python

Tools

NumPy

Concepts

Machine Learning

Author (1):

Soledad Galli


Adding a missing value indicator variable

A missing indicator is a binary variable that specifies whether a value was missing for an observation (1) or not (0). It is common practice to replace missing observations with the mean, median, or mode while flagging them with a missing indicator. This covers two cases: if the data was missing at random, the mean, median, or mode imputation accounts for it; if it wasn't, the missing indicator captures that information. In this recipe, we will learn how to add missing indicators using NumPy, scikit-learn, and Feature-engine.

Getting ready

For an example of the implementation of missing indicators, along with mean imputation, check out the article Winning the KDD Cup Orange Challenge with Ensemble Selection, which describes the winning solution of the KDD Cup 2009: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf.

How to do it...

Let's begin by importing the required packages and preparing the data:

Let's import the required libraries, functions and classes:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import MissingIndicator
from feature_engine.missing_data_imputers import AddNaNBinaryImputer

Let's load the dataset:

data = pd.read_csv('creditApprovalUCI.csv')

Let's separate the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3, 
    random_state=0)

Using NumPy, we'll add a missing indicator to the numerical and categorical variables in a loop:

for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:
    X_train[var + '_NA'] = np.where(X_train[var].isnull(), 1, 0)
    X_test[var + '_NA'] = np.where(X_test[var].isnull(), 1, 0)

Note how we name the new missing indicators using the original variable name, plus _NA.

Let's inspect the result of the preceding code block:

X_train.head()

The newly added variables appear at the end of the dataframe.

The mean of each new indicator should equal the fraction of missing values in the corresponding original variable, which you can corroborate by executing X_train['A3'].isnull().mean(), X_train['A3_NA'].mean().
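As a quick check of this equality, here is a minimal sketch on a small synthetic dataframe. The values are invented for illustration and are not taken from the Credit Approval data; only the column names mirror the recipe.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    'A3': [1.5, np.nan, 3.0, np.nan, 2.0],
    'A4': ['u', 'y', None, 'u', 'u'],
})

# Add a binary indicator per variable, named <variable>_NA
for var in ['A3', 'A4']:
    df[var + '_NA'] = np.where(df[var].isnull(), 1, 0)

# Fraction of missing values in the original variable vs. indicator mean
frac_missing = df['A3'].isnull().mean()
indicator_mean = df['A3_NA'].mean()
print(frac_missing, indicator_mean)
```

Both numbers should agree, because the indicator is 1 exactly where the original value is missing.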

Now, let's add missing indicators using Feature-engine instead. First, we need to load and divide the data, just like we did in step 2 and step 3 of this recipe.

Next, let's set up a transformer that will add binary indicators to all the variables in the dataset using AddNaNBinaryImputer() from Feature-engine:

imputer = AddNaNBinaryImputer()

We can specify which variables should have missing indicators by passing the variable names in a list: imputer = AddNaNBinaryImputer(variables = ['A2', 'A3']). If we don't pass a list, the imputer will add indicators to all the variables.

Let's fit AddNaNBinaryImputer() to the train set:

imputer.fit(X_train)

Finally, let's add the missing indicators:

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

We can inspect the result using X_train.head(); it should be similar to the output of step 5 in this recipe.

We can also add missing indicators using scikit-learn's MissingIndicator() class. To do this, we need to load and divide the dataset, just like we did in step 2 and step 3.

Next, we'll set up a MissingIndicator(). Here, we will add indicators only to variables with missing data:

indicator = MissingIndicator(features='missing-only')

Let's fit the transformer so that it finds the variables with missing data in the train set:

indicator.fit(X_train)

Now, we can concatenate the missing indicators that were created by MissingIndicator() to the train set.

First, let's create a column name for each of the new missing indicators with a list comprehension:

indicator_cols = [c+'_NA' for c in X_train.columns[indicator.features_]]

The features_ attribute contains the indices of the features for which missing indicators will be added. If we pass these indices to the train set column array, we can get the variable names.

Next, let's concatenate the original train set with the missing indicators, which we obtain using the transform method:

# Reuse the train set index so that the rows align in pd.concat
X_train = pd.concat([
    X_train,
    pd.DataFrame(indicator.transform(X_train),
                 columns=indicator_cols,
                 index=X_train.index)], axis=1)

Scikit-learn transformers return NumPy arrays, so to concatenate the indicators to a dataframe, we must first cast the array as a dataframe using pandas DataFrame().

The result of the preceding code block should contain the original variables, plus the indicators.
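The scikit-learn steps can be sketched end to end on toy data as follows. The dataframe and its index values are made up for illustration, and passing the original index to DataFrame() is a choice of this sketch so that rows align without resetting the index.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Toy train set with a shuffled index, similar to what train_test_split returns
X = pd.DataFrame(
    {'A1': [1.0, np.nan, 3.0],
     'A2': [4.0, 5.0, 6.0],
     'A3': [np.nan, 8.0, np.nan]},
    index=[10, 42, 7],
)

indicator = MissingIndicator(features='missing-only')
indicator.fit(X)

# features_ holds the positions of the columns that had missing data
indicator_cols = [c + '_NA' for c in X.columns[indicator.features_]]

# transform() returns a boolean NumPy array; wrap it in a dataframe
flags = pd.DataFrame(indicator.transform(X),
                     columns=indicator_cols,
                     index=X.index)

X_flagged = pd.concat([X, flags], axis=1)
print(X_flagged)
```

Note that A2 gets no indicator, because it has no missing values in the data the transformer was fitted on.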

How it works...

In this recipe, we added missing value indicators to categorical and numerical variables in the Credit Approval Data Set using NumPy, scikit-learn, and Feature-engine. To add missing indicators using NumPy, we used the where() function, which scanned all the observations in a variable and returned a new vector with the value 1 where an observation was missing and 0 otherwise. We captured the indicators in columns with the name of the original variable, plus _NA.

To add a missing indicator with Feature-engine, we created an instance of AddNaNBinaryImputer() and fitted it to the train set. Then, we used the transform() method to add missing indicators to the train and test sets. Finally, to add missing indicators with scikit-learn, we created an instance of MissingIndicator() so that we only added indicators to variables with missing data. With the fit() method, the transformer identified variables with missing values. With transform(), it returned a NumPy array with binary indicators, which we captured in a dataframe and then concatenated to the original dataframe.
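To make the fit() and transform() pattern concrete, here is a minimal sketch of a transformer that behaves along the lines described above. The class SimpleNaNIndicator is hypothetical: it is not Feature-engine's or scikit-learn's implementation, only an illustration built on pandas and NumPy.

```python
import numpy as np
import pandas as pd


class SimpleNaNIndicator:
    """Hypothetical helper, not part of Feature-engine or scikit-learn."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X):
        # Decide which variables to flag; default to every column
        if self.variables is None:
            self.variables_ = list(X.columns)
        else:
            self.variables_ = list(self.variables)
        return self

    def transform(self, X):
        # Work on a copy so the input dataframe is left untouched
        X = X.copy()
        for var in self.variables_:
            X[var + '_NA'] = np.where(X[var].isnull(), 1, 0)
        return X


# Toy train and test sets, invented for illustration
train = pd.DataFrame({'A2': [30.0, np.nan], 'A3': [1.0, 2.0]})
test = pd.DataFrame({'A2': [np.nan, 25.0], 'A3': [np.nan, 4.0]})

adder = SimpleNaNIndicator(variables=['A2', 'A3']).fit(train)
train_t = adder.transform(train)
test_t = adder.transform(test)
```

Fitting on the train set and reusing the fitted object on the test set ensures that both sets end up with the same indicator columns.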

There's more...

We can add missing indicators using scikit-learn's SimpleImputer() by setting the add_indicator argument to True. For example, after setting up imputer = SimpleImputer(strategy='mean', add_indicator=True) and calling the fit() and transform() methods, we obtain a NumPy array in which the missing values of the original variables have been replaced by the mean, followed by the missing indicators.
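A minimal sketch of this option on a toy array follows. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: only the first column contains a missing value
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

imputer = SimpleImputer(strategy='mean', add_indicator=True)

# The output holds the imputed variables first, then the indicators
X_t = imputer.fit_transform(X)
print(X_t)
```

Here, the output has three columns: the two imputed variables and one indicator for the first variable, since indicators are added only for variables that had missing values during fit().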