Let's begin by importing the necessary tools and loading and preparing the data:
- Import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.imputation import ArbitraryNumberImputer
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- Let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
Normally, we select an arbitrary value that is bigger than the maximum value of the variable's distribution, so that the imputed entries are easy to distinguish from genuine observations.
- Let's find the maximum value of four numerical variables:
X_train[['A2','A3', 'A8', 'A11']].max()
The following is the output of the preceding code block:
A2 76.750
A3 26.335
A8 20.000
A11 67.000
dtype: float64
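As a sketch of how one could derive such a value programmatically instead of eyeballing the maxima, the snippet below picks the next power of ten above the largest maximum; the maxima are copied from the output above, so the data is illustrative only:

```python
import numpy as np
import pandas as pd

# maxima taken from the output above, reproduced here for illustration
maxima = pd.Series({"A2": 76.75, "A3": 26.335, "A8": 20.0, "A11": 67.0})

# next power of ten above the largest observed maximum
arbitrary_value = 10 ** int(np.ceil(np.log10(maxima.max())))
print(arbitrary_value)  # 100
```

Any round number above all the maxima works, which is why 99 is a valid choice for these variables.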
- Let's replace the missing values with 99 in the numerical variables that we specified in step 4:
for var in ['A2', 'A3', 'A8', 'A11']:
    X_train[var] = X_train[var].fillna(99)
    X_test[var] = X_test[var].fillna(99)
We chose 99 as the arbitrary value because it is bigger than the maximum value of these variables.
We can check the percentage of missing values using X_train[['A2','A3', 'A8', 'A11']].isnull().mean(), which should be 0 after step 5.
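To make that check concrete, here is a minimal, self-contained sketch with made-up data showing the missing-value fraction before and after imputation:

```python
import numpy as np
import pandas as pd

# hypothetical variable with two missing values out of four
df = pd.DataFrame({"A2": [22.0, np.nan, 56.0, np.nan]})

print(df["A2"].isnull().mean())  # 0.5 before imputation
df["A2"] = df["A2"].fillna(99)
print(df["A2"].isnull().mean())  # 0.0 after imputation
```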
Now, we'll impute missing values with an arbitrary number using scikit-learn instead.
- First, let's separate the data into train and test sets while keeping only the numerical variables:
X_train, X_test, y_train, y_test = train_test_split(
    data[['A2', 'A3', 'A8', 'A11']], data['A16'], test_size=0.3,
    random_state=0)
- Let's set up SimpleImputer() so that it replaces any missing values with 99:
imputer = SimpleImputer(strategy='constant', fill_value=99)
If your dataset contains categorical variables with missing values, SimpleImputer() will replace those missing values with 99 as well.
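One way to avoid that, sketched below with hypothetical data, is to restrict the imputer to the numerical columns with scikit-learn's ColumnTransformer and pass the remaining columns through untouched:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# made-up data: one numerical and one categorical variable
df = pd.DataFrame({
    "A2": [22.0, np.nan, 56.0],   # numerical, to be imputed
    "A13": ["g", None, "s"],      # categorical, left untouched
})

ct = ColumnTransformer(
    [("num", SimpleImputer(strategy="constant", fill_value=99), ["A2"])],
    remainder="passthrough",
)
result = ct.fit_transform(df)
# first column holds the imputed A2; second holds A13 unchanged
```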
- Let's fit the imputer to the train set:
imputer.fit(X_train)
- Let's replace the missing values with 99:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Note that SimpleImputer() will return a NumPy array. Be mindful of the order of the variables if you're transforming the array back into a dataframe.
To finish, let's impute missing values using Feature-engine. First, we need to load the data and separate it into train and test sets, just like we did in step 2 and step 3.
- Next, let's set up Feature-engine's ArbitraryNumberImputer() so that it replaces any missing values with 99, and specify the variables from which missing data should be imputed:
imputer = ArbitraryNumberImputer(arbitrary_number=99,
    variables=['A2', 'A3', 'A8', 'A11'])
If we don't pass a list of variables, ArbitraryNumberImputer() will automatically select all numerical variables in the train set.
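That automatic selection amounts to picking the numerical columns of the dataframe. A minimal pandas sketch of the same selection, using hypothetical data, looks like this:

```python
import numpy as np
import pandas as pd

# made-up data mixing a numerical and a categorical variable
df = pd.DataFrame({
    "A2": [22.0, np.nan],   # numerical -> would be selected for imputation
    "A13": ["g", "s"],      # categorical -> ignored
})

numerical_vars = df.select_dtypes(include="number").columns.tolist()
print(numerical_vars)  # ['A2']
```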
- Let's fit the arbitrary number imputer to the train set:
imputer.fit(X_train)
- Finally, let's replace the missing values with 99:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
The variables specified in step 10 should now have their missing data replaced with the number 99. Unlike SimpleImputer(), Feature-engine's transformers return a pandas dataframe, so the variable names and order are preserved.