You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback

Published in Jan 2020

Publisher Packt

ISBN-13 9781789806311

Length 372 pages

Edition 1st Edition

Languages

Python

Tools

NumPy

Concepts

Machine Learning

Author (1):

Soledad Galli


Capturing missing values in a bespoke category

Missing data in categorical variables can be treated as a different category, so it is common to replace missing values with the Missing string. In this recipe, we will learn how to do so using pandas, scikit-learn, and Feature-engine.

How to do it...

To proceed with the recipe, let's import the required tools and prepare the dataset:

Import pandas and the required functions and classes from scikit-learn and Feature-engine:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import CategoricalVariableImputer

Let's load the dataset:

data = pd.read_csv('creditApprovalUCI.csv')

Let's separate the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3, 
    random_state=0)

Let's replace missing values in four categorical variables by using the Missing string:

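for var in ['A4', 'A5', 'A6', 'A7']:
    # Assign the result back to the column; inplace=True on a selected
    # column may leave the dataframe unchanged in recent pandas versions
    X_train[var] = X_train[var].fillna('Missing')
    X_test[var] = X_test[var].fillna('Missing')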
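As a minimal sketch of this step, the snippet below applies the same replacement to a small made-up dataframe rather than the Credit Approval data. The toy values and the helper name fill_with_missing are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the train set (values are made up for illustration).
toy_train = pd.DataFrame({
    "A4": pd.Series(["u", np.nan, "y", "u"], dtype=object),
    "A5": pd.Series(["g", "p", np.nan, np.nan], dtype=object),
})


def fill_with_missing(df, variables, label="Missing"):
    """Return a copy of df with NaN in the given columns replaced by label."""
    df = df.copy()
    for var in variables:
        df[var] = df[var].fillna(label)
    return df


toy_filled = fill_with_missing(toy_train, ["A4", "A5"])

# 'Missing' now shows up as a category of its own.
print(toy_filled["A5"].value_counts())
```

Working on a copy keeps the original dataframe untouched, which can be convenient when comparing several imputation methods on the same data.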

Alternatively, we can replace missing values with the Missing string using scikit-learn as follows.

First, let's separate the data into train and test sets while keeping only categorical variables:

X_train, X_test, y_train, y_test = train_test_split(
    data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3, random_state=0)

Let's set up SimpleImputer() so that it replaces missing data with the Missing string and fit it to the train set:

imputer = SimpleImputer(strategy='constant', fill_value='Missing')
imputer.fit(X_train)

SimpleImputer() from scikit-learn will replace missing values with Missing in every variable it receives, numerical or categorical. Pass it only the categorical variables, or you may end up accidentally casting your numerical variables as objects.

Let's replace the missing values:

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

Remember that SimpleImputer() returns a NumPy array, which you can transform into a dataframe using pd.DataFrame(X_train, columns = ['A4', 'A5', 'A6', 'A7']).
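The conversion back to a dataframe could look roughly like the sketch below. It uses a made-up toy dataframe, and it also passes the original index to pd.DataFrame(), which is an addition not shown in the recipe: after a random split the row labels are no longer 0, 1, 2, ..., so keeping them lets the imputed data line up with y_train.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train set with a non-default index, as it would look after a random split.
toy_train = pd.DataFrame(
    {
        "A4": pd.Series(["u", np.nan, "y"], dtype=object),
        "A5": pd.Series([np.nan, "g", "p"], dtype=object),
    },
    index=[10, 42, 7],
)

imputer = SimpleImputer(strategy="constant", fill_value="Missing")
imputed_array = imputer.fit_transform(toy_train)  # plain NumPy array

# Rebuild the dataframe, restoring column names and row labels.
toy_train_imputed = pd.DataFrame(
    imputed_array, columns=toy_train.columns, index=toy_train.index
)
print(toy_train_imputed)
```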

To finish, let's impute missing values using Feature-engine. First, we need to separate the dataset into train and test sets, just like we did at the start of this recipe, where only the target A16 was dropped from the predictors.

Next, let's set up the CategoricalVariableImputer() from Feature-engine, which replaces missing values with the Missing string, specifying the categorical variables to impute, and then fit the transformer to the train set:

imputer = CategoricalVariableImputer(variables=['A4', 'A5', 'A6', 'A7'])
imputer.fit(X_train)

If we don't pass a list with categorical variables, CategoricalVariableImputer() will select all categorical variables in the train set.

Finally, let's replace the missing values:

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

Remember that you can check that no missing values remain by calling pandas' isnull(), followed by sum(), on the imputed dataframe, for example X_train.isnull().sum().
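A minimal sketch of that check, run on a made-up dataframe before and after imputation, might look as follows. The toy values are assumptions for illustration; with the real data you would call the same methods on X_train and X_test.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing entries (values are made up).
toy = pd.DataFrame({
    "A6": pd.Series(["w", np.nan, "q", np.nan], dtype=object),
    "A7": pd.Series(["v", "h", np.nan, "v"], dtype=object),
})

# Number of missing values per column before imputation.
missing_before = toy.isnull().sum()

toy_imputed = toy.fillna("Missing")

# After imputation, every per-column count should be zero.
missing_after = toy_imputed.isnull().sum()

print(missing_before)
print(missing_after)
```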

How it works...

In this recipe, we replaced the missing values in categorical variables of the Credit Approval Data Set with the Missing string using pandas, scikit-learn, and Feature-engine. First, we loaded the data and divided it into train and test sets using train_test_split(), as described in the Performing mean or median imputation recipe. To impute missing data with pandas, we used the fillna() method, passing the Missing string as its argument, and wrote the result back to each column of the train and test sets.

To replace missing values using scikit-learn, we called SimpleImputer(), set strategy to constant, and passed the Missing string to the fill_value argument. Next, we fitted the imputer to the train set and replaced missing values in the train and test sets using the transform() method, which returned NumPy arrays.

Finally, we replaced missing values with CategoricalVariableImputer() from Feature-engine, specifying the variables to impute in a list. With the fit() method, CategoricalVariableImputer() checked that the variables were categorical, and with transform(), missing values were replaced with the Missing string in both train and test sets, returning pandas dataframes.

Note that, unlike SimpleImputer(), CategoricalVariableImputer() will not impute numerical variables.
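One way to keep SimpleImputer() away from numerical variables is to hand it only the categorical columns. The sketch below is an assumption-laden illustration rather than part of the recipe: it uses a made-up toy dataframe and picks the categorical columns with pandas' select_dtypes(), which assumes they are stored with the object dtype.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numerical and one categorical column (made-up values).
toy = pd.DataFrame({
    "A2": [30.5, np.nan, 22.0],
    "A4": pd.Series(["u", np.nan, "y"], dtype=object),
})

# Pick the categorical columns by dtype and impute only those.
cat_vars = toy.select_dtypes(include="object").columns.tolist()
imputer = SimpleImputer(strategy="constant", fill_value="Missing")

toy_imputed = toy.copy()
toy_imputed[cat_vars] = imputer.fit_transform(toy[cat_vars])

# The numerical column keeps its float dtype and its missing value.
print(toy_imputed.dtypes)
```

The numerical column is left as it was, so it can be imputed separately with a method suited to numbers.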