Capturing missing values in a bespoke category
Missing data in categorical variables can be treated as a separate category, so it is common to replace missing values with the Missing string. In this recipe, we will learn how to do so using pandas, scikit-learn, and Feature-engine.
How to do it...
To proceed with the recipe, let's import the required tools and prepare the dataset:
- Import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import CategoricalVariableImputer
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- Let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
- Let's replace missing values in four categorical variables with the Missing string:
for var in ['A4', 'A5', 'A6', 'A7']:
    X_train[var].fillna('Missing', inplace=True)
    X_test[var].fillna('Missing', inplace=True)
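Depending on the pandas version, calling fillna() with inplace=True on a column selected from a dataframe may emit a warning or leave the dataframe unchanged. A minimal sketch of the assignment-based alternative follows; the toy dataframe and its values are made up for illustration and are not taken from the Credit Approval Data Set:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two categorical columns.
toy = pd.DataFrame({
    "A4": ["u", np.nan, "y", "u"],
    "A5": ["g", "p", np.nan, np.nan],
})

# Assign the filled column back to the dataframe instead of
# modifying the selected column in place.
for var in ["A4", "A5"]:
    toy[var] = toy[var].fillna("Missing")

# Missing now shows up as a category of its own.
print(toy["A5"].value_counts())
```

The same loop can be applied to X_train and X_test with the four variables used in this recipe.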
Alternatively, we can replace missing values with the Missing string using scikit-learn as follows.
- First, let's separate the data into train and test sets while keeping only categorical variables:
X_train, X_test, y_train, y_test = train_test_split(
    data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3,
    random_state=0)
- Let's set up SimpleImputer() so that it replaces missing data with the Missing string and fit it to the train set:
imputer = SimpleImputer(strategy='constant', fill_value='Missing')
imputer.fit(X_train)
- Let's replace the missing values:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
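By default, SimpleImputer() returns NumPy arrays, so the column names and index are lost after transform(). If a dataframe is needed for later steps, one option is to wrap the result again, as in the sketch below. The toy train and test sets are hypothetical and only serve to keep the example self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train and test sets with missing categories.
toy_train = pd.DataFrame({"A4": ["u", np.nan, "y"], "A5": ["g", "p", np.nan]})
toy_test = pd.DataFrame({"A4": [np.nan, "u"], "A5": ["g", np.nan]})

imputer = SimpleImputer(strategy="constant", fill_value="Missing")
imputer.fit(toy_train)

# transform() returns a NumPy array by default.
test_array = imputer.transform(toy_test)

# Rebuild a dataframe, reusing the original column names and index.
toy_test_imputed = pd.DataFrame(
    test_array, columns=toy_test.columns, index=toy_test.index
)
print(toy_test_imputed)
```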
To finish, let's impute missing values using Feature-engine. First, we need to separate the dataset, just like we did in step 3 of this recipe.
- Next, let's set up the CategoricalVariableImputer() from Feature-engine, which replaces missing values with the Missing string, specifying the categorical variables to impute, and then fit the transformer to the train set:
imputer = CategoricalVariableImputer(variables=['A4', 'A5', 'A6', 'A7'])
imputer.fit(X_train)
- Finally, let's replace the missing values:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Remember that you can confirm that the missing values have been replaced by using pandas' isnull(), followed by sum().
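The check can be sketched as follows on a small, made-up dataframe (the values are hypothetical). Note that isnull() is a pandas method, so a NumPy array returned by scikit-learn would first need to be converted into a dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns.
toy = pd.DataFrame({"A4": ["u", np.nan, "y"], "A5": [np.nan, np.nan, "g"]})

# Count missing values per variable before imputation.
before = toy.isnull().sum()

# Impute with the Missing string and count again.
filled = toy.fillna("Missing")
after = filled.isnull().sum()

print(before)
print(after)
```

A count of zero for every variable after imputation indicates that no missing values remain.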
How it works...
In this recipe, we replaced the missing values in categorical variables of the Credit Approval Data Set with the Missing string using pandas, scikit-learn, and Feature-engine. First, we loaded the data and divided it into train and test sets using train_test_split(), as described in the Performing mean or median imputation recipe. To impute missing data with pandas, we used the fillna() method, passed the Missing string as an argument, and set inplace=True to replace the values directly in the original dataframe.
To replace missing values using scikit-learn, we called SimpleImputer(), set strategy to constant, and passed the Missing string to the fill_value argument. Next, we fitted the imputer to the train set and replaced missing values in the train and test sets using the transform() method, which returned NumPy arrays.
Finally, we replaced missing values with CategoricalVariableImputer() from Feature-engine, specifying the variables to impute in a list. With the fit() method, CategoricalVariableImputer() checked that the variables were categorical, and with transform(), missing values were replaced with the Missing string in both train and test sets, returning pandas dataframes.
See also
To learn more about Feature-engine's CategoricalVariableImputer(), go to https://feature-engine.readthedocs.io/en/latest/imputers/CategoricalVariableImputer.html.