Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback

Published in Jan 2020

Publisher Packt

ISBN-13 9781789806311

Length 372 pages

Edition 1st Edition

Languages

Python

Tools

NumPy

Concepts

Machine Learning

Author (1):

Soledad Galli


Implementing mode or frequent category imputation

Mode imputation consists of replacing missing values with the mode, that is, the most frequent value of the variable. We normally use this procedure for categorical variables, hence the name frequent category imputation. Frequent categories are estimated using the train set and then used to impute values in the train, test, and future datasets. Thus, we need to learn and store these parameters, which we can do using scikit-learn and Feature-engine's transformers. In this recipe, we will learn how to do so.

If the percentage of missing values is high, frequent category imputation may distort the original distribution of categories.
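The following is a minimal sketch of that distortion, assuming a made-up variable (the series `s` is hypothetical and not part of the Credit Approval dataset) in which half of the observations are missing. It compares the category proportions among the observed values with the proportions after mode imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy variable: 10 observations, 5 of them missing.
s = pd.Series(['a', 'a', 'a', 'b', 'b',
               np.nan, np.nan, np.nan, np.nan, np.nan])

# Category proportions among the observed values only (a: 0.6, b: 0.4).
observed = s.value_counts(normalize=True)

# mode() ignores missing values by default, so the mode here is 'a'.
mode = s.mode()[0]

# Proportions after every missing value is replaced by the mode (a: 0.8, b: 0.2).
imputed = s.fillna(mode).value_counts(normalize=True)

print(observed)
print(imputed)
```

In this toy case, the share of the frequent category grows from 60% to 80%, which is the kind of over-representation the note above warns about.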

How to do it...

To begin, let's make a few imports and prepare the data:

Let's import pandas and the required functions and classes from scikit-learn and Feature-engine:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import FrequentCategoryImputer

Let's load the dataset:

data = pd.read_csv('creditApprovalUCI.csv')

Frequent categories should be calculated using the train set variables, so let's separate the data into train and test sets and their respective targets:

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3, 
    random_state=0)

Remember that you can check the fraction of missing values per variable in the train set with X_train.isnull().mean(); multiply the result by 100 to obtain a percentage.
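As a small illustration of that check, here is a sketch on a made-up dataframe (the values below are hypothetical; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical train set with missing values in two of three columns.
X_train = pd.DataFrame({
    'A4': ['u', np.nan, 'y', 'u'],
    'A5': ['g', 'g', np.nan, np.nan],
    'A6': ['c', 'q', 'w', 'c'],
})

# isnull() flags the missing cells; the column-wise mean of those flags
# is the fraction of missing values in each variable.
missing_fraction = X_train.isnull().mean()

# Show the variables with the most missing data first.
print(missing_fraction.sort_values(ascending=False))
```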

Let's replace missing values with the frequent category, that is, the mode, in four categorical variables:

for var in ['A4', 'A5', 'A6', 'A7']:
    value = X_train[var].mode()[0]
    X_train[var] = X_train[var].fillna(value)
    X_test[var] = X_test[var].fillna(value)

Note how we calculate the mode in the train set and use that value to replace the missing data in the train and test sets.

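By default, pandas fillna() returns a new Series with the imputed values and leaves the original untouched, which is why we reassign the result to the dataframe column. Alternatively, fillna() accepts inplace=True, as in X_train[var].fillna(value, inplace=True); note that the replacement value is still required. In recent pandas versions, however, an in-place fill on a selected column may not propagate to the dataframe, so reassigning the result is the more reliable pattern.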
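As an alternative to the loop, the modes of several variables can be learned in a single call and passed to fillna() as a Series. The following is a minimal sketch on made-up data (the dataframes and the names `cols` and `modes` are illustrative assumptions, not part of the recipe):

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets with missing categories.
X_train = pd.DataFrame({
    'A4': ['u', 'u', np.nan, 'y'],
    'A5': ['g', np.nan, 'g', 'p'],
})
X_test = pd.DataFrame({
    'A4': [np.nan, 'y'],
    'A5': ['p', np.nan],
})

cols = ['A4', 'A5']

# DataFrame.mode() returns one row per tied mode; the first row holds
# the top category of each column, learned from the train set only.
modes = X_train[cols].mode().iloc[0]

# fillna() matches the Series index to column names and returns new
# dataframes; X_train and X_test themselves are left unchanged.
X_train_imputed = X_train.fillna(modes)
X_test_imputed = X_test.fillna(modes)

print(X_test_imputed)
```

Keeping the imputed data in new variables makes it easy to compare the variables before and after imputation.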

Now, let's impute missing values by the most frequent category using scikit-learn.

First, let's separate the original dataset into train and test sets and only retain the categorical variables:

X_train, X_test, y_train, y_test = train_test_split(
    data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3, 
    random_state=0)

Let's create a frequent category imputer with SimpleImputer() from scikit-learn:

imputer = SimpleImputer(strategy='most_frequent')

SimpleImputer() from scikit-learn will learn the mode for numerical and categorical variables alike. But in practice, mode imputation is done for categorical variables only.
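To illustrate this behavior, here is a small sketch on a made-up dataframe with one numerical and one categorical column (the `age` column and all values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: a numerical column and a categorical column,
# each with one missing value.
X = pd.DataFrame({
    'age': [30.0, 30.0, 45.0, np.nan],
    'A4': pd.Series(['u', 'y', 'u', np.nan], dtype=object),
})

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(X)

# One learned value per column, numerical and categorical alike.
print(imputer.statistics_)
```

If only the categorical variables should be imputed, one option is to fit the imputer on a slice of the dataframe that contains just those columns, as the recipe does.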

Let's fit the imputer to the train set so that it learns the most frequent values:

imputer.fit(X_train)

Let's inspect the most frequent values learned by the imputer:

imputer.statistics_

The most frequent values are stored in the statistics_ attribute of the imputer, as follows:

array(['u', 'g', 'c', 'v'], dtype=object)

Let's replace missing values with frequent categories:

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

Note that SimpleImputer() will return a NumPy array and not a pandas dataframe.
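If a dataframe is needed for later steps, the array can be wrapped again with the original column names and index. The following is a minimal sketch on made-up data (the values and the index labels are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train set with a non-default index, as typically
# produced by train_test_split().
X_train = pd.DataFrame(
    {
        'A4': ['u', 'u', np.nan, 'y'],
        'A5': ['g', np.nan, 'g', 'p'],
    },
    index=[10, 11, 12, 13],
    dtype=object,
)

imputer = SimpleImputer(strategy='most_frequent')

# fit_transform() returns a NumPy array without column names or index.
X_array = imputer.fit_transform(X_train)

# Rebuild a dataframe, restoring the original labels.
X_train_imputed = pd.DataFrame(
    X_array, columns=X_train.columns, index=X_train.index)

print(X_train_imputed)
```

Restoring the index keeps the rows aligned with y_train, which still carries the original labels.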

Finally, let's impute missing values using Feature-engine. First, we need to load the dataset and separate it into train and test sets, just like we did at the beginning of this recipe.

Next, let's create a frequent category imputer with FrequentCategoryImputer() from Feature-engine, specifying the categorical variables in which missing data should be imputed:

mode_imputer = FrequentCategoryImputer(variables=['A4', 'A5', 'A6', 'A7'])

FrequentCategoryImputer() will select all categorical variables in the train set by default; that is, unless we pass a list of variables to impute.

Let's fit the imputation transformer to the train set so that it learns the most frequent categories:

mode_imputer.fit(X_train)

Let's inspect the learned frequent categories:

mode_imputer.imputer_dict_

We can see the dictionary with the most frequent values in the following output:

{'A4': 'u', 'A5': 'g', 'A6': 'c', 'A7': 'v'}

Finally, let's replace the missing values with frequent categories:

X_train = mode_imputer.transform(X_train)
X_test = mode_imputer.transform(X_test)

FrequentCategoryImputer() returns a pandas dataframe with the imputed values.

Remember that you can check that the categorical variables do not contain missing values by using X_train[['A4', 'A5', 'A6', 'A7']].isnull().mean().

How it works...

In this recipe, we replaced the missing values of the categorical variables in the Credit Approval Data Set with the most frequent categories using pandas, scikit-learn, and Feature-engine. Frequent categories should be learned from the train set, so we divided the dataset into train and test sets using train_test_split() from scikit-learn, as described in the Performing mean or median imputation recipe.

To impute missing data with pandas in multiple categorical variables, we created a for loop over the categorical variables A4 to A7, and for each variable, we calculated the most frequent value in the train set using the pandas mode() method. Then, we used this value to replace the missing values in the train and test sets with pandas fillna(). The fillna() method returned a pandas Series without missing values, which we reassigned to the original variable in the dataframe.

To replace missing values using scikit-learn, we divided the data into train and test sets but only kept the categorical variables. Next, we set up SimpleImputer() and specified most_frequent as the imputation method in the strategy parameter. With the fit() method, the imputer learned the frequent categories and stored them in its statistics_ attribute. With the transform() method, the missing values in the train and test sets were replaced with the learned statistics, returning NumPy arrays.

Finally, to replace the missing values via Feature-engine, we set up FrequentCategoryImputer(), specifying the variables to impute in a list. With fit(), the FrequentCategoryImputer() learned and stored frequent categories in a dictionary in the imputer_dict_ attribute. With the transform() method, missing values in the train and test sets were replaced with stored parameters, which allowed us to obtain pandas dataframes without missing data.

Note that, unlike SimpleImputer() from scikit-learn, FrequentCategoryImputer() imputes only categorical variables and ignores numerical ones.
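To make the learn-and-store logic concrete, the following is a minimal sketch of a frequent category imputer written with pandas only. The class name `ModeImputerSketch` is hypothetical, and the code is an illustration of the idea, not Feature-engine's implementation; it assumes that categorical variables are stored with the object dtype:

```python
import numpy as np
import pandas as pd


class ModeImputerSketch:
    """Hypothetical stand-in for a frequent category imputer."""

    def __init__(self, variables=None):
        self.variables = variables

    def fit(self, X):
        # Assumption: without an explicit list, treat object-dtype
        # columns as the categorical variables and skip numerical ones.
        if self.variables is None:
            self.variables_ = list(
                X.select_dtypes(include='object').columns)
        else:
            self.variables_ = list(self.variables)
        # Learn and store one mode per variable from the data given to fit().
        self.imputer_dict_ = {
            var: X[var].mode()[0] for var in self.variables_}
        return self

    def transform(self, X):
        # fillna() with a dictionary fills each named column with its
        # stored value and returns a new dataframe.
        return X.fillna(value=self.imputer_dict_)


# Hypothetical data: one categorical and one numerical variable.
X_train = pd.DataFrame({
    'A4': pd.Series(['u', 'u', np.nan, 'y'], dtype=object),
    'age': [30.0, np.nan, 45.0, 30.0],
})
X_test = pd.DataFrame({
    'A4': pd.Series([np.nan, 'y'], dtype=object),
    'age': [np.nan, 50.0],
})

sketch = ModeImputerSketch().fit(X_train)
X_test_imputed = sketch.transform(X_test)

print(sketch.imputer_dict_)
print(X_test_imputed)
```

Because the dictionary is learned in fit() and only applied in transform(), the test set is imputed with the train set's frequent category, and the numerical variable keeps its missing value.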