Mode imputation consists of replacing missing values with the mode. This procedure is normally applied to categorical variables, hence the name frequent category imputation. Frequent categories are estimated using the train set and then used to impute values in the train, test, and future datasets. Thus, we need to learn and store these parameters, which we can do using scikit-learn's and Feature-engine's transformers. In the following recipe, we will learn how to do so.
Implementing mode or frequent category imputation
How to do it...
To begin, let's make a few imports and prepare the data:
- Let's import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import FrequentCategoryImputer
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- Frequent categories should be calculated using the train set variables, so let's separate the data into train and test sets and their respective targets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
- Let's replace missing values with the frequent category, that is, the mode, in four categorical variables:
for var in ['A4', 'A5', 'A6', 'A7']:
    value = X_train[var].mode()[0]
    X_train[var] = X_train[var].fillna(value)
    X_test[var] = X_test[var].fillna(value)
Note how we calculate the mode in the train set and use that value to replace the missing data in the train and test sets.
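The following is a minimal sketch of the same idea on toy data, to show why the mode is learned from the train set only. The `colour` column and its values are made up for illustration and are not part of the Credit Approval dataset. Here the test set on its own would have a different most frequent value, yet it is still imputed with the value learned from the train set:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the train and test sets (hypothetical values).
train = pd.DataFrame({"colour": ["red", "blue", "red", np.nan, "blue", "red"]})
test = pd.DataFrame({"colour": [np.nan, "blue", np.nan]})

# Learn the most frequent category from the train set only.
# mode() returns a Series of modes, so [0] selects the first one
# (with ties, pandas returns the tied values in sorted order).
frequent = train["colour"].mode()[0]

# Apply the same learned value to both sets.
train["colour"] = train["colour"].fillna(frequent)
test["colour"] = test["colour"].fillna(frequent)

print(frequent)
print(test["colour"].tolist())
```

Because the value is stored in `frequent`, it can be reused later on any future dataset without looking at that dataset's own distribution.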
Now, let's impute missing values by the most frequent category using scikit-learn.
- First, let's separate the original dataset into train and test sets and only retain the categorical variables:
X_train, X_test, y_train, y_test = train_test_split(
    data[['A4', 'A5', 'A6', 'A7']], data['A16'], test_size=0.3,
    random_state=0)
- Let's create a frequent category imputer with SimpleImputer() from scikit-learn:
imputer = SimpleImputer(strategy='most_frequent')
- Let's fit the imputer to the train set so that it learns the most frequent values:
imputer.fit(X_train)
- Let's inspect the most frequent values learned by the imputer:
imputer.statistics_
The most frequent values are stored in the statistics_ attribute of the imputer, as follows:
array(['u', 'g', 'c', 'v'], dtype=object)
- Let's replace missing values with frequent categories:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
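Since transform() returns NumPy arrays, the column names and row index are lost. If a dataframe is needed afterwards, one option is to rebuild it from the array, as in the sketch below. It uses small toy frames (the `_toy` names and their values are assumptions for illustration) and also pairs each entry of statistics_ with its column name:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train and X_test (hypothetical values).
# dtype=object keeps the columns as plain Python strings plus NaN.
X_train_toy = pd.DataFrame(
    {"A4": ["u", "u", np.nan, "y"], "A5": ["g", np.nan, "g", "p"]},
    index=[10, 11, 12, 13],
    dtype=object,
)
X_test_toy = pd.DataFrame(
    {"A4": [np.nan, "y"], "A5": ["p", np.nan]},
    index=[20, 21],
    dtype=object,
)

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(X_train_toy)

# Pair each learned statistic with the column it belongs to.
learned = dict(zip(X_train_toy.columns, imputer.statistics_))

# transform() returns a NumPy array; rebuild a dataframe
# with the original column names and index.
X_test_imputed = pd.DataFrame(
    imputer.transform(X_test_toy),
    columns=X_test_toy.columns,
    index=X_test_toy.index,
)

print(learned)
print(X_test_imputed)
```

Passing the original index back in keeps the imputed rows aligned with y_test, which matters if the two are later joined or compared row by row.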
Finally, let's impute missing values using Feature-engine. First, we need to load and separate the data into train and test sets, just like we did in step 2 and step 3 in this recipe.
- Next, let's create a frequent category imputer with FrequentCategoryImputer() from Feature-engine, specifying the categorical variables whose missing data should be imputed:
mode_imputer = FrequentCategoryImputer(variables=['A4', 'A5', 'A6', 'A7'])
- Let's fit the imputation transformer to the train set so that it learns the most frequent categories:
mode_imputer.fit(X_train)
- Let's inspect the learned frequent categories:
mode_imputer.imputer_dict_
We can see the dictionary with the most frequent values in the following output:
{'A4': 'u', 'A5': 'g', 'A6': 'c', 'A7': 'v'}
- Finally, let's replace the missing values with frequent categories:
X_train = mode_imputer.transform(X_train)
X_test = mode_imputer.transform(X_test)
FrequentCategoryImputer() returns a pandas dataframe with the imputed values.
How it works...
In this recipe, we replaced the missing values of the categorical variables in the Credit Approval Data Set with the most frequent categories using pandas, scikit-learn, and Feature-engine. Frequent categories should be learned from the train set, so we divided the dataset into train and test sets using train_test_split() from scikit-learn, as described in the Performing mean or median imputation recipe.
To impute missing data with pandas in multiple categorical variables, in step 4 we created a for loop over the categorical variables A4 to A7, and for each variable, we calculated the most frequent value using the pandas mode() method in the train set. Then, we used this value to replace the missing values with pandas fillna() in the train and test sets. Pandas fillna() returned a pandas Series without missing values, which we reassigned to the original variable in the dataframe.
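As an alternative to the loop, the modes of several columns can be collected in one dictionary and passed to fillna() in a single call. This is a sketch on toy data rather than the recipe's own code; the `_toy` frames and the `modes` name are assumptions for illustration:

```python
import numpy as np
import pandas as pd

cols = ["A4", "A5"]

# Toy frames standing in for X_train and X_test (hypothetical values).
X_train_toy = pd.DataFrame(
    {"A4": ["u", "u", np.nan, "y"], "A5": ["g", np.nan, "g", "p"]}
)
X_test_toy = pd.DataFrame({"A4": [np.nan, "y"], "A5": ["p", np.nan]})

# DataFrame.mode() returns one row per mode; the first row holds
# the first mode of each column.
modes = X_train_toy[cols].mode().iloc[0].to_dict()

# fillna() accepts a dict that maps column names to fill values.
X_train_toy = X_train_toy.fillna(modes)
X_test_toy = X_test_toy.fillna(modes)

print(modes)
```

Keeping the learned values in a dictionary also makes them easy to store and reapply to future data.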
To replace missing values using scikit-learn, we divided the data into train and test sets but only kept categorical variables. Next, we set up SimpleImputer() and specified most_frequent as the imputation method in the strategy. With the fit() method, imputer learned and stored frequent categories in its statistics_ attribute. With the transform() method, the missing values in the train and test sets were replaced with the learned statistics, returning NumPy arrays.
Finally, to replace the missing values via Feature-engine, we set up FrequentCategoryImputer(), specifying the variables to impute in a list. With fit(), the FrequentCategoryImputer() learned and stored frequent categories in a dictionary in the imputer_dict_ attribute. With the transform() method, missing values in the train and test sets were replaced with stored parameters, which allowed us to obtain pandas dataframes without missing data.
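To make the learn-and-store pattern concrete, here is a minimal hand-written class that mimics the behavior described above. It is a hypothetical stand-in (`ToyModeImputer` is not part of Feature-engine and is not how the library is implemented); it only illustrates that fit() stores a dictionary and transform() returns a dataframe:

```python
import numpy as np
import pandas as pd


class ToyModeImputer:
    """Hypothetical stand-in that mimics the fit/transform pattern."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn and store one frequent category per variable.
        self.imputer_dict_ = {var: X[var].mode()[0] for var in self.variables}
        return self

    def transform(self, X):
        # fillna() with a dict returns a new dataframe and only fills
        # the listed columns, so the caller's dataframe is not modified.
        return X.fillna(self.imputer_dict_)


# Toy frames standing in for X_train and X_test (hypothetical values).
X_train_toy = pd.DataFrame(
    {"A4": ["u", "u", np.nan, "y"], "A5": ["g", np.nan, "g", "p"]}
)
X_test_toy = pd.DataFrame({"A4": [np.nan, "y"], "A5": ["p", np.nan]})

toy_imputer = ToyModeImputer(variables=["A4", "A5"]).fit(X_train_toy)
X_test_out = toy_imputer.transform(X_test_toy)

print(toy_imputer.imputer_dict_)
print(type(X_test_out).__name__)
```

Unlike the NumPy arrays returned by SimpleImputer(), the result keeps the column names and index of the input.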
See also
To learn more about scikit-learn's SimpleImputer(), go to https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.
To learn more about Feature-engine's FrequentCategoryImputer(), go to https://feature-engine.readthedocs.io/en/latest/imputers/FrequentCategoryImputer.html.