Assembling an imputation pipeline with Feature-engine
Feature-engine is an open source Python library that lets us apply different imputation techniques to different feature subsets. Datasets often contain a mix of numerical and categorical variables, with few or many missing values, so we normally apply different imputation techniques to different variables, depending on the nature of the variable and the machine learning algorithm we want to build. With Feature-engine, we can assemble multiple imputation techniques in a single step, and in this recipe, we will learn how to do this.
How to do it...
Let's begin by importing the necessary Python libraries and preparing the data:
- Let's import pandas and the required function and class from scikit-learn, and the missing data imputation module from Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
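# Written for Feature-engine releases before 1.0; from 1.0 onward the imputers
# live in feature_engine.imputation and some classes were renamed.
import feature_engine.missing_data_imputers as mdi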
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- Let's divide the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
- Let's create lists with the names of the variables that we want to apply specific imputation techniques to:
features_num_arbitrary = ['A3', 'A8']
features_num_median = ['A2', 'A14']
features_cat_frequent = ['A4', 'A5', 'A6', 'A7']
features_cat_missing = ['A1', 'A9', 'A10']
- Let's assemble an arbitrary value imputer, a median imputer, a frequent category imputer, and an imputer to replace any missing values with the Missing string within a scikit-learn pipeline:
pipe = Pipeline(steps=[
    ('imp_num_arbitrary', mdi.ArbitraryNumberImputer(
        variables=features_num_arbitrary)),
    ('imp_num_median', mdi.MeanMedianImputer(
        imputation_method='median', variables=features_num_median)),
    ('imp_cat_frequent', mdi.FrequentCategoryImputer(
        variables=features_cat_frequent)),
    ('imp_cat_missing', mdi.CategoricalVariableImputer(
        variables=features_cat_missing))
])
- Let's fit the pipeline to the train set so that each imputer learns and stores the imputation parameters:
pipe.fit(X_train)
- Finally, let's replace missing values in the train and test sets:
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)
After fitting the pipeline, we can save it as an object so that the learned parameters can be reused later, for example, to impute new data.
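The idea of storing a fitted pipeline can be sketched as follows. This is a minimal illustration rather than the recipe's own code: it assumes Python's built-in pickle module for serialization and, to stay self-contained, uses a toy DataFrame with scikit-learn's SimpleImputer standing in for the Feature-engine imputers. The values in the A2 column are made up.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy data (hypothetical); the recipe would use X_train and X_test instead.
train = pd.DataFrame({'A2': [20.0, 30.0, np.nan, 40.0]})
new_data = pd.DataFrame({'A2': [np.nan, 25.0]})

# Stand-in pipeline with a single median imputer.
pipe = Pipeline(steps=[('imp_num_median', SimpleImputer(strategy='median'))])
pipe.fit(train)  # learns the median of A2 from the train set

# Serialize the fitted pipeline and restore it, as we would across sessions.
restored = pickle.loads(pickle.dumps(pipe))

# The restored pipeline reuses the stored median; nothing is learned again.
imputed = restored.transform(new_data)
```

To persist the pipeline on disk instead of in memory, the same object could be written with pickle.dump() to a file opened in 'wb' mode and read back with pickle.load().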
How it works...
In this recipe, we performed different imputation techniques on different variable groups from the Credit Approval Data Set by utilizing Feature-engine within a single scikit-learn pipeline.
After loading and dividing the dataset, we created four lists of features. The first list contained numerical variables to impute with an arbitrary value. The second list contained numerical variables to impute by the median. The third list contained categorical variables to impute with a frequent category. Finally, the fourth list contained categorical variables to impute with the Missing string.
Next, we assembled the different Feature-engine imputers within a single scikit-learn pipeline. With ArbitraryNumberImputer(), we imputed missing values with the number 999, which is the imputer's default value because we did not specify another one; with MeanMedianImputer(), we performed median imputation; with FrequentCategoryImputer(), we replaced the missing values with the mode; and with CategoricalVariableImputer(), we replaced the missing values with the Missing string. We specified a list of features to impute within each imputer.
With the fit() method, each imputer learned and stored its parameters from the train set, and with the transform() method, the missing values were replaced using those stored parameters, returning complete pandas dataframes.
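To make the split between learning and replacing concrete, here is a rough pandas-only sketch of what the four imputation techniques amount to. It does not use Feature-engine and is not how the library is implemented; the toy values and the params dictionary are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy train and test sets (hypothetical values).
train = pd.DataFrame({
    'A3': [1.0, np.nan, 3.0, 5.0],
    'A2': [20.0, 30.0, np.nan, 40.0],
    'A4': ['u', 'u', None, 'y'],
    'A1': ['a', None, 'b', 'a'],
})
test = pd.DataFrame({
    'A3': [np.nan, 2.0],
    'A2': [np.nan, 50.0],
    'A4': [None, 'y'],
    'A1': [None, 'b'],
})

# "fit": derive the replacement values from the train set only.
params = {
    'A3': 999,                    # arbitrary number, nothing to learn
    'A2': train['A2'].median(),   # median of the train set
    'A4': train['A4'].mode()[0],  # most frequent category in the train set
    'A1': 'Missing',              # fixed string, nothing to learn
}

# "transform": fill missing values per column with the stored parameters.
train_imputed = train.fillna(value=params)
test_imputed = test.fillna(value=params)
```

Because the parameters come from the train set alone, the test set is imputed with the train median and mode, which mirrors what the pipeline does when we call fit() on X_train and transform() on both sets.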
See also
To learn more about Feature-engine, take a look at the following links:
- Feature-engine: www.trainindata.com/feature-engine
- Docs: https://feature-engine.readthedocs.io/en/latest/
- GitHub repository: https://github.com/solegalli/feature_engine/