You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback

Published in Jan 2020

Publisher Packt

ISBN-13 9781789806311

Length 372 pages

Edition 1st Edition

Languages

Python

Tools

NumPy

Concepts

Machine Learning

Author (1):

Soledad Galli

Assembling an imputation pipeline with Feature-engine

Feature-engine is an open-source Python library that allows us to easily apply different imputation techniques to different feature subsets. Our datasets often contain a mix of numerical and categorical variables, some with few and some with many missing values. Therefore, we normally apply different imputation techniques to different variables, depending on the nature of the variable and the machine learning model we want to build. With Feature-engine, we can assemble multiple imputation techniques in a single step, and in this recipe, we will learn how to do this.

How to do it...

Let's begin by importing the necessary Python libraries and preparing the data:

Let's import pandas and the required function and class from scikit-learn, and the missing data imputation module from Feature-engine:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import feature_engine.missing_data_imputers as mdi

Let's load the dataset:

data = pd.read_csv('creditApprovalUCI.csv')

Let's divide the data into train and test sets:

# A16 is the target; 30% of the rows are held out as the test set
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)

Let's create lists with the names of the variables that we want to apply specific imputation techniques to:

features_num_arbitrary = ['A3', 'A8']
features_num_median = ['A2', 'A14']
features_cat_frequent = ['A4', 'A5', 'A6', 'A7']
features_cat_missing = ['A1', 'A9', 'A10']

Let's assemble an arbitrary value imputer, a median imputer, a frequent category imputer, and an imputer to replace any missing values with the Missing string within a scikit-learn pipeline:

pipe = Pipeline(steps=[
    ('imp_num_arbitrary', mdi.ArbitraryNumberImputer(
        variables=features_num_arbitrary)),
    ('imp_num_median', mdi.MeanMedianImputer(
        imputation_method='median', variables=features_num_median)),
    ('imp_cat_frequent', mdi.FrequentCategoryImputer(
        variables=features_cat_frequent)),
    ('imp_cat_missing', mdi.CategoricalVariableImputer(
        variables=features_cat_missing))
])

Note how we pass the feature lists we created in the previous step to the imputers.

Let's fit the pipeline to the train set so that each imputer learns and stores the imputation parameters:

pipe.fit(X_train)

Finally, let's replace missing values in the train and test sets:

X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

After fitting the pipeline, we can store it as a single object so that the learned parameters can be reused later.

How it works...

In this recipe, we applied different imputation techniques to different variable groups from the Credit Approval Data Set using Feature-engine imputers within a single scikit-learn pipeline.

After loading and dividing the dataset, we created four lists of features. The first list contained numerical variables to impute with an arbitrary value. The second list contained numerical variables to impute with the median. The third list contained categorical variables to impute with the most frequent category. Finally, the fourth list contained categorical variables to impute with the Missing string.

Next, we assembled the different Feature-engine imputers within a single scikit-learn pipeline. With ArbitraryNumberImputer(), we imputed missing values with the number 999, which is the imputer's default because we did not set a value ourselves; with MeanMedianImputer(), we performed median imputation; with FrequentCategoryImputer(), we replaced the missing values with the mode; and with CategoricalVariableImputer(), we replaced the missing values with the Missing string. We specified a list of features to impute within each imputer.
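To make the effect of each imputer concrete, here is a minimal pandas-only sketch of the four strategies applied to a made-up dataframe. The column names and values are hypothetical, and this is not how Feature-engine implements its imputers; it only mirrors the result we would expect from each one, assuming 999 as the arbitrary number.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the columns are invented for illustration.
toy = pd.DataFrame({
    'num_a': [1.0, np.nan, 3.0, 4.0],
    'num_b': [10.0, 20.0, np.nan, 60.0],
    'cat_a': ['x', 'y', 'x', np.nan],
    'cat_b': ['u', np.nan, 'v', 'u'],
})

imputed = toy.copy()

# Arbitrary number imputation (999 is assumed here).
imputed['num_a'] = imputed['num_a'].fillna(999)

# Median imputation: the median of 10, 20, and 60 is 20.
imputed['num_b'] = imputed['num_b'].fillna(toy['num_b'].median())

# Frequent category imputation: the mode of cat_a is 'x'.
imputed['cat_a'] = imputed['cat_a'].fillna(toy['cat_a'].mode()[0])

# Replace missing categories with the string 'Missing'.
imputed['cat_b'] = imputed['cat_b'].fillna('Missing')

print(imputed)
```

Each strategy touches only its own column, which is the same idea as passing a separate list of variables to each imputer.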

When assembling a scikit-learn pipeline, we gave each step a name using a string, and next to it we set up each of the Feature-engine imputers, specifying the feature subset within each imputer.
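The step names are more than labels: scikit-learn lets us retrieve a fitted step by its name. The sketch below illustrates this with scikit-learn's SimpleImputer as a stand-in for a Feature-engine imputer, so that it does not depend on a particular Feature-engine version. The toy data and the step name are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical numerical data with one missing value per column.
toy = pd.DataFrame({'a': [1.0, 2.0, np.nan, 9.0],
                    'b': [np.nan, 4.0, 6.0, 8.0]})

# Each step is a (name, transformer) tuple; the name is a string we choose.
toy_pipe = Pipeline(steps=[
    ('imp_median', SimpleImputer(strategy='median')),
])
toy_pipe.fit(toy)

# Retrieve the fitted step by its name and inspect what it learned.
median_step = toy_pipe.named_steps['imp_median']
print(median_step.statistics_)  # one learned median per column
```

Descriptive names such as imp_num_median make it easier to find the step we want to inspect once the pipeline holds several imputers.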

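With the fit() method, the imputers learned and stored the imputation parameters from the train set, such as the median or the most frequent category of each variable. With transform(), the missing values were replaced using those stored parameters, returning complete pandas dataframes.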
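The split between learning and replacing can be sketched with a small, hypothetical imputer class. ToyMedianImputer below is invented for illustration and is not part of Feature-engine; it only shows that the median is learned from the train set in fit() and then reused on any dataset in transform().

```python
import numpy as np
import pandas as pd


class ToyMedianImputer:
    """Hypothetical imputer for illustration; not part of Feature-engine."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X):
        # Learn one median per variable from the data passed to fit().
        self.medians_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # Work on a copy and fill missing values with the stored medians.
        X = X.copy()
        X[self.variables] = X[self.variables].fillna(self.medians_)
        return X


# Made-up train and test sets with a single numerical variable.
train = pd.DataFrame({'A2': [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({'A2': [np.nan, 100.0]})

imputer = ToyMedianImputer(variables=['A2']).fit(train)
train_t = imputer.transform(train)
test_t = imputer.transform(test)

print(imputer.medians_)
print(test_t)
```

Note that the missing value in the test set is filled with the median learned from the train set, not with a statistic computed from the test set itself.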

We can store the fitted scikit-learn pipeline, together with its Feature-engine transformers, as a single object so that the learned parameters can be reused later.
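One way to store a fitted pipeline is Python's built-in pickle module. The following is a minimal sketch under stated assumptions: it uses scikit-learn's SimpleImputer as a stand-in for the Feature-engine imputers, the data is made up, and the pipeline is serialized to bytes in memory rather than to a file.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical data: the median of the observed train values is 2.
train = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
new_data = pd.DataFrame({'a': [np.nan, 5.0]})

fitted = Pipeline(steps=[('imp_median', SimpleImputer(strategy='median'))])
fitted.fit(train)

# Serialize the fitted pipeline; pickle.dump() to an open file is analogous.
payload = pickle.dumps(fitted)

# Later, or in another process: restore it and impute without refitting.
restored = pickle.loads(payload)
result = restored.transform(new_data)
print(result)
```

The restored pipeline imputes new data with the parameters learned at fit time. As with any pickled object, it should be loaded with the same library versions that were used to create it.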