Assembling an imputation pipeline with scikit-learn
Datasets often contain a mix of numerical and categorical variables. Some variables may contain only a few missing data points, while others may contain a large proportion, and the mechanisms by which data is missing may also vary from variable to variable. We may therefore wish to apply different imputation procedures to different variables. In this recipe, we will learn how to apply different imputation procedures to different feature subsets using scikit-learn.
How to do it...
To proceed with the recipe, let's import the required libraries and classes and prepare the dataset:
- Let's import pandas and the required classes from scikit-learn:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- Let's divide the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'],
    test_size=0.3, random_state=0)
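Before deciding which imputation technique suits each variable, it can help to inspect the fraction of missing values per column. A minimal sketch on a toy DataFrame (the column names mimic the credit approval data but the values here are made up):

```python
import numpy as np
import pandas as pd

# toy data: one numerical and one categorical column, each with one gap
df = pd.DataFrame({
    'A2': [20.0, np.nan, 35.5, 41.0],
    'A4': ['u', 'y', np.nan, 'u'],
})

# fraction of missing values per column
missing_fraction = df.isnull().mean()
```

In practice you would run `X_train.isnull().mean()` on the real train set and use the result to decide which variables go into which imputation list.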
- Let's group a subset of columns to which we want to apply different imputation techniques in lists:
features_num_arbitrary = ['A3', 'A8']
features_num_median = ['A2', 'A14']
features_cat_frequent = ['A4', 'A5', 'A6', 'A7']
features_cat_missing = ['A1', 'A9', 'A10']
- Let's create different imputation transformers by wrapping SimpleImputer() in scikit-learn pipelines:
imputer_num_arbitrary = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=99)),
])
imputer_num_median = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])
imputer_cat_frequent = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
])
imputer_cat_missing = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
])
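To see what each strategy does in isolation, here is a small standalone sketch with made-up data, using two of the strategies from above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({'x': [1.0, np.nan, 3.0, 5.0]})
cat = pd.DataFrame({'c': ['a', 'a', np.nan, 'b']})

# strategy='median' replaces NaN with the column median (here, 3.0)
median_imp = SimpleImputer(strategy='median')
filled_num = median_imp.fit_transform(num)

# strategy='most_frequent' replaces NaN with the mode (here, 'a')
frequent_imp = SimpleImputer(strategy='most_frequent')
filled_cat = frequent_imp.fit_transform(cat)
```

The same pattern applies to `strategy='constant'`, where `fill_value` supplies the arbitrary number or string to impute with.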
- Now, let's assemble the pipelines with the imputers within ColumnTransformer() and assign each one to the corresponding feature subset we created earlier:
preprocessor = ColumnTransformer(transformers=[
    ('imp_num_arbitrary', imputer_num_arbitrary, features_num_arbitrary),
    ('imp_num_median', imputer_num_median, features_num_median),
    ('imp_cat_frequent', imputer_cat_frequent, features_cat_frequent),
    ('imp_cat_missing', imputer_cat_missing, features_cat_missing),
], remainder='passthrough')
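The `remainder='passthrough'` argument tells ColumnTransformer() to append any column not listed in a transformer to the output unchanged, instead of dropping it (the default). A toy sketch of that behavior, with made-up column names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'a': [1.0, np.nan],   # assigned to an imputer
    'b': [10.0, 20.0],    # not assigned: passed through unchanged
})

ct = ColumnTransformer(transformers=[
    ('imp', SimpleImputer(strategy='constant', fill_value=0.0), ['a']),
], remainder='passthrough')

# transformed columns come first, passthrough columns after them
res = ct.fit_transform(df)
```

Note that the output column order follows the transformer list, not the original DataFrame, which matters when you later reattach column names.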
- Next, we need to fit the preprocessor to the train set so that the imputation parameters are learned:
preprocessor.fit(X_train)
- Finally, let's replace the missing values in the train and test sets:
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
Remember that scikit-learn transformers return NumPy arrays, so the column names are lost after transformation. The beauty of this procedure is that we can store the entire preprocessor as a single object, preserving all the parameters learned by the different transformers.
How it works...
In this recipe, we carried out different imputation techniques over different variable groups using scikit-learn's SimpleImputer() and ColumnTransformer().
After loading and dividing the dataset, we created four lists of features. The first list contained numerical variables to impute with an arbitrary value. The second list contained numerical variables to impute by the median. The third list contained categorical variables to impute by a frequent category. Finally, the fourth list contained categorical variables to impute with the Missing string.
Next, we created multiple imputation objects using SimpleImputer() inside scikit-learn pipelines. To assemble each Pipeline(), we gave the step a name as a string, imputer in our example, followed by the imputation object created with SimpleImputer(), varying the strategy argument across the different imputation techniques.
Next, we arranged the pipelines with the different imputation strategies within ColumnTransformer(). Each entry of ColumnTransformer() consists of a name given as a string, one of the pipelines we created, and the list of features to be imputed by that pipeline.
Next, we fitted ColumnTransformer() to the train set, where the imputers learned the values to be used to replace missing data from the train set. Finally, we imputed the missing values in the train and test sets, using the transform() method of ColumnTransformer() to obtain complete NumPy arrays.
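After fitting, the learned parameters can be inspected through the ColumnTransformer's `named_transformers_` attribute, which maps each entry name to its fitted pipeline. A small sketch with made-up data, mirroring the `imp_num_median` entry from the recipe:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

df = pd.DataFrame({'A2': [10.0, np.nan, 30.0, 20.0]})

pre = ColumnTransformer(transformers=[
    ('imp_num_median',
     Pipeline(steps=[('imputer', SimpleImputer(strategy='median'))]),
     ['A2']),
])
pre.fit(df)

# the median learned from the observed values 10, 30, 20 is 20.0
median_learned = (pre.named_transformers_['imp_num_median']
                     .named_steps['imputer'].statistics_)
```

This is the same mechanism that lets a saved preprocessor reapply exactly the train-set medians, modes, and constants to the test set or to new data.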
See also
To learn more about scikit-learn transformers and how to use them, take a look at the following links:
- SimpleImputer(): https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
- ColumnTransformer(): https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
- Stack Overflow: https://stackoverflow.com/questions/54160370/how-to-use-sklearn-column-transformer