Assembling an imputation pipeline with scikit-learn
Datasets often contain a mix of numerical and categorical variables. Some variables may contain only a few missing data points, while others may contain a large proportion, and the mechanisms by which data is missing may also vary from variable to variable. We may therefore wish to apply different imputation procedures to different variables. In this recipe, we will learn how to apply different imputation procedures to different feature subsets using scikit-learn.
How to do it...
To proceed with the recipe, let's import the required libraries and classes and prepare the dataset:
- Let's import pandas and the required classes from scikit-learn:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- Let's divide the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'],
    test_size=0.3, random_state=0)
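Before deciding which imputation technique suits each variable, it can help to inspect the fraction of missing values per column. A minimal sketch on a toy DataFrame (the column names mimic the credit approval data but the values here are made up):

```python
import numpy as np
import pandas as pd

# toy data: one numerical and one categorical column, each with one gap
df = pd.DataFrame({
    'A2': [20.0, np.nan, 35.5, 41.0],
    'A4': ['u', 'y', np.nan, 'u'],
})

# fraction of missing values per column
missing_fraction = df.isnull().mean()
```

In practice you would run `X_train.isnull().mean()` on the real train set and use the result to decide which variables go into which imputation list.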
- Let's group a subset of columns to which we want to apply different imputation techniques in lists:
features_num_arbitrary = ['A3', 'A8']
features_num_median = ['A2', 'A14']
features_cat_frequent = ['A4', 'A5', 'A6', 'A7']
features_cat_missing = ['A1', 'A9', 'A10']
- Let's create different imputation transformers by wrapping SimpleImputer() in scikit-learn pipelines:
imputer_num_arbitrary = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=99)),
])
imputer_num_median = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])
imputer_cat_frequent = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
])
imputer_cat_missing = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
])
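To see what each strategy does in isolation, here is a small standalone sketch with made-up data, using two of the strategies from above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({'x': [1.0, np.nan, 3.0, 5.0]})
cat = pd.DataFrame({'c': ['a', 'a', np.nan, 'b']})

# strategy='median' replaces NaN with the column median (here, 3.0)
median_imp = SimpleImputer(strategy='median')
filled_num = median_imp.fit_transform(num)

# strategy='most_frequent' replaces NaN with the mode (here, 'a')
frequent_imp = SimpleImputer(strategy='most_frequent')
filled_cat = frequent_imp.fit_transform(cat)
```

The same pattern applies to `strategy='constant'`, where `fill_value` supplies the arbitrary number or string to impute with.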
- Now, let's assemble the pipelines with the imputers within ColumnTransformer() and assign each one to the corresponding feature subset we created earlier:
preprocessor = ColumnTransformer(transformers=[
    ('imp_num_arbitrary', imputer_num_arbitrary, features_num_arbitrary),
    ('imp_num_median', imputer_num_median, features_num_median),
    ('imp_cat_frequent', imputer_cat_frequent, features_cat_frequent),
    ('imp_cat_missing', imputer_cat_missing, features_cat_missing),
], remainder='passthrough')
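The `remainder='passthrough'` argument tells ColumnTransformer() to append any column not listed in a transformer to the output unchanged, instead of dropping it (the default). A toy sketch of that behavior, with made-up column names:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'a': [1.0, np.nan],   # assigned to an imputer
    'b': [10.0, 20.0],    # not assigned: passed through unchanged
})

ct = ColumnTransformer(transformers=[
    ('imp', SimpleImputer(strategy='constant', fill_value=0.0), ['a']),
], remainder='passthrough')

# transformed columns come first, passthrough columns after them
res = ct.fit_transform(df)
```

Note that the output column order follows the transformer list, not the original DataFrame, which matters when you later reattach column names.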
- Next, we need to fit the preprocessor to the train set so that the imputation parameters are learned:
preprocessor.fit(X_train)
- Finally, let's replace the missing values in the train and test sets:
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
Remember that scikit-learn transformers return NumPy arrays, so the column names are lost after transformation. The beauty of this procedure is that we can store the entire preprocessor as a single object, preserving all the parameters learned by the different transformers.
How it works...
In this recipe, we carried out different imputation techniques over different variable groups using scikit-learn's SimpleImputer() and ColumnTransformer().
After loading and dividing the dataset, we created four lists of features. The first list contained numerical variables to impute with an arbitrary value. The second list contained numerical variables to impute by the median. The third list contained categorical variables to impute by a frequent category. Finally, the fourth list contained categorical variables to impute with the Missing string.
Next, we created multiple imputation objects using SimpleImputer() inside scikit-learn pipelines. To assemble each Pipeline(), we gave the step a name as a string, imputer in our example, followed by the imputation object created with SimpleImputer(), varying the strategy argument across the different imputation techniques.
Next, we arranged the pipelines with the different imputation strategies within ColumnTransformer(). Each entry of ColumnTransformer() consists of a name given as a string, one of the pipelines we created, and the list of features to be imputed by that pipeline.
Next, we fitted ColumnTransformer() to the train set, where the imputers learned the values to be used to replace missing data from the train set. Finally, we imputed the missing values in the train and test sets, using the transform() method of ColumnTransformer() to obtain complete NumPy arrays.
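After fitting, the learned parameters can be inspected through the ColumnTransformer's `named_transformers_` attribute, which maps each entry name to its fitted pipeline. A small sketch with made-up data, mirroring the `imp_num_median` entry from the recipe:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

df = pd.DataFrame({'A2': [10.0, np.nan, 30.0, 20.0]})

pre = ColumnTransformer(transformers=[
    ('imp_num_median',
     Pipeline(steps=[('imputer', SimpleImputer(strategy='median'))]),
     ['A2']),
])
pre.fit(df)

# the median learned from the observed values 10, 30, 20 is 20.0
median_learned = (pre.named_transformers_['imp_num_median']
                     .named_steps['imputer'].statistics_)
```

This is the same mechanism that lets a saved preprocessor reapply exactly the train-set medians, modes, and constants to the test set or to new data.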
See also
To learn more about scikit-learn transformers and how to use them, take a look at the following links:
- SimpleImputer(): https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
- ColumnTransformer(): https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
- Stack Overflow: https://stackoverflow.com/questions/54160370/how-to-use-sklearn-column-transformer