You're reading from Python Feature Engineering Cookbook Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback

Published in Jan 2020

Publisher Packt

ISBN-13 9781789806311

Length 372 pages

Edition 1st Edition

Languages

Python

Tools

NumPy

Concepts

Machine Learning

Author (1):

Soledad Galli


Implementing random sample imputation

Random sample imputation consists of replacing missing values with observations drawn at random from the pool of available values of the same variable. Unlike the other imputation techniques we've discussed in this chapter, it preserves the original distribution of the variable, and it is suitable for numerical and categorical variables alike. In this recipe, we will implement random sample imputation with pandas and Feature-engine.
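
To make the claim about the distribution concrete, here is a minimal sketch on synthetic data (the variable, its size, and the share of missing values are assumptions chosen for illustration, not taken from the recipe's dataset). It compares the standard deviation after mean imputation with the one after random sample imputation:

```python
import numpy as np
import pandas as pd

# Synthetic variable (illustrative only): 1,000 draws, roughly 20% set to missing
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=50, scale=10, size=1000))
s = s.mask(rng.random(1000) < 0.2)

# Pool of available values
observed = s.dropna()

# Mean imputation: every gap receives the same value
mean_imputed = s.fillna(observed.mean())

# Random sample imputation: every gap receives a randomly drawn observed value
drawn = observed.sample(int(s.isnull().sum()), random_state=0)
drawn.index = s[s.isnull()].index  # align with the rows that are missing
random_imputed = s.fillna(drawn)

print("std observed:      ", round(observed.std(), 2))
print("std mean-imputed:  ", round(mean_imputed.std(), 2))
print("std random-imputed:", round(random_imputed.std(), 2))
```

Mean imputation always shrinks the standard deviation because it adds observations with zero deviation from the mean, whereas the randomly drawn values typically keep the spread close to that of the observed data.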

How to do it...

Let's begin by importing the required libraries and tools and preparing the dataset:

Let's import pandas, the train_test_split function from scikit-learn, and RandomSampleImputer from Feature-engine:

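import pandas as pd
from sklearn.model_selection import train_test_split
# Module path used by the Feature-engine release this book targets;
# newer releases expose the class as feature_engine.imputation.RandomSampleImputer
from feature_engine.missing_data_imputers import RandomSampleImputer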

Let's load the dataset:

data = pd.read_csv('creditApprovalUCI.csv')

The random values that will be used to replace missing data should be extracted from the train set, so let's separate the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3, 
    random_state=0)

First, we will run the commands line by line to understand their output. Then, we will execute them in a loop to impute several variables. In random sample imputation, we extract as many random values as there are missing values in the variable.

Let's calculate the number of missing values in the A2 variable:

number_na = X_train['A2'].isnull().sum()

If you print the number_na variable, you will obtain 11 as output, which is the number of missing values in A2. Thus, let's extract 11 values at random from A2 for the imputation:

random_sample_train = X_train['A2'].dropna().sample(number_na, 
                            random_state=0)

pandas aligns Series on their index when assigning values, so the replacement values must carry the same index labels as the rows they will fill. Let's re-index the extracted random values so that they match the index of the missing values in the original dataframe:

random_sample_train.index = X_train[X_train['A2'].isnull()].index

Now, let's replace the missing values in the original dataset with randomly extracted values:

X_train.loc[X_train['A2'].isnull(), 'A2'] = random_sample_train

Now, let's combine step 4 to step 7 in a loop to replace the missing data in several variables of the train and test sets:

for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:

    # extract a random sample
    random_sample_train = X_train[var].dropna().sample(
        X_train[var].isnull().sum(), random_state=0)

    random_sample_test = X_train[var].dropna().sample(
        X_test[var].isnull().sum(), random_state=0)

    # re-index the randomly extracted sample
    random_sample_train.index = X_train[
            X_train[var].isnull()].index
    random_sample_test.index = X_test[X_test[var].isnull()].index

    # replace the NA
    X_train.loc[X_train[var].isnull(), var] = random_sample_train
    X_test.loc[X_test[var].isnull(), var] = random_sample_test

Note how we always extract values from the train set, but we calculate the number of missing values and the index using the train or test sets, respectively.

To finish, let's impute missing values using Feature-engine. First, we need to separate the data into train and test, just like we did in step 3 of this recipe.

Next, let's set up RandomSampleImputer() and fit it to the train set:

imputer = RandomSampleImputer()
imputer.fit(X_train)

RandomSampleImputer() will replace the missing values in all variables in the dataset by default.

We can specify the variables to impute by passing variable names in a list to the imputer using imputer = RandomSampleImputer(variables = ['A2', 'A3']).

Finally, let's replace the missing values:

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

To obtain reproducibility between code runs, we can set the random_state to a number when we initialize the RandomSampleImputer(). It will use the random_state at each run of the transform() method.
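
The effect of a fixed seed can be illustrated with plain pandas sample(), which the manual steps of this recipe rely on. This is a small sketch with made-up values; it does not show Feature-engine's internal code:

```python
import pandas as pd

# Made-up values to sample from
values = pd.Series([10, 20, 30, 40, 50])

# The same random_state yields the same draw on every call
first = values.sample(3, random_state=42)
second = values.sample(3, random_state=42)

# True when both draws contain the same values with the same index labels
same_draw = first.equals(second)
print(same_draw)
```

Fixing the seed in this way makes the imputed values identical from one run to the next, which helps when comparing models trained on the imputed data.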

How it works...

In this recipe, we replaced missing values in the numerical and categorical variables of the Credit Approval Data Set with values extracted at random from the same variables using pandas and Feature-engine. First, we loaded the data and divided it into train and test sets using train_test_split(), as described in the Performing mean or median imputation recipe.

To perform random sample imputation using pandas, we calculated the number of missing values in the variable using pandas isnull(), followed by sum(). Next, we used pandas dropna() to remove the missing values from the original variable in the train set, so that pandas sample() would extract values only from observations with data. We extracted as many observations as there were missing values in the variable to impute. Next, we re-indexed the pandas Series with the randomly extracted values so that we could assign those to the missing observations in the original dataframe. Finally, we replaced the missing values with values extracted at random using pandas' loc, which takes the location of the rows with missing data and the name of the column to which the new values are to be assigned as arguments.
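
The reason for the re-indexing step can be seen in a small sketch. The toy dataframe below is an assumption made for illustration; only the column name A2 is borrowed from the recipe:

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values at index labels 1 and 3
df = pd.DataFrame({"A2": [1.0, np.nan, 3.0, np.nan, 5.0]})
mask = df["A2"].isnull()

# Two observed values drawn at random; they keep their original labels (from 0, 2, 4)
drawn = df["A2"].dropna().sample(2, random_state=0)

# Without re-indexing, loc aligns on labels, finds no match, and the gaps remain
not_aligned = df.copy()
not_aligned.loc[mask, "A2"] = drawn

# After re-indexing, the labels match the missing rows and the gaps are filled
aligned = df.copy()
drawn.index = df[mask].index
aligned.loc[mask, "A2"] = drawn

print(not_aligned["A2"].isnull().sum())
print(aligned["A2"].isnull().sum())
```

Because the drawn values can only come from rows that hold data, their labels never coincide with those of the missing rows, so the re-indexing step is always required with this approach.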

We also carried out random sample imputation with RandomSampleImputer() from Feature-engine. With the fit() method, the RandomSampleImputer() stores a copy of the train set. With transform(), the imputer extracts values at random from the stored dataset and replaces the missing information with them, thereby returning complete pandas dataframes.
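
The fit() and transform() behavior described above can be mimicked with a few lines of pandas. The class below is a hypothetical stand-in written for illustration (its name, its sampling with replacement, and the toy data are assumptions); it is not Feature-engine's implementation:

```python
import numpy as np
import pandas as pd


class SketchRandomSampleImputer:
    """Hypothetical stand-in that mimics the described fit/transform behavior."""

    def __init__(self, variables=None, random_state=None):
        self.variables = variables
        self.random_state = random_state

    def fit(self, X):
        # Store a copy of the train set to sample from later
        cols = self.variables if self.variables is not None else list(X.columns)
        self.variables_ = cols
        self.X_ = X[cols].copy()
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables_:
            mask = X[var].isnull()
            n_missing = int(mask.sum())
            if n_missing > 0:
                # Assumption: sample with replacement so that the draw cannot
                # fail when the gaps outnumber the stored observed values
                drawn = self.X_[var].dropna().sample(
                    n_missing, replace=True, random_state=self.random_state)
                # Assign raw values to sidestep index alignment
                X.loc[mask, var] = drawn.to_numpy()
        return X


# Toy train and test sets (illustrative values only)
train = pd.DataFrame({"A2": [30.0, np.nan, 45.0], "A4": ["u", "y", None]})
test = pd.DataFrame({"A2": [np.nan, 22.0], "A4": [None, "y"]})

imputer = SketchRandomSampleImputer(random_state=0).fit(train)
train_t = imputer.transform(train)
test_t = imputer.transform(test)
```

Storing the train set at fit time is what guarantees that the values used to fill the test set always come from the training data.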