Random sample imputation replaces missing data with observations drawn at random from the pool of available values of the variable. Unlike the other imputation techniques we've discussed in this chapter, it preserves the original distribution of the variable, and it is suitable for numerical and categorical variables alike. In this recipe, we will implement random sample imputation with pandas and Feature-engine.
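The claim about preserving the distribution can be checked on synthetic data. The following is a minimal sketch and not part of the recipe: the synthetic variable, the 30% missing rate, and the seeds are assumptions chosen for illustration. It compares the standard deviation of a variable after mean imputation and after random sample imputation.

```python
import numpy as np
import pandas as pd

# Synthetic variable (an assumption for illustration): 1,000 normal values
rng = np.random.default_rng(0)
complete = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Hide roughly 30% of the values at random
s = complete.copy()
s[rng.random(len(s)) < 0.3] = np.nan
missing = s.isnull()

# Mean imputation: every gap receives the same value
mean_imputed = s.fillna(s.mean())

# Random sample imputation: every gap receives a randomly drawn observed value
draws = s.dropna().sample(int(missing.sum()), random_state=0)
random_imputed = s.copy()
random_imputed[missing] = draws.to_numpy()

print("observed std:      ", round(s.std(), 2))
print("mean-imputed std:  ", round(mean_imputed.std(), 2))
print("random-imputed std:", round(random_imputed.std(), 2))
```

Mean imputation concentrates the filled-in observations at a single point and therefore shrinks the spread, whereas the random draws follow the observed values, so the spread stays close to the original.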
Implementing random sample imputation
How to do it...
Let's begin by importing the required libraries and tools and preparing the dataset:
- Let's import pandas, the train_test_split function from scikit-learn, and RandomSampleImputer from Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
# Feature-engine 1.0 and later; older releases used feature_engine.missing_data_imputers
from feature_engine.imputation import RandomSampleImputer
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- The random values that will be used to replace missing data should be extracted from the train set, so let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
First, we will run the commands line by line to understand their output. Then, we will execute them in a loop to impute several variables. In random sample imputation, we extract as many random values as there is missing data in the variable.
- Let's calculate the number of missing values in the A2 variable:
number_na = X_train['A2'].isnull().sum()
- If you print the number_na variable, you will obtain 11 as output, which is the number of missing values in A2. Thus, let's extract 11 values at random from A2 for the imputation:
random_sample_train = X_train['A2'].dropna().sample(
    number_na, random_state=0)
- We can only use one pandas Series to replace values in another pandas Series if their indexes are identical, so let's re-index the extracted random values so that they match the index of the missing values in the original dataframe:
random_sample_train.index = X_train[X_train['A2'].isnull()].index
- Now, let's replace the missing values in the original dataset with randomly extracted values:
X_train.loc[X_train['A2'].isnull(), 'A2'] = random_sample_train
- Now, let's combine step 4 to step 7 in a loop to replace the missing data in several variables in the train and test sets. Note that the values for the test set are also extracted from the train set:
for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:
    # extract a random sample
    random_sample_train = X_train[var].dropna().sample(
        X_train[var].isnull().sum(), random_state=0)
    random_sample_test = X_train[var].dropna().sample(
        X_test[var].isnull().sum(), random_state=0)
    # re-index the randomly extracted sample
    random_sample_train.index = X_train[
        X_train[var].isnull()].index
    random_sample_test.index = X_test[X_test[var].isnull()].index
    # replace the NA
    X_train.loc[X_train[var].isnull(), var] = random_sample_train
    X_test.loc[X_test[var].isnull(), var] = random_sample_test
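To see why the re-indexing step matters, the following minimal sketch repeats the replacement on a toy Series with and without it. The Series and its values are assumptions made up for illustration; only the pandas calls used in the steps above are involved.

```python
import numpy as np
import pandas as pd

# Toy variable (illustrative): missing values sit at labels 1 and 3
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
missing = s.isnull()

# Two observed values drawn at random; they keep the labels of their source rows
sample = s.dropna().sample(int(missing.sum()), random_state=0)

# Without re-indexing: pandas aligns on labels, and the labels of the sample
# (taken from 0, 2, 4) never match the labels of the missing rows (1, 3),
# so the gaps remain NaN
unaligned = s.copy()
unaligned.loc[missing] = sample

# With re-indexing: the sample now carries the labels of the missing rows
aligned = s.copy()
sample.index = s[missing].index
aligned.loc[missing] = sample

print(unaligned.isnull().sum(), aligned.isnull().sum())
```

The first assignment silently leaves the variable unchanged, which is why the sample is given the index of the missing rows before it is assigned.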
To finish, let's impute missing values using Feature-engine. The pandas steps modified X_train and X_test in place, so first we need to separate the data into train and test sets again, just like we did in step 3 of this recipe.
- Next, let's set up RandomSampleImputer() and fit it to the train set:
imputer = RandomSampleImputer()
imputer.fit(X_train)
We can specify the variables to impute by passing variable names in a list to the imputer using imputer = RandomSampleImputer(variables = ['A2', 'A3']).
- Finally, let's replace the missing values:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
How it works...
In this recipe, we replaced missing values in the numerical and categorical variables of the Credit Approval Data Set with values extracted at random from the same variables using pandas and Feature-engine. First, we loaded the data and divided it into train and test sets using train_test_split(), as described in the Performing mean or median imputation recipe.
To perform random sample imputation using pandas, we calculated the number of missing values in the variable using pandas isnull(), followed by sum(). Next, we used pandas dropna() to remove the missing observations from the variable in the train set, so that pandas sample() extracted values only from observations with data. We extracted as many observations as there were missing values in the variable to impute. Next, we re-indexed the pandas Series with the randomly extracted values so that we could assign them to the missing observations in the original dataframe. Finally, we replaced the missing values with the randomly extracted values using pandas loc, which takes as arguments the location of the rows with missing data and the name of the column to which the new values are to be assigned.
We also carried out random sample imputation with RandomSampleImputer() from Feature-engine. With the fit() method, the RandomSampleImputer() stores a copy of the train set. With transform(), the imputer extracts values at random from the stored dataset and replaces the missing information with them, thereby returning complete pandas dataframes.
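The fit() and transform() behavior described above can be pictured with a small pandas-only class. This is a sketch under stated assumptions, not Feature-engine's implementation: the class name ToyRandomSampleImputer, its parameters, and the choice to sample with replacement are all hypothetical and serve only to illustrate the idea of storing the train set and drawing from it later.

```python
import numpy as np
import pandas as pd


class ToyRandomSampleImputer:
    """Hypothetical, simplified stand-in for a fit/transform random sample imputer."""

    def __init__(self, variables, random_state=0):
        self.variables = variables
        self.random_state = random_state

    def fit(self, X):
        # Store a copy of the train-set columns to draw values from later
        self.X_ = X[self.variables].copy()
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            missing = X[var].isnull()
            n_missing = int(missing.sum())
            if n_missing == 0:
                continue
            # Draw from the observed train values; replace=True is an assumption
            # of this sketch so that it works when gaps outnumber donors
            draws = self.X_[var].dropna().sample(
                n_missing, replace=True, random_state=self.random_state)
            # Assign by position, which avoids the re-indexing step
            X.loc[missing, var] = draws.to_numpy()
        return X


# Toy train and test sets (illustrative values only)
train = pd.DataFrame({"A1": ["a", "b", None, "a"], "A2": [1.0, np.nan, 3.0, 4.0]})
test = pd.DataFrame({"A1": [None, "b"], "A2": [np.nan, np.nan]})

imputer = ToyRandomSampleImputer(variables=["A1", "A2"]).fit(train)
train_t = imputer.transform(train)
test_t = imputer.transform(test)
print(test_t)
```

Because the values are always drawn from the copy stored during fit(), the test set is imputed with train-set observations only, and the input dataframes are left untouched.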
See also
To learn more about Feature-engine's RandomSampleImputer(), go to https://feature-engine.readthedocs.io/en/latest/imputers/RandomSampleImputer.html. Pay particular attention to the different ways in which you can set the seed to ensure reproducibility.