Let's begin by importing the necessary tools and loading and preparing the data:
- Import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.imputation import ArbitraryNumberImputer
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- Let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
Normally, we select an arbitrary value that is bigger than the maximum value of the variable's distribution, so that the imputed entries are easy to distinguish from genuine observations.
- Let's find the maximum value of four numerical variables:
X_train[['A2','A3', 'A8', 'A11']].max()
The following is the output of the preceding code block:
A2 76.750
A3 26.335
A8 20.000
A11 67.000
dtype: float64
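As a sketch of how one could derive such a value programmatically instead of eyeballing the maxima, the snippet below picks the next power of ten above the largest maximum; the maxima are copied from the output above, so the data is illustrative only:

```python
import numpy as np
import pandas as pd

# maxima taken from the output above, reproduced here for illustration
maxima = pd.Series({"A2": 76.75, "A3": 26.335, "A8": 20.0, "A11": 67.0})

# next power of ten above the largest observed maximum
arbitrary_value = 10 ** int(np.ceil(np.log10(maxima.max())))
print(arbitrary_value)  # 100
```

Any round number above all the maxima works, which is why 99 is a valid choice for these variables.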
- Let's replace the missing values with 99 in the numerical variables that we specified in step 4:
for var in ['A2', 'A3', 'A8', 'A11']:
    X_train[var] = X_train[var].fillna(99)
    X_test[var] = X_test[var].fillna(99)
We chose 99 as the arbitrary value because it is bigger than the maximum value of these variables.
We can check the percentage of missing values using X_train[['A2','A3', 'A8', 'A11']].isnull().mean(), which should be 0 after step 5.
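To make that check concrete, here is a minimal, self-contained sketch with made-up data showing the missing-value fraction before and after imputation:

```python
import numpy as np
import pandas as pd

# hypothetical variable with two missing values out of four
df = pd.DataFrame({"A2": [22.0, np.nan, 56.0, np.nan]})

print(df["A2"].isnull().mean())  # 0.5 before imputation
df["A2"] = df["A2"].fillna(99)
print(df["A2"].isnull().mean())  # 0.0 after imputation
```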
Now, we'll impute missing values with an arbitrary number using scikit-learn instead.
- First, let's separate the data into train and test sets while keeping only the numerical variables:
X_train, X_test, y_train, y_test = train_test_split(
    data[['A2', 'A3', 'A8', 'A11']], data['A16'], test_size=0.3,
    random_state=0)
- Let's set up SimpleImputer() so that it replaces any missing values with 99:
imputer = SimpleImputer(strategy='constant', fill_value=99)
If your dataset contains categorical variables with missing values, SimpleImputer() will replace those missing values with 99 as well.
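One way to avoid that, sketched below with hypothetical data, is to restrict the imputer to the numerical columns with scikit-learn's ColumnTransformer and pass the remaining columns through untouched:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# made-up data: one numerical and one categorical variable
df = pd.DataFrame({
    "A2": [22.0, np.nan, 56.0],   # numerical, to be imputed
    "A13": ["g", None, "s"],      # categorical, left untouched
})

ct = ColumnTransformer(
    [("num", SimpleImputer(strategy="constant", fill_value=99), ["A2"])],
    remainder="passthrough",
)
result = ct.fit_transform(df)
# first column holds the imputed A2; second holds A13 unchanged
```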
- Let's fit the imputer to the train set:
imputer.fit(X_train)
- Let's replace the missing values with 99:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Note that SimpleImputer() will return a NumPy array. Be mindful of the order of the variables if you're transforming the array back into a dataframe.
To finish, let's impute missing values using Feature-engine. First, we need to load the data and separate it into train and test sets, just like we did in step 2 and step 3.
- Next, let's set up Feature-engine's ArbitraryNumberImputer() so that it replaces any missing values with 99, and specify the variables from which missing data should be imputed:
imputer = ArbitraryNumberImputer(arbitrary_number=99,
    variables=['A2', 'A3', 'A8', 'A11'])
If we don't pass a list of variables, ArbitraryNumberImputer() will automatically select all numerical variables in the train set.
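That automatic selection amounts to picking the numerical columns of the dataframe. A minimal pandas sketch of the same selection, using hypothetical data, looks like this:

```python
import numpy as np
import pandas as pd

# made-up data mixing a numerical and a categorical variable
df = pd.DataFrame({
    "A2": [22.0, np.nan],   # numerical -> would be selected for imputation
    "A13": ["g", "s"],      # categorical -> ignored
})

numerical_vars = df.select_dtypes(include="number").columns.tolist()
print(numerical_vars)  # ['A2']
```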
- Let's fit the arbitrary number imputer to the train set:
imputer.fit(X_train)
- Finally, let's replace the missing values with 99:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
The variables specified in step 10 should now have their missing data replaced with the number 99. Unlike SimpleImputer(), Feature-engine's transformers return a pandas dataframe, so the variable names and order are preserved.