Let's begin this recipe:
- First, we'll import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import MeanMedianImputer
# Feature-engine 1.0 and later moved this class to feature_engine.imputation
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
- In mean and median imputation, the mean or median values should be calculated using the variables in the train set; therefore, let's separate the data into train and test sets and their respective targets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
You can check the size of the returned datasets using pandas' shape: X_train.shape, X_test.shape.
- Let's check the percentage of missing values in the train set:
X_train.isnull().mean()
The following output shows the fraction of missing values for each variable; multiply by 100 to read it as a percentage:
A1 0.008282
A2 0.022774
A3 0.140787
A4 0.008282
A5 0.008282
A6 0.008282
A7 0.008282
A8 0.140787
A9 0.140787
A10 0.140787
A11 0.000000
A12 0.000000
A13 0.000000
A14 0.014493
A15 0.000000
dtype: float64
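The check above can be sketched on a small toy dataframe to make the arithmetic visible. The values below are hypothetical and are not taken from the credit approval dataset; only the variable names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the credit approval dataset)
toy = pd.DataFrame({
    "A2": [30.0, np.nan, 25.0, 41.0],
    "A3": [1.5, 2.0, np.nan, np.nan],
    "A11": [0, 1, 2, 3],
})

# isnull() marks each missing cell as True; the column mean of those
# booleans is the fraction of missing values per variable
missing_fraction = toy.isnull().mean()

# Multiply by 100 to express the same information as a percentage
missing_percent = missing_fraction * 100

# Absolute number of missing values per variable
missing_count = toy.isnull().sum()

print(missing_fraction)
```

Using sum() instead of mean() gives the absolute counts, which can be easier to read when only a handful of observations are missing.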
- Let's replace the missing values with the median in five numerical variables using pandas:
for var in ['A2', 'A3', 'A8', 'A11', 'A15']:
    # learn the median from the train set only
    value = X_train[var].median()
    # use that value to fill both the train and test sets
    X_train[var] = X_train[var].fillna(value)
    X_test[var] = X_test[var].fillna(value)
Note how we calculate the median using the train set and then use this value to replace the missing data in the train and test sets.
To impute missing data with the mean, we use pandas' mean(): value = X_train[var].mean().
If you run the code in step 4 after imputation, the percentage of missing values for the A2, A3, A8, A11, and A15 variables should be 0.
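By default, pandas' fillna() returns a new object with the imputed values, which is why we reassign the result in the loop. Setting the inplace argument to True modifies the object it is called on instead; the fill value is still required: X_train[var].fillna(value, inplace=True). Note that when fillna() is called in place on a column selected from a dataframe, as here, recent pandas versions may not propagate the change to the original dataframe, so reassignment is the safer pattern.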
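As a variation on the loop above, the medians of several variables can be learned in one call and passed to fillna() as a Series. This is a minimal sketch on hypothetical toy data; the names train, test, and medians are illustrative and do not appear in the recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy train and test sets
train = pd.DataFrame({"A2": [20.0, np.nan, 40.0, 30.0],
                      "A3": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"A2": [np.nan, 50.0],
                     "A3": [4.0, np.nan]})

# Learn the medians from the train set only (NaN is skipped by default)
medians = train.median()

# fillna() with a Series fills each column with its matching value and
# returns a new dataframe, so the originals are left untouched
train_imputed = train.fillna(medians)
test_imputed = test.fillna(medians)

print(medians.to_dict())
```

The test set is filled with the train medians, never with its own, which mirrors the reasoning given for splitting the data before imputation.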
Now, let's impute missing values by the median using scikit-learn so that we can store learned parameters.
- To do this, let's separate the original dataset into train and test sets, keeping only the numerical variables:
X_train, X_test, y_train, y_test = train_test_split(
    data[['A2', 'A3', 'A8', 'A11', 'A15']], data['A16'],
    test_size=0.3, random_state=0)
SimpleImputer() from scikit-learn will impute all variables in the dataset. Therefore, if we use mean or median imputation and the dataset contains categorical variables, we will get an error.
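If we would rather keep the categorical variables in the dataset, one option is to wrap the imputer in scikit-learn's ColumnTransformer so that it only sees the numerical columns. This is not part of the recipe; it is a sketch on hypothetical toy data, and the transformer name median_imputer is an arbitrary label.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one categorical and two numerical variables
toy = pd.DataFrame({
    "A1": ["a", "b", "b", "a"],
    "A2": [20.0, np.nan, 40.0, 30.0],
    "A3": [1.0, 2.0, np.nan, 3.0],
})
numeric_vars = ["A2", "A3"]

# Apply the median imputer to the numerical columns only and pass the
# remaining columns through unchanged
ct = ColumnTransformer(
    transformers=[
        ("median_imputer", SimpleImputer(strategy="median"), numeric_vars),
    ],
    remainder="passthrough",
)
result = ct.fit_transform(toy)

# The fitted imputer and its learned medians remain accessible by name
learned = ct.named_transformers_["median_imputer"].statistics_
print(learned)
```

In the returned array, the transformed columns come first and the passed-through columns follow, so the column order differs from that of the input dataframe.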
- Let's create a median imputation transformer using SimpleImputer() from scikit-learn:
imputer = SimpleImputer(strategy='median')
To perform mean imputation, we should set the strategy to mean: imputer = SimpleImputer(strategy='mean').
- Let's fit the SimpleImputer() to the train set so that it learns the median values of the variables:
imputer.fit(X_train)
- Let's inspect the learned median values:
imputer.statistics_
The imputer stores median values in the statistics_ attribute, as shown in the following output:
array([28.835, 2.75 , 1. , 0. , 6. ])
- Let's replace missing values with medians:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
SimpleImputer() returns NumPy arrays. We can transform the array into a dataframe using pd.DataFrame(X_train, columns = ['A2', 'A3', 'A8', 'A11', 'A15']). Be mindful of the order of the variables.
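The conversion back to a dataframe can be sketched as follows. Besides the column names, it can help to reuse the original index so that the rows still align with the target after the split. The toy data and the name X_train_imputed are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train set with a shuffled index, as produced by a split
X_train = pd.DataFrame(
    {"A2": [20.0, np.nan, 40.0, 30.0], "A3": [1.0, 2.0, np.nan, 3.0]},
    index=[7, 2, 9, 4],
)

imputer = SimpleImputer(strategy="median")
imputed_array = imputer.fit_transform(X_train)  # NumPy array, labels are lost

# Reuse the original column names and index so rows still align with y_train
X_train_imputed = pd.DataFrame(
    imputed_array, columns=X_train.columns, index=X_train.index
)

print(X_train_imputed)
```

Taking the names from X_train.columns rather than typing them by hand avoids mistakes in the variable order. Recent scikit-learn versions (1.2 and later) also let transformers return dataframes directly through set_output(transform='pandas').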
Finally, let's perform median imputation using MeanMedianImputer() from Feature-engine. First, we need to load and divide the dataset, just like we did in step 2 and step 3. Next, we need to create an imputation transformer.
- Let's set up a median imputation transformer using MeanMedianImputer() from Feature-engine specifying the variables to impute:
median_imputer = MeanMedianImputer(imputation_method='median',
                                   variables=['A2', 'A3', 'A8', 'A11', 'A15'])
To perform mean imputation, change the imputation method, as follows: MeanMedianImputer(imputation_method='mean').
- Let's fit the median imputer so that it learns the median values for each of the specified variables:
median_imputer.fit(X_train)
- Let's inspect the learned medians:
median_imputer.imputer_dict_
With the previous command, we can see the median values, which are stored as a dictionary in the imputer_dict_ attribute:
{'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A15': 6.0}
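To replace the missing values, apply the fitted transformer to both sets: X_train = median_imputer.transform(X_train) and X_test = median_imputer.transform(X_test). Feature-engine's MeanMedianImputer() returns a dataframe. You can check that the imputed variables do not contain missing values using X_train[['A2', 'A3', 'A8', 'A11', 'A15']].isnull().mean().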