Replacing missing values with an arbitrary number
We can replace missing data with an arbitrary value. Commonly used values are 999, 9999, or -1 for positive distributions. This method is used for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.
When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.
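As a minimal sketch of this check, assuming a small synthetic variable (the values and the helper name is_safe_arbitrary_value are illustrative and not part of the recipe's dataset), we can compare a candidate value against the observed range before using it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy variable with missing values (not the recipe's dataset).
s = pd.Series([22.5, 30.1, np.nan, 45.0, 28.7, np.nan, 76.75])


def is_safe_arbitrary_value(series: pd.Series, candidate: float) -> bool:
    """Return True if candidate lies outside the observed range of series."""
    observed = series.dropna()
    return bool(candidate < observed.min() or candidate > observed.max())


# Inspect the statistics we want the arbitrary value to stay away from.
print(s.describe()[["mean", "50%", "min", "max"]])
print(is_safe_arbitrary_value(s, 99))  # 99 is above the observed maximum
print(is_safe_arbitrary_value(s, 30))  # 30 falls inside the observed range
```

A value outside the observed range cannot coincide with the mean, the median, or any other observed value, which is why the recipe later picks a number above the maximum.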
Note
We’d use arbitrary number imputation when data is not missing at random, when we use non-linear models, or when the percentage of missing data is high. Keep in mind that this imputation technique distorts the original variable distribution.
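To make the distortion concrete, here is a small sketch on a synthetic variable (the numbers are made up for illustration) that compares the mean and standard deviation before and after imputing with 99:

```python
import numpy as np
import pandas as pd

# Hypothetical toy variable: 3 of its 8 values are missing.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan, np.nan, np.nan])

imputed = s.fillna(99)

# pandas skips NaN by default, so the "before" statistics
# describe the observed values only.
mean_before, std_before = s.mean(), s.std()
mean_after, std_after = imputed.mean(), imputed.std()

print(f"mean: {mean_before:.2f} -> {mean_after:.2f}")
print(f"std:  {std_before:.2f} -> {std_after:.2f}")
```

The larger the share of missing values and the further the arbitrary number lies from the observed data, the stronger the shift in both statistics.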
In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.
How to do it...
Let’s begin by importing the necessary tools and loading the data:
- Import pandas and the required functions and classes:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.imputation import ArbitraryNumberImputer
- Let’s load the dataset described in the Technical requirements section:
data = pd.read_csv("credit_approval_uci.csv")
- Let’s separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("target", axis=1),
    data["target"],
    test_size=0.3,
    random_state=0,
)
We will select arbitrary values greater than the maximum value of the distribution.
- Let’s find the maximum value of four numerical variables:
X_train[['A2','A3', 'A8', 'A11']].max()
The previous command returns the following output:
A2     76.750
A3     26.335
A8     28.500
A11    67.000
dtype: float64
We’ll use 99 for the imputation because it is bigger than the maximum values of the numerical variables in step 4.
- Let’s make a copy of the original DataFrames:
X_train_t = X_train.copy()
X_test_t = X_test.copy()
- Now, we replace the missing values with 99:
X_train_t[["A2", "A3", "A8", "A11"]] = X_train_t[[
    "A2", "A3", "A8", "A11"]].fillna(99)
X_test_t[["A2", "A3", "A8", "A11"]] = X_test_t[[
    "A2", "A3", "A8", "A11"]].fillna(99)
Note
To impute different variables with different values using pandas fillna(), use a dictionary like this: imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}.
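The dictionary approach can be sketched as follows, assuming a small synthetic DataFrame that reuses the recipe's column names (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the recipe's column names.
df = pd.DataFrame({
    "A2": [30.5, np.nan, 22.0],
    "A3": [np.nan, 1.5, 4.0],
    "A8": [0.5, 2.0, np.nan],
    "A11": [np.nan, 3.0, 1.0],
})

imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}

# fillna() matches each dictionary key to a column and fills that
# column's NaN with the corresponding value; it returns a new DataFrame.
df_t = df.fillna(value=imputation_dict)
print(df_t)
```

Columns that are not listed in the dictionary are left untouched, so the same call works on a DataFrame that contains additional variables.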
Now, we’ll impute missing values with an arbitrary number using scikit-learn.
- Let’s set up imputer to replace missing values with 99:
imputer = SimpleImputer(strategy='constant', fill_value=99)
Note
If you fit SimpleImputer() to a dataset that contains categorical variables, it will replace their missing values with 99 as well.
- Let’s fit imputer to a slice of the train set containing the variables to impute:
vars = ["A2", "A3", "A8", "A11"]
imputer.fit(X_train[vars])
- Replace the missing values with 99 in the desired variables:
X_train_t[vars] = imputer.transform(X_train[vars])
X_test_t[vars] = imputer.transform(X_test[vars])
Go ahead and confirm that no missing values remain by executing X_test_t[["A2", "A3", "A8", "A11"]].isnull().sum().
To finish, let’s impute missing values using feature-engine.
- Let’s set up the imputer to replace missing values with 99 in four specific variables:
imputer = ArbitraryNumberImputer(
    arbitrary_number=99,
    variables=["A2", "A3", "A8", "A11"],
)
Note
ArbitraryNumberImputer() will automatically select all numerical variables in the train set for imputation if we set the variables parameter to None.
- Finally, let’s replace the missing values with 99:
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
Note
To impute different variables with different numbers, set up ArbitraryNumberImputer() as follows: ArbitraryNumberImputer(imputer_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}).
We have now replaced missing data with arbitrary numbers using three different open-source libraries.
How it works...
In this recipe, we replaced missing values in numerical variables with an arbitrary number using pandas, scikit-learn, and feature-engine.
To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using pandas’ max(). We chose 99 because it was greater than the maximum values of the selected variables. After copying the DataFrames in step 5, we used pandas fillna() to replace the missing data.
To replace missing values using scikit-learn, we utilized SimpleImputer(), with the strategy set to constant, and specified 99 in the fill_value argument. Next, we fitted the imputer to a slice of the train set with the numerical variables to impute. Finally, we replaced missing values using transform().
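Instead of slicing the DataFrame by hand, one possible alternative is to wrap SimpleImputer() in scikit-learn's ColumnTransformer so that only the selected columns reach the imputer. This is not what the recipe does. The sketch below assumes a small synthetic DataFrame, and the column A1 stands in for a hypothetical categorical variable:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: two numerical columns and one categorical column.
df = pd.DataFrame({
    "A2": [30.5, np.nan, 22.0],
    "A3": [np.nan, 1.5, 4.0],
    "A1": ["a", None, "b"],
})

numeric_vars = ["A2", "A3"]

# Only the listed columns are sent to SimpleImputer; the remaining
# columns are dropped from the output (the default remainder="drop").
ct = ColumnTransformer(
    [("imputer", SimpleImputer(strategy="constant", fill_value=99), numeric_vars)]
)

# fit_transform() returns a NumPy array, so we rebuild a DataFrame.
imputed = pd.DataFrame(
    ct.fit_transform(df), columns=numeric_vars, index=df.index
)
print(imputed)
```

Here the categorical column never reaches the imputer, so its missing value is not replaced with 99.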
To replace missing values with feature-engine, we used ArbitraryNumberImputer(), specifying the value 99 and the variables to impute as parameters. Next, we applied the fit_transform() method to replace missing data in the train set and the transform() method to replace missing data in the test set.