Missing data refers to the absence of values for certain observations and is an unavoidable problem in most data sources. Most scikit-learn estimators do not accept missing values as input, so we need to remove observations with missing data or transform them into permitted values. Replacing missing data with statistical estimates of the missing values is called imputation. The goal of any imputation technique is to produce a complete dataset that can be used to train machine learning models. There are multiple imputation techniques we can apply to our data. The choice of technique depends on whether the data is missing at random, the proportion of missing values, and the machine learning model we intend to use. In this chapter, we will discuss several missing data imputation techniques.
In this chapter, we will use the Python libraries pandas, NumPy, and scikit-learn. I recommend installing the free Anaconda Python distribution (https://www.anaconda.com/distribution/), which contains all these packages.
We will also use the open source Python library Feature-engine, which I created and which can be installed using pip:
pip install feature-engine
To learn more about Feature-engine, visit the following sites:
- Home page: www.trainindata.com/feature-engine
- Docs: https://feature-engine.readthedocs.io
- GitHub: https://github.com/solegalli/feature_engine/
Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding those observations where the value of any variable is missing. CCA can be applied to categorical and numerical variables. CCA is quick and easy to implement and has the advantage that it preserves the distribution of the variables, provided the data is missing at random and only a small proportion of the data is missing. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.
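As a minimal sketch of CCA with pandas, we can quantify the fraction of missing values per variable and then drop every row that has at least one missing value. The toy DataFrame and its column names are invented for illustration and are not the dataset used in this chapter.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; not the chapter's dataset.
data = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 58.0],
    "income": [50000.0, 62000.0, np.nan, 48000.0, 75000.0],
    "city": ["London", "Paris", "Madrid", None, "Rome"],
})

# Fraction of missing values in each variable.
missing_fraction = data.isnull().mean()

# Complete case analysis: keep only the rows without any missing value.
data_cca = data.dropna()

print(missing_fraction)
print(f"Rows before: {len(data)}, rows after: {len(data_cca)}")
```

In this toy example, each variable is missing in only one row, yet CCA removes three of the five rows, which illustrates how quickly the dataset can shrink when missing values are spread across variables.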
Let's begin by loading pandas and the dataset:
- First, we'll import the pandas library:
Mean or median imputation consists of replacing missing values with the variable mean or median. This can only be performed on numerical variables. The mean or the median is calculated using the train set, and these values are used to impute missing data in the train and test sets, as well as in future data we intend to score with the machine learning model. Therefore, we need to store these mean and median values. Scikit-learn and Feature-engine transformers learn the parameters from the train set and store them for future use. So, in this recipe, we will learn how to perform mean or median imputation using the scikit-learn and Feature-engine libraries, with pandas for comparison.
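The learn-then-apply idea can be sketched with scikit-learn's SimpleImputer: the medians are learned from the train set with fit() and reused on the test set with transform(). The data and the column names a and b are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train and test sets invented for illustration.
X_train = pd.DataFrame({"a": [1.0, 2.0, np.nan, 9.0],
                        "b": [10.0, np.nan, 30.0, 50.0]})
X_test = pd.DataFrame({"a": [np.nan, 4.0],
                       "b": [20.0, np.nan]})

# Learn the median of each variable from the train set only.
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)

# The learned medians are stored in the statistics_ attribute.
print(imputer.statistics_)

# transform() returns a NumPy array, so we rebuild the DataFrames.
X_train_t = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
X_test_t = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

# pandas equivalent, for comparison: store the medians, then fill.
median_values = X_train.median()
X_test_pandas = X_test.fillna(median_values)
```

Setting strategy="mean" instead would impute with the mean. Note that the test set is filled with the medians of the train set, not with its own.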
Mode imputation consists of replacing missing values with the mode. We normally use this procedure on categorical variables, hence the name frequent category imputation. Frequent categories are estimated using the train set and then used to impute values in the train, test, and future datasets. Thus, we need to learn and store these parameters, which we can do using scikit-learn and Feature-engine's transformers; in the following recipe, we will learn how to do so.
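Before turning to the transformers, here is a minimal pandas-only sketch of the same idea: the most frequent category is learned from the train set, stored in a dictionary, and then used to fill both sets. The color variable and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy train and test sets invented for illustration.
X_train = pd.DataFrame({"color": ["blue", "red", "blue", np.nan, "green", "blue"]})
X_test = pd.DataFrame({"color": [np.nan, "red", np.nan]})

categorical_vars = ["color"]

# Learn the most frequent category of each variable from the train set.
# mode() can return several values when there are ties; we take the first.
frequent = {var: X_train[var].mode().iloc[0] for var in categorical_vars}

# Use the stored categories to fill the train and test sets.
X_train_t = X_train.fillna(value=frequent)
X_test_t = X_test.fillna(value=frequent)

print(frequent)
```

The dictionary plays the role of the parameters that a fitted transformer would store for us.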
To begin, let's make a few imports and prepare the data:
Arbitrary number imputation consists of replacing missing values with an arbitrary value. Some commonly used values include 999, 9999, or -1 for positive distributions. This method is suitable for numerical variables. A similar method for categorical variables will be discussed in the Capturing missing values in a bespoke category recipe.
When replacing missing values with an arbitrary number, we need to be careful not to select a value close to the mean or the median, or any other common value of the distribution.
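A minimal sketch of arbitrary number imputation with scikit-learn's SimpleImputer follows. The data, the variable names, and the choice of 999 are assumptions made for illustration; the check before imputing confirms that the chosen value lies outside the observed range.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data invented for illustration.
X_train = pd.DataFrame({"age": [22.0, 35.0, np.nan, 58.0],
                        "fare": [7.25, np.nan, 71.3, 8.05]})

arbitrary_value = 999

# Check that the arbitrary value is larger than every observed value,
# so it cannot be confused with a real observation.
largest_observed = X_train.max().max()
print(f"Largest observed value: {largest_observed}")

# Replace every missing value with the same constant.
imputer = SimpleImputer(strategy="constant", fill_value=arbitrary_value)
X_train_t = pd.DataFrame(imputer.fit_transform(X_train),
                         columns=X_train.columns)
```

With pandas alone, X_train.fillna(arbitrary_value) would give the same result.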
In this recipe, we will impute missing data by arbitrary numbers using pandas, scikit...
Missing data in categorical variables can be treated as a category in its own right, so it is common to replace missing values with the string "Missing". In this recipe, we will learn how to do so using pandas, scikit-learn, and Feature-engine.
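In its simplest form, the idea can be sketched with pandas alone. The variables A1 and A2 and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy categorical data invented for illustration.
X_train = pd.DataFrame({"A1": ["a", np.nan, "b", "a"],
                        "A2": [np.nan, "x", "y", np.nan]})

# Replace missing values with a new category called "Missing".
X_train_t = X_train.fillna("Missing")

# The new category now appears among the values of each variable.
print(X_train_t["A2"].value_counts())
```

Unlike the previous techniques, nothing needs to be learned from the train set here, because the replacement value is fixed in advance.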
To proceed with the recipe, let's import the required tools and prepare the dataset:
- Import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import CategoricalVariableImputer
- Let's load the dataset:
data = pd.read_csv('creditApprovalUCI...
Replacing missing values with a value at the end of the variable distribution is equivalent to replacing them with an arbitrary value, but instead of identifying the arbitrary values manually, these values are automatically selected as those at the very end of the variable distribution. The replacement value is estimated as the mean plus or minus three times the standard deviation if the variable is normally distributed, or with the inter-quartile range (IQR) proximity rule otherwise. According to the IQR proximity rule, missing values are replaced with the 75th percentile + (IQR * 1.5) at the right tail or with the 25th percentile - (IQR * 1.5) at the left tail. The IQR is the 75th percentile minus the 25th percentile.
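Both rules can be worked through with pandas on a small example. The values below are invented so that the arithmetic is easy to follow: the mean is 6, the 25th percentile is 4, and the 75th percentile is 8.

```python
import numpy as np
import pandas as pd

# Toy variable invented for illustration; pandas ignores NaN in statistics.
train = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, np.nan, np.nan])

# Rule for normally distributed variables: mean +/- 3 standard deviations.
gauss_right = train.mean() + 3 * train.std()
gauss_left = train.mean() - 3 * train.std()

# IQR proximity rule for other variables.
q1 = train.quantile(0.25)
q3 = train.quantile(0.75)
iqr = q3 - q1
iqr_right = q3 + 1.5 * iqr
iqr_left = q1 - 1.5 * iqr

# Impute with the value at the right tail given by the IQR rule.
train_imputed = train.fillna(iqr_right)

print(f"Gaussian rule: left={gauss_left:.2f}, right={gauss_right:.2f}")
print(f"IQR rule: left={iqr_left}, right={iqr_right}")
```

Here the IQR is 8 - 4 = 4, so the right tail value is 8 + 1.5 * 4 = 14 and the left tail value is 4 - 1.5 * 4 = -2. As with the mean and the median, these values should be computed on the train set and reused on the test set.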
Random sample imputation consists of extracting random observations from the pool of available values in the variable and using them to replace the missing values. Unlike the other imputation techniques we've discussed in this chapter, random sample imputation preserves the original distribution of the variable. It is suitable for numerical and categorical variables alike. In this recipe, we will implement random sample imputation with pandas and Feature-engine.
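The following pandas sketch shows one way to do this. The helper function random_sample_impute is hypothetical, written for this illustration, and the data is invented. Values are always drawn from the train set, and a fixed seed makes the result reproducible.

```python
import numpy as np
import pandas as pd

def random_sample_impute(train_col, target_col, seed=0):
    """Fill NaN in target_col with values drawn at random from train_col."""
    missing = target_col.isnull()
    n_missing = int(missing.sum())
    # Draw as many observed train values as there are missing values.
    sampled = train_col.dropna().sample(
        n=n_missing, replace=True, random_state=seed
    )
    imputed = target_col.copy()
    # Assign by position so the sampled index does not affect alignment.
    imputed.loc[missing] = sampled.to_numpy()
    return imputed

# Toy data invented for illustration.
train_age = pd.Series([22.0, 35.0, np.nan, 58.0, 41.0, np.nan, 29.0])
test_age = pd.Series([np.nan, 50.0, np.nan])

train_imputed = random_sample_impute(train_age, train_age, seed=0)
test_imputed = random_sample_impute(train_age, test_age, seed=0)

print(train_imputed.tolist())
print(test_imputed.tolist())
```

Sampling with replacement is a design choice made here so that the sketch also works when there are more missing values than observed ones.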
Let's begin by importing the required libraries and tools and preparing the dataset:
- Let's import pandas, the train_test_split function from scikit-learn, and RandomSampleImputer from Feature-engine:
import pandas as pd
A missing indicator is a binary variable that specifies whether a value was missing for an observation (1) or not (0). It is common practice to replace missing observations with the mean, median, or mode while flagging those missing observations with a missing indicator, thus covering two angles: if the data was missing at random, this is accounted for by the mean, median, or mode imputation, and if it wasn't, this is captured by the missing indicator. In this recipe, we will learn how to add missing indicators using NumPy, scikit-learn, and Feature-engine.
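A minimal sketch with NumPy and pandas is shown here. The data is invented and the _NA suffix for the indicator names is an arbitrary choice. The indicators must be created before imputing, because afterward there are no missing values left to flag.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration.
X_train = pd.DataFrame({"age": [22.0, np.nan, 35.0, 58.0],
                        "fare": [7.25, 71.3, np.nan, 8.05]})

variables = ["age", "fare"]

# Step 1: add a binary indicator per variable (1 if missing, 0 otherwise).
for var in variables:
    X_train[var + "_NA"] = np.where(X_train[var].isnull(), 1, 0)

# Step 2: impute the original variables with their median.
median_values = X_train[variables].median()
X_train = X_train.fillna(median_values)

print(X_train)
```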
For an example of the implementation of missing indicators, along with mean imputation...
Multivariate imputation methods, as opposed to univariate imputation, use the entire set of variables to estimate the missing values. In other words, the missing values of a variable are modeled based on the other variables in the dataset. Multivariate imputation by chained equations (MICE) is a multiple imputation technique that models each variable with missing values as a function of the remaining variables and uses that estimate for imputation. MICE has the following basic steps:
- A simple univariate imputation is performed for every variable with missing data, for example, median imputation.
- One specific variable is selected, say, var_1, and the missing values are set back to missing.
- A model that's used to predict var_1 is built based on the remaining variables in the dataset.
- The missing values...
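The procedure outlined above can be approximated with scikit-learn's IterativeImputer, which at the time of writing is an experimental class that has to be enabled explicitly before it is imported. The sketch below is an illustration under stated assumptions rather than a full MICE implementation: the data is simulated, var_2 is a made-up second variable, BayesianRidge is one possible choice of model, and the imputer returns a single imputed dataset.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Simulate two correlated variables (invented for illustration).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
X = pd.DataFrame({"var_1": x1, "var_2": x2})

# Remove 20% of the values of var_1 at random.
X_missing = X.copy()
rows_to_blank = X_missing.sample(frac=0.2, random_state=0).index
X_missing.loc[rows_to_blank, "var_1"] = np.nan

# Start from a median imputation, then model each variable with
# missing data as a function of the remaining variables.
imputer = IterativeImputer(
    estimator=BayesianRidge(),
    initial_strategy="median",
    max_iter=10,
    random_state=0,
)
X_imputed = pd.DataFrame(imputer.fit_transform(X_missing),
                         columns=X_missing.columns)

print(X_imputed.isnull().sum())
```

Because var_2 is strongly related to var_1 in this simulation, the model-based estimates should follow the true values more closely than a single median would.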
Datasets often contain a mix of numerical and categorical variables. In addition, some variables may contain a few missing data points, while others will contain a large proportion. The mechanisms by which data is missing may also vary among variables. Thus, we may wish to perform different imputation procedures for different variables. In this recipe, we will learn how to perform different imputation procedures for different feature subsets using scikit-learn.
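One way to do this in scikit-learn is to combine several SimpleImputer instances in a ColumnTransformer, each applied to its own list of variables. The sketch below uses invented data and variable names, with median imputation for the numerical variables and a "Missing" category for the categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data invented for illustration.
X_train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 48000.0],
    "city": pd.Series(["London", np.nan, "Paris", "London"], dtype=object),
})

numeric_vars = ["age", "income"]
categorical_vars = ["city"]

# Each tuple holds a name, an imputer, and the variables it applies to.
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_imputer", SimpleImputer(strategy="median"), numeric_vars),
        ("categorical_imputer",
         SimpleImputer(strategy="constant", fill_value="Missing"),
         categorical_vars),
    ],
    remainder="drop",
)
preprocessor.fit(X_train)

# The output columns follow the order of the transformers above.
X_train_t = pd.DataFrame(preprocessor.transform(X_train),
                         columns=numeric_vars + categorical_vars)

print(X_train_t)
```

Because the output array mixes numbers and strings, the resulting DataFrame holds generic object columns, so the numerical variables may need to be cast back to float before further processing.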
To proceed with the recipe, let's import the required libraries and classes and prepare the dataset:
- Let's import pandas and the required classes from scikit-learn:
Feature-engine is an open source Python library that allows us to easily implement different imputation techniques for different feature subsets. Often, our datasets contain a mix of numerical and categorical variables, with few or many missing values. Therefore, we normally perform different imputation techniques on different variables, depending on the nature of the variable and the machine learning algorithm we want to build. With Feature-engine, we can assemble multiple imputation techniques in a single step, and in this recipe, we will learn how to do this.
Let's begin by importing the necessary Python libraries and preparing the data: