Python Feature Engineering Cookbook - Third Edition
Python Feature Engineering Cookbook: A complete guide to crafting powerful features for your machine learning models, Third Edition

By Soledad Galli
Book Aug 2024 396 pages 3rd Edition

Imputing Missing Data

Missing data—meaning the absence of values for certain observations—is an unavoidable problem in most data sources. Some machine learning model implementations can handle missing data out of the box. To train other models, we must remove observations with missing data or transform them into permitted values.

The act of replacing missing data with their statistical estimates is called imputation. The goal of any imputation technique is to produce a complete dataset. There are multiple imputation methods. We select which one to use, depending on whether the data is missing at random, the proportion of missing values, and the machine learning model we intend to use. In this chapter, we will discuss several imputation methods.

This chapter will cover the following recipes:

  • Removing observations with missing data
  • Performing mean or median imputation
  • Imputing categorical variables
  • Replacing missing values with an arbitrary number
  • Finding extreme values for imputation
  • Marking imputed values
  • Implementing forward and backward fill
  • Carrying out interpolation
  • Performing multivariate imputation by chained equations
  • Estimating missing data with nearest neighbors

Technical requirements

In this chapter, we will use the Python libraries Matplotlib, pandas, NumPy, scikit-learn, and Feature-engine. If you need to install Python, the free Anaconda Python distribution (https://www.anaconda.com/) includes most numerical computing libraries.

feature-engine can be installed with pip as follows:

pip install feature-engine

If you use Anaconda, you can install feature-engine with conda:

conda install -c conda-forge feature_engine

Note

The recipes from this chapter were created using the latest versions of the Python libraries at the time of publishing. You can check the versions in the requirements.txt file in the accompanying GitHub repository, at https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/requirements.txt.

We will use the Credit Approval dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/), licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://creativecommons.org/licenses/by/4.0/legalcode. You’ll find the dataset at this link: http://archive.ics.uci.edu/dataset/27/credit+approval.

I downloaded and modified the data as shown in this notebook: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/credit-approval-dataset.ipynb

We will also use the air passenger dataset located in Facebook’s Prophet GitHub repository (https://github.com/facebook/prophet/blob/main/examples/example_air_passengers.csv), licensed under the MIT license: https://github.com/facebook/prophet/blob/main/LICENSE

I modified the data as shown in this notebook: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/air-passengers-dataset.ipynb

You’ll find a copy of the modified data sets in the accompanying GitHub repository: https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/

Removing observations with missing data

Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations with missing data. CCA can be applied to both categorical and numerical variables. With CCA, we preserve the distribution of the variables after removing the incomplete observations, provided the data is missing at random and only in a small proportion of observations. However, if data is missing across many variables, CCA may lead to the removal of a large portion of the dataset.

Note

Use CCA only when a small number of observations have missing data and you have good reasons to believe that they are not important to your model.
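
To get a feel for how quickly list-wise deletion can shrink a dataset, here is a minimal sketch on a made-up DataFrame. The column names and values are hypothetical and are not taken from the Credit Approval data. Each variable misses only one or two values, yet most of the rows are lost because the missing values fall in different rows:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): every column misses at most two values.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, 52.0, 47.0],
    "income": [3000.0, 4200.0, np.nan, 5100.0, 2800.0, np.nan],
    "city": ["A", "B", "B", None, "A", "C"],
})

# Complete case analysis: keep only the rows without any missing value.
toy_cca = toy.dropna()

fraction_kept = len(toy_cca) / len(toy)
print(f"Rows kept: {len(toy_cca)} of {len(toy)} ({fraction_kept:.0%})")
```

Comparing the number of rows before and after the deletion, as in this sketch, is a quick way to decide whether CCA is affordable for a given dataset.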

How to do it...

Let’s begin by making some imports and loading the dataset:

  1. Let’s import pandas, matplotlib, and the train/test split function from scikit-learn:
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
  2. Let’s load and display the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
    data.head()

    In the following image, we see the first 5 rows of data:

Figure 1.1 – First 5 rows of the dataset

  3. Let’s proceed as we normally would if we were preparing the data to train machine learning models, by splitting the data into a training and a test set:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.30,
        random_state=42,
    )
  4. Let’s now make a bar plot with the proportion of missing data per variable in the training and test sets:
    fig, axes = plt.subplots(
        2, 1, figsize=(15, 10), squeeze=False)
    X_train.isnull().mean().plot(
        kind='bar', color='grey', ax=axes[0, 0], title="train")
    X_test.isnull().mean().plot(
        kind='bar', color='black', ax=axes[1, 0], title="test")
    axes[0, 0].set_ylabel('Fraction of NAN')
    axes[1, 0].set_ylabel('Fraction of NAN')
    plt.show()

    The previous code block returns the following bar plots with the fraction of missing data per variable in the training (top) and test sets (bottom):

Figure 1.2 – Proportion of missing data per variable

  5. Now, we’ll remove observations if they have missing values in any variable:
    train_cca = X_train.dropna()
    test_cca = X_test.dropna()

Note

pandas’ dropna() drops observations with any missing value by default. We can remove observations with missing data in a subset of variables like this: data.dropna(subset=["A3", "A4"]).

  6. Let’s print and compare the size of the original and complete case datasets:
    print(f"Total observations: {len(X_train)}")
    print(f"Observations without NAN: {len(train_cca)}")

    We removed more than 200 observations with missing data from the training set, as shown in the following output:

    Total observations: 483
    Observations without NAN: 264
  7. After removing observations from the training and test sets, we need to align the target variables:
    y_train_cca = y_train.loc[train_cca.index]
    y_test_cca = y_test.loc[test_cca.index]

    Now, the datasets and target variables contain the rows without missing data.

  8. To drop observations with missing data utilizing feature-engine, let’s import the required transformer:
    from feature_engine.imputation import DropMissingData
  9. Let’s set up the imputer to automatically find the variables with missing data:
    cca = DropMissingData(variables=None, missing_only=True)
  10. Let’s fit the transformer so that it finds the variables with missing data:
    cca.fit(X_train)
  11. Let’s inspect the variables with NAN that the transformer found:
    cca.variables_

    The previous command returns the names of the variables with missing data:

    ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A14']
  12. Let’s remove the rows with missing data in the training and test sets:
    train_cca = cca.transform(X_train)
    test_cca = cca.transform(X_test)

    Use train_cca.isnull().sum() to corroborate the absence of missing data in the complete case dataset.

  13. DropMissingData can automatically adjust the target after removing missing data from the training set:
    train_c, y_train_c = cca.transform_x_y(X_train, y_train)
    test_c, y_test_c = cca.transform_x_y(X_test, y_test)

The previous code removed rows with nan from the training and test sets and then re-aligned the target variables.

Note

To remove observations with missing data in a subset of variables, use DropMissingData(variables=['A3', 'A4']). To remove rows with nan in more than 5% of the variables, use DropMissingData(threshold=0.95).
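
The effect of a row-wise threshold can be sketched with plain pandas. The following is an illustration on made-up data, not Feature-engine’s implementation: it keeps the rows in which at least 75% of the variables have a value. The helper name keep_rows_with_enough_data and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def keep_rows_with_enough_data(df, threshold):
    """Keep rows whose fraction of non-missing values is >= threshold."""
    fraction_present = df.notnull().mean(axis=1)
    return df[fraction_present >= threshold]


toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
    "d": [1.0, 2.0, 3.0, 4.0],
})

# Rows 0 to 2 have at least 3 of 4 values present; row 3 has only 1 of 4.
filtered = keep_rows_with_enough_data(toy, threshold=0.75)
print(filtered.index.to_list())
```

Note that rows sitting exactly at the threshold are kept in this sketch; only rows with fewer values than required are removed.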

How it works...

In this recipe, we plotted the proportion of missing data in each variable and then removed all observations with missing values.

We used pandas isnull() and mean() methods to determine the proportion of missing observations in each variable. The isnull() method created a Boolean vector per variable with True and False values indicating whether a value was missing. The mean() method took the average of these values and returned the proportion of missing data.
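
The mechanics of isnull() followed by mean() are easy to verify on a small example. The DataFrame below is made up for illustration; its column names are hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, np.nan],
    "x2": ["a", "b", None, "d"],
    "x3": [10, 20, 30, 40],
})

# isnull() flags each cell: True where the value is missing.
mask = toy.isnull()

# True counts as 1 and False as 0, so the column mean is the
# fraction of missing values in each variable.
fraction_missing = mask.mean()
print(fraction_missing)
```

Here, x1 misses two of four values, x2 misses one, and x3 is complete, so the fractions are 0.5, 0.25, and 0.0.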

We used pandas plot.bar() to create a bar plot of the fraction of missing data per variable. In Figure 1.2, we saw the fraction of nan per variable in the training and test sets.

To remove observations with missing values in any variable, we used pandas’ dropna(), thereby obtaining a complete case dataset.

Finally, we removed missing data using Feature-engine’s DropMissingData(). This imputer automatically identified and stored the variables with missing data from the train set when we called the fit() method. With the transform() method, the imputer removed observations with nan in those variables. With transform_x_y(), the imputer removed rows with nan from the data sets and then realigned the target variable.
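
Realigning the target relies on the index labels that survive the deletion. A minimal pandas sketch of the idea, on made-up data with a non-default index (this illustrates the principle; it is not Feature-engine’s code):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {"x1": [1.0, np.nan, 3.0, 4.0], "x2": [np.nan, 2.0, 3.0, 4.0]},
    index=[10, 11, 12, 13],
)
y = pd.Series([0, 1, 0, 1], index=[10, 11, 12, 13])

# Drop incomplete rows, then select the matching target values
# through the index labels that survived.
X_cca = X.dropna()
y_cca = y.loc[X_cca.index]

print(X_cca.index.to_list(), y_cca.to_list())
```

Selecting by label with .loc, rather than by position, keeps predictors and target in sync even after the data has been shuffled by a train/test split.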

See also

If you want to use DropMissingData() within a pipeline together with other Feature-engine or scikit-learn transformers, check out Feature-engine’s Pipeline: https://Feature-engine.trainindata.com/en/latest/user_guide/pipeline/Pipeline.html. This pipeline can align the target with the training and test sets after removing rows.

Performing mean or median imputation

Mean or median imputation consists of replacing missing data with the variable’s mean or median value. To avoid data leakage, we determine the mean or median using the train set, and then use these values to impute the train and test sets, and all future data.
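
The following minimal sketch, on made-up data, shows the principle: the median is computed from the train set only and then reused for every other dataset. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 100.0]})
test = pd.DataFrame({"x": [np.nan, 3.0, np.nan]})

# Learn the parameter from the train set only.
train_median = train["x"].median()

# Reuse the same value for the train set, the test set, and future data.
train_t = train.fillna({"x": train_median})
test_t = test.fillna({"x": train_median})

print(train_median, test_t["x"].to_list())
```

The test set never contributes to the statistic, so the evaluation on the test set is not influenced by information the model would not have at training time.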

Scikit-learn and Feature-engine learn the mean or median from the train set and store these parameters for future use out of the box.

In this recipe, we will perform mean and median imputation using pandas, scikit-learn, and feature-engine.

Note

Use mean imputation if variables are normally distributed and median imputation otherwise. Mean and median imputation may distort the variable distribution if there is a high percentage of missing data.
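
To see why the median is usually preferred for skewed variables, consider this small, made-up example, in which a single large value pulls the mean away from the bulk of the data:

```python
import pandas as pd

# A right-skewed toy variable: most values are small, one is very large.
skewed = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 90.0])

print(f"mean:   {skewed.mean():.1f}")
print(f"median: {skewed.median():.1f}")
```

The mean lands far above six of the seven observations, whereas the median stays among the typical values, which makes it the more representative replacement here.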

How to do it...

Let’s begin this recipe:

  1. First, we’ll import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import MeanMedianImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets with their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s make a list with the numerical variables by excluding variables of type object:
    numeric_vars = X_train.select_dtypes(
        exclude="O").columns.to_list()

    If you execute numeric_vars, you will see the names of the numerical variables: ['A2', 'A3', 'A8', 'A11', 'A14', 'A15'].

  5. Let’s capture the variables’ median values in a dictionary:
    median_values = X_train[
        numeric_vars].median().to_dict()

Tip

Note how we calculate the median using the train set. We will use these values to replace missing data in the train and test sets. To calculate the mean, use pandas mean() instead of median().

If you execute median_values, you will see a dictionary with the median value per variable: {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}.

  6. Let’s replace missing data with the median:
    X_train_t = X_train.fillna(value=median_values)
    X_test_t = X_test.fillna(value=median_values)

    If you execute X_train_t[numeric_vars].isnull().sum() after the imputation, the number of missing values in the numerical variables should be 0.

Note

pandas fillna() returns a new dataset with imputed values by default. To replace missing data in the original DataFrame, set the inplace parameter to True: X_train.fillna(value=median_values, inplace=True).

Now, let’s impute missing values with the median using scikit-learn.

  7. Let’s set up the imputer to replace missing data with the median:
    imputer = SimpleImputer(strategy="median")

Note

To perform mean imputation, set SimpleImputer() as follows: imputer = SimpleImputer(strategy = "mean").

  8. We restrict the imputation to the numerical variables by using ColumnTransformer():
    ct = ColumnTransformer(
        [("imputer", imputer, numeric_vars)],
        remainder="passthrough",
        force_int_remainder_cols=False,
    ).set_output(transform="pandas")

Note

Scikit-learn can return numpy arrays, pandas DataFrames, or polars DataFrames, depending on how we configure the transform output. By default, it returns numpy arrays.

  9. Let’s fit the imputer to the train set so that it learns the median values:
    ct.fit(X_train)
  10. Let’s check out the learned median values:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the median values per variable:

    array([ 28.835,   2.75,   1.,   0., 160.,   6.])
  11. Let’s replace missing values with the median:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)
  12. Let’s display the resulting training set:
    print(X_train_t.head())

    We see the resulting DataFrame in the following image:

Figure 1.3 – Training set after the imputation. The imputed variables are marked by the imputer prefix; the untransformed variables show the prefix remainder

Finally, let’s perform median imputation using feature-engine.

  13. Let’s set up the imputer to replace missing data in numerical variables with the median:
    imputer = MeanMedianImputer(
        imputation_method="median",
        variables=numeric_vars,
    )

Note

To perform mean imputation, change imputation_method to "mean". By default, MeanMedianImputer() will impute all numerical variables in the DataFrame, ignoring categorical variables. Use the variables argument to restrict the imputation to a subset of numerical variables.

  14. Fit the imputer so that it learns the median values:
    imputer.fit(X_train)
  15. Inspect the learned medians:
    imputer.imputer_dict_

    The previous command returns the median values in a dictionary:

    {'A2': 28.835, 'A3': 2.75, 'A8': 1.0, 'A11': 0.0, 'A14': 160.0, 'A15': 6.0}
  16. Finally, let’s replace the missing values with the median:
    X_train = imputer.transform(X_train)
    X_test = imputer.transform(X_test)

Feature-engine’s MeanMedianImputer() returns a DataFrame. You can check that the imputed variables do not contain missing values using X_train[numeric_vars].isnull().mean().

How it works...

In this recipe, we replaced missing data with the variable’s median values using pandas, scikit-learn, and feature-engine.

We divided the dataset into train and test sets using scikit-learn’s train_test_split() function. The function takes the predictor variables, the target, the fraction of observations to retain in the test set, and a random_state value for reproducibility, as arguments. It returned a train set with 70% of the original observations and a test set with 30% of the original observations. The 70:30 split was done at random.

To impute missing data with pandas, in step 5, we created a dictionary with the numerical variable names as keys and their medians as values. The median values were learned from the training set to avoid data leakage. To replace missing data, we applied pandas fillna() to train and test sets, passing the dictionary with the median values per variable as a parameter.

To replace the missing values with the median using scikit-learn, we used SimpleImputer() with the strategy set to "median". To restrict the imputation to numerical variables, we used ColumnTransformer(). With the remainder argument set to passthrough, we made ColumnTransformer() return all the variables seen in the training set in the transformed output, with the imputed ones followed by those that were not transformed.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.

In step 8, we set the output of the column transformer to pandas to obtain a DataFrame as a result. By default, ColumnTransformer() returns numpy arrays.

Note

From version 1.4.0, scikit-learn transformers can return numpy arrays, pandas DataFrames, or polars DataFrames as a result of the transform() method.

With fit(), SimpleImputer() learned the median of each numerical variable in the train set and stored them in its statistics_ attribute. With transform(), it replaced the missing values with the medians.
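
The fit/transform split can be observed on a tiny, made-up array. In this sketch, the names X_train_toy and X_test_toy are hypothetical:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train_toy = np.array([
    [1.0, 10.0],
    [3.0, np.nan],
    [np.nan, 30.0],
    [5.0, 50.0],
])
X_test_toy = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="median")

# fit() computes one median per column, ignoring missing values.
imputer.fit(X_train_toy)
print(imputer.statistics_)

# transform() fills the gaps with the stored medians.
X_test_filled = imputer.transform(X_test_toy)
print(X_test_filled)
```

The test row is filled entirely with values learned from the training array, because transform() never recomputes the statistics.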

To replace missing values with the median using Feature-engine, we used the MeanMedianImputer() with the imputation_method set to median. To restrict the imputation to a subset of variables, we passed the variable names in a list to the variables parameter. With fit(), the transformer learned and stored the median values per variable in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values, returning a pandas DataFrame.

Imputing categorical variables

We typically impute categorical variables with the most frequent category, or with a specific string. To avoid data leakage, we find the frequent categories from the train set. Then, we use these values to impute the train, test, and future datasets. scikit-learn and feature-engine find and store the frequent categories for the imputation, out of the box.

In this recipe, we will replace missing data in categorical variables with the most frequent category, or with an arbitrary string.

How to do it...

To begin, let’s make a few imports and prepare the data:

  1. Let’s import pandas and the required functions and classes from scikit-learn and feature-engine:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from feature_engine.imputation import CategoricalImputer
  2. Let’s load the dataset that we prepared in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s split the data into train and test sets and their respective targets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s capture the categorical variables in a list:
    categorical_vars = X_train.select_dtypes(
        include="O").columns.to_list()
  5. Let’s store the variables’ most frequent categories in a dictionary:
    frequent_values = X_train[
        categorical_vars].mode().iloc[0].to_dict()
  6. Let’s replace missing values with the frequent categories:
    X_train_t = X_train.fillna(value=frequent_values)
    X_test_t = X_test.fillna(value=frequent_values)

Note

fillna() returns a new DataFrame with the imputed values by default. We can replace missing data in the original DataFrame by executing X_train.fillna(value=frequent_values, inplace=True).

  7. To replace missing data with a specific string, let’s create an imputation dictionary with the categorical variable names as the keys and an arbitrary string as the values:
    imputation_dict = {
        var: "no_data" for var in categorical_vars}

    Now, we can use this dictionary and the code in step 6 to replace missing data.

Note

With pandas value_counts(), we can see the string added by the imputation. Try executing, for example, X_train_t["A1"].value_counts() after imputing with this dictionary.

Now, let’s impute missing values with the most frequent category using scikit-learn.

  8. Let’s set up the imputer to find the most frequent category per variable:
    imputer = SimpleImputer(strategy='most_frequent')

Note

SimpleImputer() will learn the mode for numerical and categorical variables alike. But in practice, mode imputation is done for categorical variables only.

  9. Let’s restrict the imputation to the categorical variables:
    ct = ColumnTransformer(
        [("imputer", imputer, categorical_vars)],
        remainder="passthrough",
    ).set_output(transform="pandas")

Note

To impute missing data with a string instead of the most frequent category, set SimpleImputer() as follows: imputer = SimpleImputer(strategy="constant", fill_value="missing").

  10. Fit the imputer to the train set so that it learns the most frequent values:
    ct.fit(X_train)
  11. Let’s take a look at the most frequent values learned by the imputer:
    ct.named_transformers_.imputer.statistics_

    The previous command returns the most frequent values per variable:

    array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)
  12. Finally, let’s replace missing values with the frequent categories:
    X_train_t = ct.transform(X_train)
    X_test_t = ct.transform(X_test)

    Make sure to inspect the resulting DataFrames by executing X_train_t.head().

Note

The ColumnTransformer() changes the names of the variables. The imputed variables show the prefix imputer and the untransformed variables the prefix remainder.

Finally, let’s impute missing values using feature-engine.

  13. Let’s set up the imputer to replace the missing data in categorical variables with their most frequent value:
    imputer = CategoricalImputer(
        imputation_method="frequent",
        variables=categorical_vars,
    )

Note

With the variables parameter set to None, CategoricalImputer() will automatically impute all categorical variables found in the train set. Use this parameter to restrict the imputation to a subset of categorical variables, as shown in step 13.

  14. Fit the imputer to the train set so that it learns the most frequent categories:
    imputer.fit(X_train)

Note

To impute categorical variables with a specific string, set imputation_method to missing and fill_value to the desired string.

  15. Let’s check out the learned categories:
    imputer.imputer_dict_

    We can see the dictionary with the most frequent values in the following output:

    {'A1': 'b',
     'A4': 'u',
     'A5': 'g',
     'A6': 'c',
     'A7': 'v',
     'A9': 't',
     'A10': 'f',
     'A12': 'f',
     'A13': 'g'}
  16. Finally, let’s replace the missing values with frequent categories:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

    If you want to impute numerical variables with a string or the most frequent value using CategoricalImputer(), set the ignore_format parameter to True.

CategoricalImputer() returns a pandas DataFrame as a result.

How it works...

In this recipe, we replaced missing values in categorical variables with the most frequent categories or an arbitrary string. We used pandas, scikit-learn, and feature-engine.

In step 5, we created a dictionary with the variable names as keys and the frequent categories as values. To capture the frequent categories, we used pandas mode(), and to return a dictionary, we used pandas to_dict(). To replace the missing data, we used pandas fillna(), passing the dictionary with the variables and their frequent categories as a parameter. There can be more than one mode in a variable; that’s why we made sure to capture only one of those values by using .iloc[0].
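
The need for .iloc[0] becomes clear with a variable that has two equally frequent categories. The data below is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "blue", None],
    "size": ["S", "M", "M", "L", None],
})

# "colour" has two modes, so mode() returns two rows;
# "size" has a single mode and is padded with NaN in the second row.
modes = toy.mode()
print(modes)

# Taking the first row keeps exactly one category per variable.
frequent_values = modes.iloc[0].to_dict()
print(frequent_values)
```

Because mode() returns the tied categories in sorted order, the first row is a deterministic, if arbitrary, choice among them.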

To replace the missing values using scikit-learn, we used SimpleImputer() with the strategy set to most_frequent. To restrict the imputation to categorical variables, we used ColumnTransformer(). With remainder set to passthrough, we made ColumnTransformer() return all the variables present in the training set as a result of the transform() method.

Note

ColumnTransformer() changes the names of the variables in the output. The transformed variables show the prefix imputer and the unchanged variables show the prefix remainder.

With fit(), SimpleImputer() learned the variables’ most frequent categories and stored them in its statistics_ attribute. With transform(), it replaced the missing data with the learned parameters.

SimpleImputer() and ColumnTransformer() return NumPy arrays by default. We can change this behavior with the set_output() method.

To replace missing values with feature-engine, we used the CategoricalImputer() with imputation_method set to frequent. With fit(), the transformer learned and stored the most frequent categories in a dictionary in its imputer_dict_ attribute. With transform(), it replaced the missing values with the learned parameters.

Unlike SimpleImputer(), CategoricalImputer() will only impute categorical variables, unless specifically told not to do so by setting the ignore_format parameter to True. In addition, with feature-engine transformers we can restrict the transformations to a subset of variables through the transformer itself. For scikit-learn transformers, we need the additional ColumnTransformer() class to apply the transformation to a subset of the variables.

Replacing missing values with an arbitrary number

We can replace missing data with an arbitrary value. Commonly used values are 999, 9999, or -1 for positive distributions. This method is used for numerical variables. For categorical variables, the equivalent method is to replace missing data with an arbitrary string, as described in the Imputing categorical variables recipe.

When replacing missing values with arbitrary numbers, we need to be careful not to select a value close to the mean, the median, or any other common value of the distribution.
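
A quick sanity check, sketched here on made-up values, is to confirm that the candidate number lies outside the observed range of the variable before using it:

```python
import numpy as np
import pandas as pd

x = pd.Series([21.0, 35.5, np.nan, 48.0, 76.75, np.nan])

candidate = 99

# A value outside the observed range cannot collide with real observations.
outside_range = not (x.min() <= candidate <= x.max())
print(outside_range)

# After the imputation, the imputed rows are easy to identify.
x_t = x.fillna(candidate)
n_imputed = int((x_t == candidate).sum())
print(n_imputed)
```

If the check fails, the imputed values become indistinguishable from genuine observations, which defeats the purpose of flagging missingness with an arbitrary number.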

Note

We’d use arbitrary number imputation when data is not missing at random, when we use non-linear models, or when the percentage of missing data is high. This imputation technique distorts the original variable distribution.

In this recipe, we will impute missing data with arbitrary numbers using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by importing the necessary tools and loading the data:

  1. Import pandas and the required functions and classes:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from feature_engine.imputation import ArbitraryNumberImputer
  2. Let’s load the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s separate the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )

    We will select arbitrary values greater than the maximum value of the distribution.

  4. Let’s find the maximum value of four numerical variables:
    X_train[['A2','A3', 'A8', 'A11']].max()

    The previous command returns the following output:

    A2     76.750
    A3     26.335
    A8     28.500
    A11    67.000
    dtype: float64

    We’ll use 99 for the imputation because it is bigger than the maximum values of the numerical variables in step 4.

  5. Let’s make a copy of the original DataFrames:
    X_train_t = X_train.copy()
    X_test_t = X_test.copy()
  6. Now, we replace the missing values with 99:
    X_train_t[["A2", "A3", "A8", "A11"]] = X_train_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)
    X_test_t[["A2", "A3", "A8", "A11"]] = X_test_t[[
        "A2", "A3", "A8", "A11"]].fillna(99)

Note

To impute different variables with different values using pandas fillna(), use a dictionary like this: imputation_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}.

Now, we’ll impute missing values with an arbitrary number using scikit-learn.

  7. Let’s set up imputer to replace missing values with 99:
    imputer = SimpleImputer(strategy='constant', fill_value=99)

Note

If your dataset contains categorical variables, SimpleImputer() will also replace missing values in those variables with 99.

  8. Let’s fit imputer to a slice of the train set containing the variables to impute:
    variables = ["A2", "A3", "A8", "A11"]
    imputer.fit(X_train[variables])
  9. Replace the missing values with 99 in the desired variables:
    X_train_t[variables] = imputer.transform(X_train[variables])
    X_test_t[variables] = imputer.transform(X_test[variables])

    Go ahead and verify that no missing values remain by executing X_test_t[["A2", "A3", "A8", "A11"]].isnull().sum().

    To finish, let’s impute missing values using feature-engine.

  10. Let’s set up the imputer to replace missing values with 99 in 4 specific variables:
    imputer = ArbitraryNumberImputer(
        arbitrary_number=99,
        variables=["A2", "A3", "A8", "A11"],
    )

Note

ArbitraryNumberImputer() will automatically select all numerical variables in the train set for imputation if we set the variables parameter to None.

  11. Finally, let’s replace the missing values with 99:
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

Note

To impute different variables with different numbers, set up ArbitraryNumberImputer() as follows: ArbitraryNumberImputer(imputer_dict = {"A2": -1, "A3": -1, "A8": 999, "A11": 9999}).

We have now replaced missing data with arbitrary numbers using three different open-source libraries.

How it works...

In this recipe, we replaced missing values in numerical variables with an arbitrary number using pandas, scikit-learn, and feature-engine.

To determine which arbitrary value to use, we inspected the maximum values of four numerical variables using pandas’ max(). We chose 99 because it was greater than the maximum values of the selected variables. In step 6, we used pandas fillna() to replace the missing data.
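
When different variables need different arbitrary numbers, fillna() accepts a dictionary. A minimal sketch on made-up data follows; the column names echo the dataset, but the values are hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "A2": [30.0, np.nan, 45.0],
    "A8": [np.nan, 1.5, 2.0],
    "A9": ["t", None, "f"],
})

# Each key is a variable name; each value is that variable's replacement.
imputation_dict = {"A2": -1, "A8": 999}
toy_t = toy.fillna(value=imputation_dict)
print(toy_t)
```

Variables that are absent from the dictionary, such as A9 here, are left untouched.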

To replace missing values using scikit-learn, we utilized SimpleImputer(), with the strategy set to constant, and specified 99 in the fill_value argument. Next, we fitted the imputer to a slice of the train set with the numerical variables to impute. Finally, we replaced missing values using transform().

To replace missing values with feature-engine, we used ArbitraryNumberImputer(), specifying the value 99 and the variables to impute as parameters. Next, we applied the fit_transform() method to replace missing data in the train set and the transform() method to replace missing data in the test set.

Finding extreme values for imputation

Replacing missing values with a value at the end of the variable distribution (extreme values) is like replacing them with an arbitrary value, but instead of setting the arbitrary values manually, the values are automatically selected from the end of the variable distribution.

We can replace missing data with a value that is greater or smaller than most values in the variable. To select a value that is greater, we can use the mean plus a factor of the standard deviation. Alternatively, we can set it to the 75th quantile + IQR × 1.5. IQR stands for inter-quartile range and is the difference between the 75th and 25th quantile. To replace missing data with values that are smaller than the remaining values, we can use the mean minus a factor of the standard deviation, or the 25th quantile – IQR × 1.5.
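The formulas above can be checked by hand on a tiny example. The following sketch uses a made-up series (the values are an assumption chosen so that the quartiles are easy to verify) and computes both tails with the IQR rule and with the mean and standard deviation:

```python
import numpy as np
import pandas as pd

# Hypothetical variable with one missing value; pandas ignores nan
# when computing quantiles, the mean, and the standard deviation.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

q25, q75 = s.quantile(0.25), s.quantile(0.75)  # 2.0 and 4.0
iqr = q75 - q25                                # 2.0

# IQR proximity rule
right_iqr = q75 + 1.5 * iqr   # 4.0 + 3.0 = 7.0
left_iqr = q25 - 1.5 * iqr    # 2.0 - 3.0 = -1.0

# Mean plus or minus a factor of the standard deviation
right_gauss = s.mean() + 3 * s.std()
left_gauss = s.mean() - 3 * s.std()

print(right_iqr, left_iqr, right_gauss, left_gauss)
```

Both right-tail values fall above the largest observed value (5.0), which is what makes the imputed observations stand out from the rest of the distribution.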

Note

End-of-tail imputation may distort the distribution of the original variables, so it may not be suitable for linear models.

In this recipe, we will implement end-of-tail or extreme value imputation using pandas and feature-engine.

How to do it...

To begin this recipe, let’s import the necessary tools and load the data:

  1. Let’s import pandas and the required function and class:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from feature_engine.imputation import EndTailImputer
  2. Let’s load the dataset we described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
  3. Let’s capture the numerical variables in a list, excluding the target:
    numeric_vars = [
        var for var in data.select_dtypes(
            exclude="O").columns.to_list()
        if var != "target"
    ]
  4. Let’s split the data into train and test sets, keeping only the numerical variables:
    X_train, X_test, y_train, y_test = train_test_split(
        data[numeric_vars],
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  5. We’ll now determine the IQR:
    IQR = X_train.quantile(0.75) - X_train.quantile(0.25)

    We can visualize the IQR values by executing IQR or print(IQR):

    A2      16.4200
    A3       6.5825
    A8       2.8350
    A11      3.0000
    A14    212.0000
    A15    450.0000
    dtype: float64
  6. Let’s create a dictionary with the variable names and the imputation values:
    imputation_dict = (
        X_train.quantile(0.75) + 1.5 * IQR).to_dict()

Note

If we use the inter-quartile range proximity rule, we determine the imputation values by adding 1.5 times the IQR to the 75th quantile. If variables are normally distributed, we can calculate the imputation values as the mean plus a factor of the standard deviation, imputation_dict = (X_train.mean() + 3 * X_train.std()).to_dict().

  7. Finally, let’s replace the missing data:
    X_train_t = X_train.fillna(value=imputation_dict)
    X_test_t = X_test.fillna(value=imputation_dict)

Note

We can also replace missing data with values at the left tail of the distribution using value = X_train[var].quantile(0.25) - 1.5 * IQR[var] or value = X_train[var].mean() - 3 * X_train[var].std().

To finish, let’s impute missing values using feature-engine.

  8. Let’s set up the imputer to estimate a value at the right of the distribution using the IQR proximity rule, here with a factor of 3 (fold=3):
    imputer = EndTailImputer(
        imputation_method="iqr",
        tail="right",
        fold=3,
        variables=None,
    )

Note

To use the mean and standard deviation to calculate the replacement values, set imputation_method="gaussian". Use left or right in the tail argument to specify the side of the distribution to consider when finding values for the imputation.

  9. Let’s fit EndTailImputer() to the train set so that it learns the values for the imputation:
    imputer.fit(X_train)
  10. Let’s inspect the learned values:
    imputer.imputer_dict_

    The previous command returns a dictionary with the values to use to impute each variable:

    {'A2': 88.18,
     'A3': 27.31,
     'A8': 11.504999999999999,
     'A11': 12.0,
     'A14': 908.0,
     'A15': 1800.0}
  11. Finally, let’s replace the missing values:
    X_train = imputer.transform(X_train)
    X_test = imputer.transform(X_test)

Remember that you can corroborate that the missing values were replaced by using X_train[['A2','A3', 'A8', 'A11', 'A14', 'A15']].isnull().mean().

How it works...

In this recipe, we replaced missing values in numerical variables with a number at the end of the distribution using pandas and feature-engine.

We determined the imputation values according to the formulas described in the introduction to this recipe. We used pandas quantile() to find specific quantile values, or pandas mean() and std() for the mean and standard deviation. With pandas fillna() we replaced the missing values.

To replace missing values with EndTailImputer() from feature-engine, we set imputation_method to iqr to calculate the values based on the IQR proximity rule. With tail set to right, the transformer found the imputation values from the right of the distribution. With fit(), the imputer learned and stored the values for the imputation in a dictionary in the imputer_dict_ attribute. With transform(), we replaced the missing values, returning DataFrames.

Marking imputed values

In the previous recipes, we focused on replacing missing data with estimates of their values. In addition, we can add missing indicators to mark observations where values were missing.

A missing indicator is a binary variable that takes the value 1 or True to indicate whether a value was missing, and 0 or False otherwise. It is common practice to replace missing observations with the mean, median, or most frequent category while simultaneously marking those missing observations with missing indicators. In this recipe, we will learn how to add missing indicators using pandas, scikit-learn, and feature-engine.

How to do it...

Let’s begin by making some imports and loading the data:

  1. Let’s import the required libraries, functions, and classes:
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from feature_engine.imputation import (
        AddMissingIndicator,
        CategoricalImputer,
        MeanMedianImputer
    )
  2. Let’s load and split the dataset described in the Technical requirements section:
    data = pd.read_csv("credit_approval_uci.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  3. Let’s capture the variable names in a list:
    varnames = ["A1", "A3", "A4", "A5", "A6", "A7", "A8"]
  4. Let’s create names for the missing indicators and store them in a list:
    indicators = [f"{var}_na" for var in varnames]

    If we execute indicators, we will see the names we will use for the new variables: ['A1_na', 'A3_na', 'A4_na', 'A5_na', 'A6_na', 'A7_na', 'A8_na'].

  5. Let’s make a copy of the original DataFrames:
    X_train_t = X_train.copy()
    X_test_t = X_test.copy()
  6. Let’s add the missing indicators:
    X_train_t[indicators] = X_train[
        varnames].isna().astype(int)
    X_test_t[indicators] = X_test[
        varnames].isna().astype(int)

Note

If you want the indicators to have True and False as values instead of 0 and 1, remove astype(int) in step 6.

  7. Let’s inspect the resulting DataFrame:
    X_train_t.head()

    We can see the newly added variables at the right of the DataFrame in the following image:

Figure 1.4 – DataFrame with the missing indicators


Now, let’s add missing indicators using Feature-engine instead.

  8. Set up the imputer to add binary indicators to every variable with missing data:
    imputer = AddMissingIndicator(
        variables=None, missing_only=True
        )
  9. Fit the imputer to the train set so that it finds the variables with missing data:
    imputer.fit(X_train)

Note

If we execute imputer.variables_, we will find the variables for which missing indicators will be added.

  10. Finally, let’s add the missing indicators:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

    So far, we just added missing indicators. But we still have the missing data in our variables. We need to replace them with numbers. In the rest of this recipe, we will combine the use of missing indicators with mean and mode imputation.

  11. Let’s create a pipeline to add missing indicators to categorical and numerical variables, then impute categorical variables with the most frequent category, and numerical variables with the mean:
    pipe = Pipeline([
        ("indicators",
            AddMissingIndicator(missing_only=True)),
        ("categorical", CategoricalImputer(
            imputation_method="frequent")),
        ("numerical", MeanMedianImputer()),
    ])

Note

feature-engine imputers automatically identify numerical or categorical variables. So there is no need to slice the data or pass the variable names as arguments to the transformers in this case.

  12. Let’s add the indicators and impute missing values:
    X_train_t = pipe.fit_transform(X_train)
    X_test_t = pipe.transform(X_test)

Note

Use X_train_t.isnull().sum() to corroborate that there is no data missing. Execute X_train_t.head() to get a view of the resulting DataFrame.

Finally, let’s add missing indicators and simultaneously impute numerical and categorical variables with the mean and most frequent categories respectively, utilizing scikit-learn.

  13. Let’s make a list with the names of the numerical and categorical variables:
    numvars = X_train.select_dtypes(
        exclude="O").columns.to_list()
    catvars = X_train.select_dtypes(
        include="O").columns.to_list()
  14. Let’s set up a pipeline to perform mean and frequent category imputation while marking the missing data:
    pipe = ColumnTransformer([
        ("num_imputer", SimpleImputer(
            strategy="mean",
            add_indicator=True),
        numvars),
        ("cat_imputer", SimpleImputer(
            strategy="most_frequent",
            add_indicator=True),
        catvars),
    ]).set_output(transform="pandas")
  15. Now, let’s carry out the imputation:
    X_train_t = pipe.fit_transform(X_train)
    X_test_t = pipe.transform(X_test)

Make sure to explore X_train_t.head() to get familiar with the pipeline’s output.

How it works...

To add missing indicators using pandas, we used isna(), which created a new vector assigning the value of True if there was a missing value or False otherwise. We used astype(int) to convert the Boolean vectors into binary vectors with values 1 and 0.
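The pandas approach is easy to verify on a tiny DataFrame. In the following sketch, toy and its values are made up for illustration; the variable names mirror those used in the recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per variable.
toy = pd.DataFrame({
    "A1": ["a", np.nan, "b"],
    "A3": [1.5, 2.0, np.nan],
})

varnames = ["A1", "A3"]
indicators = [f"{var}_na" for var in varnames]

toy_t = toy.copy()
# isna() returns True/False; astype(int) converts them to 1/0.
toy_t[indicators] = toy[varnames].isna().astype(int)

print(toy_t)
```

The original variables keep their nan values; the indicators only record where those values were.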

To add a missing indicator with feature-engine, we used AddMissingIndicator(). With fit() the transformer found the variables with missing data. With transform() it appended the missing indicators to the right of the train and test sets.

To sequentially add missing indicators and then replace the nan values with the most frequent category or the mean, we lined up Feature-engine’s AddMissingIndicator(), CategoricalImputer(), and MeanMedianImputer() within a pipeline. The fit() method from the pipeline made the transformers find the variables with nan and calculate the mean of the numerical variables and the mode of the categorical variables. The transform() method from the pipeline made the transformers add the missing indicators and then replace the missing values with their estimates.

Note

Feature-engine transformers return DataFrames that preserve the original names and order of the variables. Scikit-learn’s ColumnTransformer(), on the other hand, changes the variables’ names and order in the resulting data.

Finally, we added missing indicators and replaced missing data with the mean and most frequent category using scikit-learn. We lined up two instances of SimpleImputer(), the first to impute data with the mean and the second to impute data with the most frequent category. In both cases, we set the add_indicator parameter to True to add the missing indicators. We wrapped SimpleImputer() with ColumnTransformer() to specifically modify numerical or categorical variables. Then we used the fit() and transform() methods from the pipeline to train the transformers and then add the indicators and replace the missing data.

When returning DataFrames, ColumnTransformer() changes the names of the variables and their order. Take a look at the result from step 15 by executing X_train_t.head(). You’ll see that the name given to each step of the pipeline is added as a prefix to the variables to flag which variable was modified with each transformer. For example, num_imputer__A2 was returned by the first step of the pipeline, while cat_imputer__A12 was returned by the second step of the pipeline.

There’s more…

Scikit-learn has the MissingIndicator() transformer that just adds missing indicators. Check it out in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html and find an example in the accompanying GitHub repository at https://github.com/PacktPublishing/Python-Feature-engineering-Cookbook-Third-Edition/blob/main/ch01-missing-data-imputation/Recipe-06-Marking-imputed-values.ipynb.
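As a quick illustration of MissingIndicator(), the following sketch applies it to a small made-up array (the values are an assumption). With its default settings, the transformer returns Boolean flags only for the features that showed missing data during fit():

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Hypothetical array: columns 0 and 1 contain nan, column 2 is complete.
X = np.array([
    [1.0, np.nan, 7.0],
    [2.0, 3.0, 8.0],
    [np.nan, 4.0, 9.0],
])

indicator = MissingIndicator()  # default: features="missing-only"
flags = indicator.fit_transform(X)

print(indicator.features_)  # indices of the features with missing data
print(flags)
```

Note that the transformer returns only the indicators, not the original variables, so the flags need to be joined back to the data if we want both.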

Implementing forward and backward fill

Time series data also show missing values. To impute missing data in time series, we use specific methods. Forward fill imputation involves filling each missing value with the most recent non-missing value that precedes it in the data sequence. In other words, we carry the last seen value forward until the next valid value. Backward fill imputation involves filling each missing value with the next non-missing value that follows it in the data sequence. In other words, we carry the next valid value backward until the preceding valid value.

In this recipe, we will replace missing data in a time series with forward and backward fills.

How to do it...

Let’s begin by importing the required libraries and time series dataset:

  1. Let’s import pandas and matplotlib:
    import matplotlib.pyplot as plt
    import pandas as pd
  2. Let’s load the air passengers dataset that we described in the Technical requirements section and display the first five rows of the time series:
    df = pd.read_csv(
        "air_passengers.csv",
        parse_dates=["ds"],
        index_col=["ds"],
    )
    print(df.head())

    We see the time series in the following output:

                    y
    ds
    1949-01-01  112.0
    1949-02-01  118.0
    1949-03-01  132.0
    1949-04-01  129.0
    1949-05-01  121.0

Note

You can determine the percentage of missing data by executing df.isnull().mean().

  3. Let’s plot the time series to spot any obvious data gaps:
    ax = df.plot(marker=".", figsize=[10, 5], legend=None)
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see intervals of time where data is missing:

Figure 1.5 – Time series data showing missing values


  4. Let’s impute missing data by carrying the last observed value in any interval to the next valid value:
    df_imputed = df.ffill()

    You can verify the absence of missing data by executing df_imputed.isnull().sum().

  5. Let’s now plot the complete dataset and overlay as a dotted line the values used for the imputation:
    ax = df_imputed.plot(
        linestyle="-", marker=".", figsize=[10, 5])
    df_imputed[df.isnull()].plot(
        ax=ax, legend=None, marker=".", color="r")
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see the values used to replace nan as dotted lines overlaid in between the continuous time series lines:

Figure 1.6 – Time series data where missing values were replaced by the last seen observations (dotted line)


  6. Alternatively, we can impute missing data using backward fill:
    df_imputed = df.bfill()

    If we plot the imputed dataset and overlay the imputation values as we did in step 5, we’ll see the following plot:

Figure 1.7 – Time series data where missing values were replaced by backward fill (dotted line)


Note

The heights of the values used in the imputation are different in Figures 1.6 and 1.7. In Figure 1.6, we carry the last value forward, hence the height is lower. In Figure 1.7, we carry the next value backward, hence the height is higher.

We’ve now obtained complete datasets that we can use for time series analysis and modeling.

How it works...

pandas ffill() takes the last seen value in any temporal gap in a time series and propagates it forward to the next observed value. Hence, in Figure 1.6 we see the dotted overlay corresponding to the imputation values at the height of the last seen observation.

pandas bfill() takes the next valid value in any temporal gap in a time series and propagates it backward to the previously observed value. Hence, in Figure 1.7 we see the dotted overlay corresponding to the imputation values at the height of the next observation in the gap.

By default, ffill() and bfill() will impute all values between valid observations. We can restrict the imputation to a maximum number of data points in any interval by setting a limit, using the limit parameter in both methods. For example, ffill(limit=10) will only replace the first 10 data points in any gap.
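The behavior of ffill(), bfill(), and the limit parameter can be seen on a short made-up monthly series with a gap of three observations (the values and dates below are assumptions for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("1949-01-01", periods=5, freq="MS")
s = pd.Series([112.0, np.nan, np.nan, np.nan, 121.0], index=idx)

forward = s.ffill()          # the gap takes the last seen value, 112.0
backward = s.bfill()         # the gap takes the next valid value, 121.0
limited = s.ffill(limit=2)   # only the first two points of the gap are filled

print(pd.DataFrame({
    "original": s,
    "ffill": forward,
    "bfill": backward,
    "ffill_limit_2": limited,
}))
```

With limit=2, the third point in the gap remains nan, so a further imputation step would be needed to obtain a complete series.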

Carrying out interpolation

We can impute missing data in time series by using interpolation between two non-missing data points. Interpolation is the estimation of one or more values in a range by means of a function. In linear interpolation, we fit a linear function between the last observed value and the next valid point. In spline interpolation, we fit a low-degree polynomial between the last and next observed values. The idea of using interpolation is to obtain better estimates of the missing data.

In this recipe, we’ll carry out linear and spline interpolation in a time series.

How to do it...

Let’s begin by importing the required libraries and time series dataset.

  1. Let’s import pandas and matplotlib:
    import matplotlib.pyplot as plt
    import pandas as pd
  2. Let’s load the time series data described in the Technical requirements section:
    df = pd.read_csv(
        "air_passengers.csv",
        parse_dates=["ds"],
        index_col=["ds"],
    )

Note

You can plot the time series to find data gaps as we did in step 3 of the Implementing forward and backward fill recipe.

  3. Let’s impute the missing data by linear interpolation:
    df_imputed = df.interpolate(method="linear")

Note

If the time intervals between rows are not uniform then the method should be set to time to achieve a linear fit.

You can verify the absence of missing data by executing df_imputed.isnull().sum().

  4. Let’s now plot the complete dataset and overlay as a dotted line the values used for the imputation:
    ax = df_imputed.plot(
        linestyle="-", marker=".", figsize=[10, 5])
    df_imputed[df.isnull()].plot(
        ax=ax, legend=None, marker=".", color="r")
    ax.set_title("Air passengers")
    ax.set_ylabel("Number of passengers")
    ax.set_xlabel("Time")

    The previous code returns the following plot, where we see the values used to replace nan as dotted lines in between the continuous line of the time series:

Figure 1.8 – Time series data where missing values were replaced by linear interpolation between the last and next valid data points (dotted line)


  5. Alternatively, we can impute missing data by doing spline interpolation. We’ll use a polynomial of the second degree:
    df_imputed = df.interpolate(method="spline", order=2)

    If we plot the imputed dataset and overlay the imputation values as we did in step 4, we’ll see the following plot:

Figure 1.9 – Time series data where missing values were replaced by fitting a second-degree polynomial between the last and next valid data points (dotted line)


Note

Change the degree of the polynomial used in the interpolation to see how the replacement values vary.

We’ve now obtained complete datasets that we can use for analysis and modeling.

How it works...

pandas interpolate() fills missing values in a range by using an interpolation method. When we set the method to linear, interpolate() treats all data points as equidistant and fits a line between the last and next valid points in an interval with missing data.

Note

If you want to perform linear interpolation, but your data points are not equally distanced, set method to time.

We then performed spline interpolation with a second-degree polynomial by setting method to spline and order to 2.

pandas interpolate() uses scipy.interpolate.interp1d and scipy.interpolate.UnivariateSpline under the hood, and can therefore implement other interpolation methods. Check out pandas documentation for more details at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html.
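The difference between the linear and time methods shows up as soon as the timestamps are unevenly spaced. The following sketch uses a made-up three-point series (dates and values are assumptions) where the missing observation sits one day after the first point and two days before the last:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"])
s = pd.Series([0.0, np.nan, 3.0], index=idx)

# "linear" ignores the index and treats the points as equally spaced,
# so the missing value lands halfway between 0.0 and 3.0.
by_position = s.interpolate(method="linear")

# "time" uses the timestamps: 2020-01-02 is one third of the way
# from 2020-01-01 to 2020-01-04.
by_time = s.interpolate(method="time")

print(by_position.iloc[1], by_time.iloc[1])
```

Here, linear returns 1.5 and time returns 1.0 for the missing observation.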

See also

While interpolation aims to get better estimates of the missing data compared to forward and backward fill, these estimates may still not be accurate if the time series shows strong trend and seasonality. To obtain better estimates of the missing data in these types of time series, check out time series decomposition followed by interpolation in the Feature Engineering for Time Series Course at https://www.trainindata.com/p/feature-engineering-for-forecasting.

Performing multivariate imputation by chained equations

Multivariate imputation methods, as opposed to univariate imputation, use multiple variables to estimate the missing values. Multivariate Imputation by Chained Equations (MICE) models each variable with missing values as a function of the remaining variables in the dataset. The output of that function is used to replace missing data.

MICE involves the following steps:

  1. First, it performs a simple univariate imputation, for example, median imputation, on every variable with missing data.
  2. Next, it selects one specific variable, say, var_1, and sets the missing values back to missing.
  3. It trains a model to predict var_1 using the other variables as input features.
  4. Finally, it replaces the missing values of var_1 with the output of the model.

MICE repeats steps 2 to 4 for each of the remaining variables.

An imputation cycle concludes once all the variables have been modeled. MICE carries out multiple imputation cycles, typically 10. That is, we repeat steps 2 to 4 for each variable 10 times. The idea is that by the end of the cycles, we should have found the best possible estimates of the missing data for each variable.
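The steps above can be sketched as a short loop. This is a simplified illustration under stated assumptions, not scikit-learn’s implementation: mice_sketch is a hypothetical helper, the first imputation uses the mean, the variables are visited in column order, and the data is simulated.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge


def mice_sketch(X, n_cycles=10):
    """Hypothetical helper illustrating the chained-equations loop."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: simple univariate imputation (column means here).
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_cycles):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this variable
            # Step 2: treat the originally missing values of var j as missing.
            others = np.delete(filled, j, axis=1)
            # Step 3: train on the rows where var j was observed.
            model = BayesianRidge().fit(others[~rows], X[~rows, j])
            # Step 4: replace the missing values with the predictions.
            filled[rows, j] = model.predict(others[rows])
    return filled


# Simulated data: x2 depends on x1, and about 20% of x2 is set to nan.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])
X_missing = X.copy()
X_missing[rng.random(200) < 0.2, 1] = np.nan

X_filled = mice_sketch(X_missing)
```

Because x2 is strongly related to x1 in this simulation, the model-based estimates should land much closer to the true values than the column mean would.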

Note

Multivariate imputation can be a useful alternative to univariate imputation in situations where we don’t want to distort the variable distributions. Multivariate imputation is also useful when we are interested in having good estimates of the missing data.

In this recipe, we will implement MICE using scikit-learn.

How to do it...

To begin the recipe, let’s import the required libraries and load the data:

  1. Let’s import the required Python libraries, classes, and functions:
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import BayesianRidge
    from sklearn.experimental import (
        enable_iterative_imputer
    )
    from sklearn.impute import (
        IterativeImputer,
        SimpleImputer
    )
  2. Let’s load some numerical variables from the dataset described in the Technical requirements section:
    variables = [
        "A2", "A3", "A8", "A11", "A14", "A15", "target"]
    data = pd.read_csv(
        "credit_approval_uci.csv",
        usecols=variables)
  3. Let’s divide the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s create a MICE imputer using Bayes regression, specifying the number of iteration cycles and setting random_state for reproducibility:
    imputer = IterativeImputer(
        estimator=BayesianRidge(),
        max_iter=10,
        random_state=0,
    ).set_output(transform="pandas")

Note

IterativeImputer() contains other useful arguments. For example, we can specify the first imputation strategy using the initial_strategy parameter. We can choose from the mean, median, mode, or arbitrary imputation. We can also specify how we want to cycle over the variables, either randomly or from the one with the fewest missing values to the one with the most.

  5. Let’s fit IterativeImputer() so that it trains the estimators to predict the missing values in each variable:
    imputer.fit(X_train)

Note

We can use any regression model to estimate the missing data with IterativeImputer().

  6. Finally, let’s fill in the missing values in both the train and test sets:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

Note

To corroborate the lack of missing data, we can execute X_train_t.isnull().sum().

To wrap up the recipe, let’s impute the variables with a simple univariate imputation method and compare the effect on the variables’ distribution.

  7. Let’s set up scikit-learn’s SimpleImputer() to perform mean imputation, and then transform the datasets:
    imputer_simple = SimpleImputer(
        strategy="mean").set_output(transform="pandas")
    X_train_s = imputer_simple.fit_transform(X_train)
    X_test_s = imputer_simple.transform(X_test)
  8. Let’s now make a histogram of the A3 variable after MICE imputation, followed by a histogram of the same variable after mean imputation:
    fig, axes = plt.subplots(
        2, 1, figsize=(10, 10), squeeze=False)
    X_test_t["A3"].hist(
        bins=50, ax=axes[0, 0], color="blue")
    X_test_s["A3"].hist(
        bins=50, ax=axes[1, 0], color="green")
    axes[0, 0].set_ylabel('Number of observations')
    axes[1, 0].set_ylabel('Number of observations')
    axes[0, 0].set_xlabel('A3')
    axes[1, 0].set_xlabel('A3')
    axes[0, 0].set_title('MICE')
    axes[1, 0].set_title('Mean imputation')
    plt.show()

    In the following plot, we see that mean imputation distorts the variable distribution, with more observations toward the mean value:

Figure 1.10 – Histogram of variable A3 after MICE imputation (top) or mean imputation (bottom), showing the distortion in the variable distribution caused by the latter
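The distortion caused by mean imputation can also be quantified. In this minimal sketch on a made-up series (the values are an assumption), every imputed observation sits exactly at the mean, so the standard deviation shrinks:

```python
import numpy as np
import pandas as pd

# Hypothetical variable with two missing values.
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 9.0])

s_mean = s.fillna(s.mean())  # both nan become 4.0, the observed mean

std_before = s.std()       # computed on the four observed values
std_after = s_mean.std()   # computed on six values, two of them at the mean

print(std_before, std_after)
```

The larger the proportion of missing data, the stronger this shrinkage becomes.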


How it works...

In this recipe, we performed multivariate imputation using IterativeImputer() from scikit-learn. When we fit the model, IterativeImputer() carried out the steps that we described in the introduction of the recipe. That is, it imputed all variables with the mean. Then it selected one variable and set its missing values back to missing. And finally, it fitted a Bayes regressor to estimate that variable based on the others. It repeated this procedure for each variable. That was one cycle of imputation. We set it to repeat this process 10 times. By the end of this procedure, IterativeImputer() had one Bayes regressor trained to predict the values of each variable based on the other variables in the dataset. With transform(), it uses the predictions of these Bayes models to impute the missing data.

IterativeImputer() can only impute missing data in numerical variables based on numerical variables. If you want to use categorical variables as input, you need to encode them first. However, keep in mind that it will only carry out regression. Hence it is not suitable to estimate missing data in discrete or categorical variables.

See also

To learn more about MICE, take a look at the following resources:

Estimating missing data with nearest neighbors

Imputation with K-Nearest Neighbors (KNN) involves estimating missing values in a dataset by considering the values of their nearest neighbors, where similarity between data points is determined based on a distance metric, such as the Euclidean distance. It assigns the missing value the average of the nearest neighbors’ values, weighted by their distance.

Consider the following data set containing 4 variables (columns) and 11 observations (rows). We want to impute the dark value in the fifth row of the second variable. First, we find the row’s k-nearest neighbors, where k=3 in our example, and they are highlighted by the rectangular boxes (middle panel). Next, we take the average value shown by the closest neighbors for variable 2.

Figure 1.11 – Diagram showing a value to impute (dark box), the three closest rows to the value to impute (square boxes), and the values considered to take the average for the imputation

Figure 1.11 – Diagram showing a value to impute (dark box), the three closest rows to the value to impute (square boxes), and the values considered to take the average for the imputation

The value for the imputation is given by (value1 × w1 + value2 × w2 + value3 × w3) / (w1 + w2 + w3), where w1, w2, and w3 are inversely proportional to the distance of each neighbor to the observation to impute, so that closer neighbors carry more weight.
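We can check the weighted average by hand on a tiny made-up array (the values are an assumption). The sketch below imputes one value with scikit-learn’s KNNImputer() using two neighbors and compares the result with a manual calculation that uses weights of 1/distance:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the first row has a missing value in the second variable.
X = np.array([
    [1.0, np.nan],     # observation to impute
    [0.0, 10.0],       # closest neighbor
    [3.0, 20.0],       # second closest neighbor
    [100.0, 1000.0],   # far away, not among the two nearest
])

imputer = KNNImputer(n_neighbors=2, weights="distance")
imputed = imputer.fit_transform(X)[0, 1]

# Manual calculation. Only the first variable is shared with the
# neighbors, so the distances are proportional to |1 - 0| and |1 - 3|;
# any common scaling of the distances cancels out in the ratio.
d = np.array([abs(1.0 - 0.0), abs(1.0 - 3.0)])
w = 1 / d
manual = (w * np.array([10.0, 20.0])).sum() / w.sum()

print(imputed, manual)
```

A uniform average of the two neighbors would give 15; the distance weighting pulls the estimate toward the closer neighbor’s value of 10.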

In this recipe, we will perform KNN imputation using scikit-learn.

How to do it...

To proceed with the recipe, let’s import the required libraries and prepare the data:

  1. Let’s import the required libraries, classes, and functions:
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import KNNImputer
  2. Let’s load the dataset described in the Technical requirements section (only some numerical variables):
    variables = [
        "A2", "A3", "A8", "A11", "A14", "A15", "target"]
    data = pd.read_csv(
        "credit_approval_uci.csv",
        usecols=variables,
    )
  3. Let’s divide the data into train and test sets:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop("target", axis=1),
        data["target"],
        test_size=0.3,
        random_state=0,
    )
  4. Let’s set up the imputer to replace missing data with the weighted mean of its closest five neighbors:
    imputer = KNNImputer(
        n_neighbors=5, weights="distance",
    ).set_output(transform="pandas")

Note

The replacement values can be calculated as the uniform mean of the k-nearest neighbors, by setting weights to uniform or as the weighted average, as we do in the recipe. The weight is based on the distance of the neighbor to the observation to impute. The nearest neighbors carry more weight.

  5. Fit the imputer so that it stores the train set, in which it will later search for the nearest neighbors:
    imputer.fit(X_train)
  6. Replace the missing values with the weighted mean of the values shown by the neighbors:
    X_train_t = imputer.transform(X_train)
    X_test_t = imputer.transform(X_test)

The result is a pandas DataFrame with the missing data replaced.

How it works...

In this recipe, we replaced missing data with the average value shown by each observation’s k-nearest neighbors. We set up KNNImputer() to find each observation’s five closest neighbors based on the Euclidean distance. The replacement values were estimated as the weighted average of the values shown by the five closest neighbors for the variable to impute. With transform(), the imputer calculated the replacement value and replaced the missing data.

Key benefits

  • Craft powerful features from tabular, transactional, and time-series data
  • Develop efficient and reproducible real-world feature engineering pipelines
  • Optimize data transformation and save valuable time
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

Streamline data preprocessing and feature engineering in your machine learning project with this third edition of the Python Feature Engineering Cookbook to make your data preparation more efficient. This guide addresses common challenges, such as imputing missing values and encoding categorical variables using practical solutions and open source Python libraries. You’ll learn advanced techniques for transforming numerical variables, discretizing variables, and dealing with outliers. Each chapter offers step-by-step instructions and real-world examples, helping you understand when and how to apply various transformations for well-prepared data. The book explores feature extraction from complex data types such as dates, times, and text. You’ll see how to create new features through mathematical operations and decision trees and use advanced tools like Featuretools and tsfresh to extract features from relational data and time series. By the end, you’ll be ready to build reproducible feature engineering pipelines that can be easily deployed into production, optimizing data preprocessing workflows and enhancing machine learning model performance.

What you will learn

  • Discover multiple methods to impute missing data effectively
  • Encode categorical variables while tackling high cardinality
  • Find out how to properly transform, discretize, and scale your variables
  • Automate feature extraction from date and time data
  • Combine variables strategically to create new and powerful features
  • Extract features from transactional data and time series
  • Learn methods to extract meaningful features from text data

Product Details

Publication date : Aug 30, 2024
Length : 396 pages
Edition : 3rd Edition
Language : English
ISBN-13 : 9781835883587

Table of Contents

14 Chapters
Preface
1. Chapter 1: Imputing Missing Data
2. Chapter 2: Encoding Categorical Variables
3. Chapter 3: Transforming Numerical Variables
4. Chapter 4: Performing Variable Discretization
5. Chapter 5: Working with Outliers
6. Chapter 6: Extracting Features from Date and Time Variables
7. Chapter 7: Performing Feature Scaling
8. Chapter 8: Creating New Features
9. Chapter 9: Extracting Features from Relational Data with Featuretools
10. Chapter 10: Creating Features from a Time Series with tsfresh
11. Chapter 11: Extracting Features from Text Variables
12. Index
13. Other Books You May Enjoy

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe Reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title.
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal).
Where can I access support around an eBook?
  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are priced lower than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.