Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback
Published in Jan 2020
Publisher Packt
ISBN-13 9781789806311
Length 372 pages
Edition 1st Edition
Author: Soledad Galli

Adding a missing value indicator variable

A missing indicator is a binary variable that specifies whether a value was missing for an observation (1) or not (0). It is common practice to replace missing observations with the mean, median, or mode while flagging those observations with a missing indicator. This covers two angles: if the data was missing at random, the mean, median, or mode imputation accounts for it, and if it wasn't, the missing indicator captures that information. In this recipe, we will learn how to add missing indicators using NumPy, scikit-learn, and Feature-engine.
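The combination described above can be sketched on a toy dataframe. This is a minimal illustration, not part of the recipe: the income variable and its values are made up, and in practice the median would be learned from the train set only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numerical variable with two missing values.
df = pd.DataFrame({"income": [30.0, np.nan, 50.0, 70.0, np.nan]})

# Flag the missing observations first, before imputation overwrites them.
df["income_NA"] = np.where(df["income"].isnull(), 1, 0)

# Replace the missing values with the median of the observed values.
median = df["income"].median()
df["income"] = df["income"].fillna(median)

print(df)
```

The indicator has to be created before the imputation step; afterwards, the information about which values were missing is no longer recoverable from the imputed variable.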

Getting ready

For an example of the implementation of missing indicators, along with mean imputation, check out the Winning the KDD Cup Orange Challenge with Ensemble Selection article, which describes the winning solution of the KDD Cup 2009: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf.

How to do it...

Let's begin by importing the required packages and preparing the data:

  1. Let's import the required libraries, functions and classes:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import MissingIndicator
from feature_engine.missing_data_imputers import AddNaNBinaryImputer
  2. Let's load the dataset:
data = pd.read_csv('creditApprovalUCI.csv')
  3. Let's separate the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('A16', axis=1), data['A16'], test_size=0.3,
    random_state=0)
  4. Using NumPy, we'll add a missing indicator to the numerical and categorical variables in a loop:
for var in ['A1', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']:
    # 1 if the value is missing, 0 otherwise
    X_train[var + '_NA'] = np.where(X_train[var].isnull(), 1, 0)
    X_test[var + '_NA'] = np.where(X_test[var].isnull(), 1, 0)
Note how we name the new missing indicators using the original variable name, plus _NA.
  5. Let's inspect the result of the preceding code block:
X_train.head()

The newly added variables appear at the end of the dataframe.

The mean of each new variable should equal the fraction of missing values in the corresponding original variable, which you can corroborate by executing X_train['A3'].isnull().mean(), X_train['A3_NA'].mean().
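The check above can be reproduced on a small, self-contained example. The values below are made up for illustration and only the variable name A3 is borrowed from the recipe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the A3 variable (hypothetical values): 3 of 8 are missing.
X = pd.DataFrame({"A3": [1.5, np.nan, 3.0, np.nan, 2.0, 4.5, np.nan, 0.5]})
X["A3_NA"] = np.where(X["A3"].isnull(), 1, 0)

# Fraction of missing values in the original variable.
fraction_missing = X["A3"].isnull().mean()

# Mean of the 0/1 indicator.
indicator_mean = X["A3_NA"].mean()

print(fraction_missing, indicator_mean)
```

Both numbers agree because the mean of a 0/1 variable is the proportion of ones, and the indicator is 1 exactly where the original value is missing.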

Now, let's add missing indicators using Feature-engine instead. First, we need to load and divide the data, just like we did in step 2 and step 3 of this recipe.

  6. Next, let's set up a transformer that will add binary indicators to all the variables in the dataset using AddNaNBinaryImputer() from Feature-engine:
imputer = AddNaNBinaryImputer()
We can specify the variables that should have missing indicators by passing the variable names in a list: imputer = AddNaNBinaryImputer(variables=['A2', 'A3']). Otherwise, the imputer will add indicators to all the variables.
  7. Let's fit AddNaNBinaryImputer() to the train set:
imputer.fit(X_train)
  8. Finally, let's add the missing indicators:
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
We can inspect the result using X_train.head(); it should be similar to the output of step 5 in this recipe.

We can also add missing indicators using scikit-learn's MissingIndicator() class. To do this, we need to load and divide the dataset, just like we did in step 2 and step 3.

  9. Next, we'll set up a MissingIndicator(). Here, we will add indicators only to variables with missing data:
indicator = MissingIndicator(features='missing-only')
  10. Let's fit the transformer so that it finds the variables with missing data in the train set:
indicator.fit(X_train) 

Now, we can concatenate the missing indicators that were created by MissingIndicator() to the train set.

  11. First, let's create a column name for each of the new missing indicators with a list comprehension:
indicator_cols = [c+'_NA' for c in X_train.columns[indicator.features_]]
The features_ attribute contains the indices of the features for which missing indicators will be added. If we pass these indices to the train set column array, we can get the variable names.
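  12. Next, let's concatenate the original train set with the missing indicators, which we obtain using the transform method:
X_train = pd.concat([
    # drop=True discards the old index instead of adding it as a column
    X_train.reset_index(drop=True),
    pd.DataFrame(indicator.transform(X_train),
                 columns=indicator_cols)], axis=1)
Scikit-learn transformers return NumPy arrays, so to concatenate the indicators to the train set, we must first cast the array as a dataframe using pandas DataFrame(). The new dataframe has a default integer index, which is why we reset the index of the train set before concatenating: otherwise, the rows would not align.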

The result of the preceding code block should contain the original variables, plus the indicators.
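The scikit-learn steps above can be tried end to end on a toy train set. This is a sketch under stated assumptions: the data is made up, only numerical variables are used, and instead of resetting the index it reuses the index of the train set for the indicator dataframe, which is an alternative way to keep the rows aligned.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Toy train set (hypothetical) with a shuffled index, as after a split.
X_train = pd.DataFrame(
    {
        "A2": [1.0, np.nan, 3.0],
        "A3": [0.5, 1.5, 2.5],
        "A8": [np.nan, 0.1, 0.2],
    },
    index=[10, 4, 7],
)

# Find the variables with missing data in the train set.
indicator = MissingIndicator(features="missing-only")
indicator.fit(X_train)

# Name each indicator after its original variable, plus _NA.
indicator_cols = [c + "_NA" for c in X_train.columns[indicator.features_]]

# transform() returns a Boolean array; cast it to 0/1 and keep the
# original index so that the rows align in the concatenation.
indicators = pd.DataFrame(
    indicator.transform(X_train).astype(int),
    columns=indicator_cols,
    index=X_train.index,
)
X_train = pd.concat([X_train, indicators], axis=1)

print(X_train)
```

Keeping the original index can be convenient when the target is stored in a separate series that still carries that index.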

How it works...

In this recipe, we added missing value indicators to categorical and numerical variables in the Credit Approval Data Set using NumPy, scikit-learn, and Feature-engine. To add missing indicators using NumPy, we used the np.where() function, which scanned all the observations in a variable and created a new vector, assigning the value 1 if the observation was missing and 0 otherwise. We captured the indicators in columns with the name of the original variable, plus _NA.

To add a missing indicator with Feature-engine, we created an instance of AddNaNBinaryImputer() and fitted it to the train set. Then, we used the transform() method to add missing indicators to the train and test sets. Finally, to add missing indicators with scikit-learn, we created an instance of MissingIndicator() so that we only added indicators to variables with missing data. With the fit() method, the transformer identified the variables with missing values. With transform(), it returned a NumPy array of Boolean indicators, which we captured in a dataframe and then concatenated to the original dataframe.
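The fit() and transform() pattern shared by both libraries can be illustrated with a small hand-written class. This is only a sketch of the idea: MissingFlagAdder is a hypothetical helper written for this example, not the implementation used by Feature-engine or scikit-learn.

```python
import numpy as np
import pandas as pd


class MissingFlagAdder:
    """Hypothetical helper for illustration; not part of any library."""

    def fit(self, X):
        # Learn, from the train set, which variables contain missing values.
        self.variables_ = [c for c in X.columns if X[c].isnull().any()]
        return self

    def transform(self, X):
        # Add one 0/1 indicator per variable learned during fit().
        X = X.copy()
        for var in self.variables_:
            X[var + "_NA"] = np.where(X[var].isnull(), 1, 0)
        return X


# Toy train and test sets (made-up values).
train = pd.DataFrame({"A2": [1.0, np.nan, 3.0], "A3": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"A2": [4.0, 5.0], "A3": [np.nan, 1.0]})

adder = MissingFlagAdder().fit(train)
train_t = adder.transform(train)
test_t = adder.transform(test)

print(list(train_t.columns), list(test_t.columns))
```

Because the variables are learned from the train set, the test set receives the same indicator columns even when its own pattern of missing data differs, which keeps both sets compatible with a model trained on the train set.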

There's more...

We can also add missing indicators using scikit-learn's SimpleImputer() by setting the add_indicator argument to True. For example, after calling fit() and transform(), imputer = SimpleImputer(strategy='mean', add_indicator=True) will return a NumPy array in which the missing values of the original variables are replaced by the mean, followed by the missing indicators.
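The following sketch shows what that array looks like on a toy dataset. The values are made up, and only numerical variables are used because the mean strategy does not apply to categorical data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): A2 has one missing value, A3 has none.
X = pd.DataFrame({"A2": [1.0, np.nan, 3.0], "A3": [4.0, 5.0, 6.0]})

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_t = imputer.fit_transform(X)

# The first two columns are the imputed A2 and A3; the last column is
# the missing indicator for A2, the only variable with missing data.
print(X_t)
```

The returned array has no column names, so to keep working with a dataframe it needs to be cast back with pandas DataFrame(), as we did for MissingIndicator().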

See also
