Feature Selection and Engineering for Interpretability

In the first three chapters, we discussed how complexity hinders Machine Learning (ML) interpretability. There’s a trade-off because you may need some complexity to maximize predictive performance, yet not to the extent that you cannot rely on the model to satisfy the tenets of interpretability: fairness, accountability, and transparency. This chapter is the first of four focused on how to tune for interpretability. One of the easiest ways to improve interpretability is through feature selection. It has many benefits, such as faster training and models that are easier to interpret. But if these two reasons don’t convince you, perhaps another one will.

A common misunderstanding is that complex models can self-select features and perform well nonetheless, so why even bother to select features? Yes, many model classes have mechanisms that can take care of useless features, but they aren’t perfect. And the...

Technical requirements

This chapter’s example uses the mldatasets, pandas, numpy, scipy, mlxtend, sklearn-genetic-opt, xgboost, sklearn, matplotlib, and seaborn libraries. Instructions on how to install all these libraries are in the Preface.

The GitHub code for this chapter is located here: https://packt.link/1qP4P.

The mission

It has been estimated that there are over 10 million non-profits worldwide, and while a large portion of them receive public funding, most depend primarily on private donors, both corporate and individual, to continue operations. As such, fundraising is mission-critical and carried out throughout the year.

Year over year, donation revenue has grown, but non-profits face several problems: donor interests evolve, so a charity popular one year might be forgotten the next; competition between non-profits is fierce; and demographics are shifting. In the United States, the average donor gives only two charitable gifts per year and is over 64 years old. Identifying potential donors is challenging, and campaigns to reach them can be expensive.

A National Veterans Organization non-profit arm has a large mailing list of about 190,000 past donors and would like to send a special mailer to ask for donations. However, even with a special bulk discount rate, it costs...

The approach

You’ve decided to first fit a base model with all the features and assess it at different levels of complexity to understand the relationship between the increased number of features and the propensity for the predictive model to overfit to the training data. Then, you will employ a series of feature selection methods ranging from simple filter-based methods to the most advanced ones to determine which one achieves the profitability and reliability goals sought by the client. Lastly, once a list of final features has been selected, you can try feature engineering.

Given the cost-sensitive nature of the problem, thresholds are important for optimizing the profit lift. We will get into the role of thresholds later on, but one significant effect is that even though this is a classification problem, it is best to use regression models and then use the predictions to classify, so that there’s only one threshold to tune. That is, for classification models, you would...
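To make the single-threshold idea concrete, here is a minimal sketch of a profit-maximizing threshold sweep over a regression model’s predicted donation amounts. All values and names here (y_pred, y_true, mailer_cost) are hypothetical placeholders, not the chapter’s actual figures:

import numpy as np

# Hypothetical predicted donation amounts from a regression model,
# the actual donations, and an assumed cost per mailer (illustrative only)
y_pred = np.array([0.5, 2.1, 7.3, 0.0, 12.4, 3.2])
y_true = np.array([0.0, 3.0, 5.0, 0.0, 15.0, 0.0])
mailer_cost = 0.68

best_threshold, best_profit = None, -np.inf
for threshold in np.arange(0.0, y_pred.max(), 0.25):
    send = y_pred > threshold  # classify: mail only those predicted above the threshold
    profit = y_true[send].sum() - mailer_cost * send.sum()
    if profit > best_profit:
        best_threshold, best_profit = threshold, profit

print(f"best threshold: {best_threshold:.2f}, profit: ${best_profit:.2f}")

Because the predictions are donation amounts, the only decision left to tune is this single cutoff.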

The preparations

The code for this example can be found at https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python-2E/blob/main/10/Mailer.ipynb.

Loading the libraries

To run this example, we need to install the following libraries:

  • mldatasets to load the dataset
  • pandas, numpy, and scipy to manipulate it
  • mlxtend, sklearn-genetic-opt, xgboost, and sklearn (scikit-learn) to fit the models
  • matplotlib and seaborn to create and visualize the interpretations

To load the libraries, use the following code block:

import math
import os
import mldatasets
import pandas as pd
import numpy as np
import timeit
from tqdm.notebook import tqdm
from sklearn.feature_selection import VarianceThreshold,\
                                    mutual_info_classif, SelectKBest
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression,\
                                    LassoCV, LassoLarsCV...

Understanding the effect of irrelevant features

Feature selection is also known as variable or attribute selection. It is the method by which you automatically or manually select a subset of features that are useful for constructing ML models.

It’s not necessarily true that more features lead to better models. Irrelevant features can hinder the learning process and lead to overfitting. Therefore, we need some strategies to remove any features that might adversely affect learning; a quick illustration of this effect follows the list below. Some of the advantages of selecting a smaller subset of features include the following:

  • It’s easier to understand simpler models: For instance, feature importance for a model that uses 15 variables is much easier to grasp than one that uses 150 variables.
  • Shorter training time: Reducing the number of variables decreases the cost of computing, speeds up model training, and, perhaps most notably, yields quicker inference times.
  • Improved generalization...
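As the quick illustration promised above, here is a sketch on synthetic data (not the chapter’s dataset) showing how appending pure-noise features tends to widen the gap between training and test performance:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the first 10 columns are informative, the remaining 90 are pure noise
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for n_cols, label in [(10, "informative features only"), (100, "with 90 noise features")]:
    mdl = RandomForestClassifier(random_state=42).fit(X_train[:, :n_cols], y_train)
    print(label,
          "| train:", round(mdl.score(X_train[:, :n_cols], y_train), 3),
          "| test:", round(mdl.score(X_test[:, :n_cols], y_test), 3))

Typically, the training score stays near perfect in both cases while the test score drops once the noise columns are included, which is the overfitting pattern described above.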

Reviewing filter-based feature selection methods

Filter-based methods independently select features from a dataset without employing any ML. These methods depend only on the variables’ characteristics and are relatively effective, computationally inexpensive, and quick to perform. Therefore, being the low-hanging fruit of feature selection methods, they are usually the first step in any feature selection pipeline.

Filter-based methods can be categorized as:

  • Univariate: Individually and independently of the feature space, they evaluate and rate a single feature at a time. One problem that can occur with univariate methods is that they may filter out too much because they don’t take the relationships between features into account (see the sketch after this list).
  • Multivariate: These take into account the entire feature space and how features interact with each other.
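Here is a minimal sketch of univariate filtering using classes the chapter imports (VarianceThreshold and SelectKBest with mutual_info_classif). The data and column names are synthetic placeholders, not the mailer dataset:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Synthetic stand-in data with generic column names
X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(30)])

# Step 1: drop constant (zero-variance) features
vt = VarianceThreshold(threshold=0.0).fit(X)
X_vt = X.loc[:, vt.get_support()]

# Step 2: keep the 10 features with the highest mutual information with the target
skb = SelectKBest(mutual_info_classif, k=10).fit(X_vt, y)
print(list(X_vt.columns[skb.get_support()]))

Both steps rate each feature on its own, which is what makes them univariate: fast and inexpensive, but blind to feature interactions.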

Overall, for the removal of obsolete, redundant, constant, duplicated, and uncorrelated features, filter...

Exploring embedded feature selection methods

Embedded methods are built into the models themselves: they naturally select features during training. You can leverage these intrinsic properties, in any model that has them, to capture the selected features:

  • Tree-based models: For instance, we have used the following code many times to count the number of features used by the RF models, which is evidence of feature selection naturally occurring in the learning process:
    sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)
    

    XGBoost’s RF uses gain by default to compute feature importance, which is the average decrease in error across all splits where the feature was used. We can raise the threshold above 0 to select even fewer features according to their relative contribution. However, by constraining the trees’ depth, we have already forced the model to choose fewer features.

  • Regularized models with coefficients: We will...
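As a generic sketch of the embedded approach for regularized models, using SelectFromModel and LassoCV (both imported earlier), on synthetic data rather than the mailer dataset:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic regression data where only a handful of features are truly informative
X, y = make_regression(n_samples=400, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# LassoCV's L1 penalty shrinks uninformative coefficients to (near) zero;
# SelectFromModel keeps only the features whose coefficients survive it
selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)
print("features kept:", selector.get_support().sum(), "out of", X.shape[1])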

Discovering wrapper, hybrid, and advanced feature selection methods

The feature selection methods studied so far are computationally inexpensive because they require either no model fitting or only the fitting of simpler white-box models. In this section, we will learn about other, more exhaustive methods with many possible tuning options. The categories of methods included here are as follows:

  • Wrapper: Exhaustively searches for the best subset of features by fitting an ML model using a search strategy that measures improvement on a metric.
  • Hybrid: A method that combines embedded and filter methods with wrapper methods.
  • Advanced: A method that doesn’t fall into any of the previously discussed categories. Examples include dimensionality reduction, model-agnostic feature importance, and Genetic Algorithms (GAs).

And now, let’s get started with wrapper methods!

Wrapper methods

The concept behind wrapper methods is reasonably simple: evaluate different...
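To illustrate the wrapper idea, here is a sketch of sequential forward selection with mlxtend (one of the chapter’s listed libraries). The estimator, scoring metric, and subset size are arbitrary choices for illustration, not the chapter’s configuration, and the data is synthetic:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Forward selection: greedily add the feature that most improves cross-validated accuracy
sfs = SFS(LogisticRegression(max_iter=1000),
          k_features=5,      # stop once 5 features have been selected
          forward=True,
          floating=False,
          scoring='accuracy',
          cv=3).fit(X, y)
print("selected feature indices:", sfs.k_feature_idx_)
print("cross-validated score:", round(sfs.k_score_, 3))

Every candidate subset requires fitting and cross-validating the model, which is why wrapper methods are far more expensive than filter or embedded methods.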

Considering feature engineering

Let’s assume that the non-profit has chosen to use the model whose features were selected with LASSO LARS with AIC (e-llarsic) but would like to evaluate whether you can improve it further. Now that you have removed over 300 features that might have only marginally improved predictive performance but mostly added noise, you are left with more relevant features. However, you also know that the 8 features selected by e-llars produced the same RMSE as the 111 features. This means that while there’s something in those extra features that improves profitability, it does not improve the RMSE.

From a feature selection standpoint, many things can be done to approach this problem. For instance, examine the overlap and difference of features between e-llarsic and e-llars, and do feature selection variations strictly on those features to see whether the RMSE dips on any combination while keeping or improving on current profitability. However...
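A quick way to examine that overlap and difference is with Python set operations. This sketch assumes the selected feature names for each method are available as lists; the names below are made-up placeholders:

# Hypothetical feature-name lists standing in for the two selections
llarsic_feats = ['feat_a', 'feat_b', 'feat_c', 'feat_d', 'feat_e']
llars_feats = ['feat_a', 'feat_c', 'feat_f']

common = set(llarsic_feats) & set(llars_feats)  # features both methods kept
extra = set(llarsic_feats) - set(llars_feats)   # features only e-llarsic kept
print("shared:", sorted(common))
print("only in e-llarsic:", sorted(extra))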

Mission accomplished

To approach this mission, you have reduced overfitting using primarily the toolset of feature selection. The non-profit is pleased with a profit lift of roughly 30%, costing a total of $35,601, which is $30,000 less than it would cost to send everyone in the test dataset the mailer. However, they still want assurance that they can safely employ this model without worries that they’ll experience losses.

In this chapter, we’ve examined how overfitting can cause the profitability curves not to align. Misalignment is critical because it could mean that choosing a threshold based on training data would not be reliable on out-of-sample data. So, you use compare_df_plots to compare profitability between the test and train sets as you’ve done before, but this time, for the chosen model (rf_5_e-llarsic):

profits_test = reg_mdls['rf_5_e-llarsic']['profits_test']
profits_train = reg_mdls['rf_5_e-llarsic']['profits_train...

Summary

In this chapter, we have learned about how irrelevant features impact model outcomes and how feature selection provides a toolset to solve this problem. We then explored many different methods in this toolset, from the most basic filter methods to the most advanced ones. Lastly, we broached the subject of feature engineering for interpretability. Feature engineering can make for a more interpretable model that will perform better. We will cover this topic in more detail in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability.

In the next chapter, we will discuss methods for bias mitigation and causal inference.

Dataset sources

Further reading

Learn more on Discord

To join the Discord community for this book – where you...
