Interpretable Machine Learning with Python - Second Edition

The preparations

You will find the code for this example here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python/tree/master/Chapter10/Mailer.ipynb.

Loading the libraries

To run this example, you need to install the following libraries:

  • mldatasets to load the dataset
  • pandas, numpy, and scipy to manipulate it
  • mlxtend, sklearn_genetic, xgboost, and sklearn (scikit-learn) to fit the models
  • matplotlib and seaborn to create and visualize the interpretations
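
If any of these are missing from your environment, a pip command along the following lines should install them. The PyPI names are assumptions on my part; in particular, the sklearn_genetic import is typically provided by the sklearn-genetic-opt distribution:

pip install mldatasets pandas numpy scipy mlxtend sklearn-genetic-opt \
            xgboost scikit-learn matplotlib seaborn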

To load the libraries, use the following code block:

import math
import os
import mldatasets
import pandas as pd
import numpy as np
import timeit
from tqdm.notebook import tqdm
from sklearn.feature_selection import VarianceThreshold,\
                                    mutual_info_classif, SelectKBest
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression,\
                                     LassoCV, LassoLarsCV, LassoLarsIC
from mlxtend.feature_selection...

Understanding the effect of irrelevant features

Feature selection is also known as variable or attribute selection. It is the method by which you automatically or manually select the subset of features that is most useful for constructing ML models.

It's not necessarily true that more features lead to better models. Irrelevant features can hinder the learning process, leading to overfitting. Therefore, we need strategies to remove any features that might adversely affect learning. Some of the advantages of selecting a smaller subset of features include the following:

  • It's easier to understand simpler models: For instance, feature importance for a model that uses 15 variables is much easier to grasp than one that uses 150 variables.
  • Shorter training time: Reducing the number of variables decreases the cost of computing and speeds up model training; perhaps most notably, simpler models also have quicker inference times.
  • Improved generalization by reducing overfitting: Sometimes...
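
To see the effect of irrelevant features concretely, here is a minimal, self-contained sketch (a toy synthetic dataset, not this chapter's mailer data) that appends pure-noise columns and compares train versus test accuracy; with the noise columns, the gap between the two typically widens:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset with only 5 informative features out of 10
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=42)
# Append 200 pure-noise columns that carry no signal at all
rng = np.random.default_rng(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])

for name, data in [('10 features', X), ('10 + 200 noise', X_noisy)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, random_state=42)
    mdl = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    print(name, '-> train:', round(mdl.score(X_tr, y_tr), 3),
          'test:', round(mdl.score(X_te, y_te), 3))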

Reviewing filter-based feature selection methods

Filter-based methods independently pick out features from a dataset without employing any ML. These methods depend only on the variables' characteristics and are relatively effective, computationally inexpensive, and quick to perform. Therefore, being the low-hanging fruit of feature selection methods, they are usually the first step in any feature selection pipeline.

Two kinds of filter-based methods exist:

  • Univariate: These evaluate and rate a single feature at a time, individually and independently of the rest of the feature space. One problem that can occur with univariate methods is that they may filter out too much since they don't take the relationships between features into consideration.
  • Multivariate: These take the entire feature space into account, including how the features within it interact with each other.

Overall, for the removal of obsolete, redundant, constant, duplicated, and uncorrelated features, filter methods are very strong. However...
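
As a minimal sketch of the two kinds described above, reusing classes imported in the earlier code block: X_train and y_train are placeholders for training data (with y_train assumed to be a classification target here, hence mutual_info_classif), and k=20 and the 0.9 correlation cutoff are arbitrary choices:

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold,\
                                      SelectKBest, mutual_info_classif

# Univariate filters: drop constant features, then keep the 20 features
# with the highest mutual information with the target
X_train_vt = VarianceThreshold(threshold=0.0).fit_transform(X_train)
kbest = SelectKBest(mutual_info_classif, k=20)
X_train_mi = kbest.fit_transform(X_train_vt, y_train)

# Multivariate filter: flag one of each pair of features whose absolute
# pairwise correlation exceeds an (arbitrary) 0.9 cutoff
corr = pd.DataFrame(X_train_mi).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]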

Exploring embedded feature selection methods

Embedded methods live within the models themselves, as these models naturally select features during training. You can leverage the intrinsic properties of any model that has them to capture the features it selected:

  • Tree-based models: For instance, we have used the following code many times to count the number of features used by the RF models, which is evidence of feature selection naturally occurring in the learning process:
              sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)

XGBoost's RF uses gain by default to compute feature importance, which is the average decrease in error across all the splits where the feature was used. We can raise the threshold above 0 to select even fewer features according to this relative contribution. However, by constraining the trees' depth, we had already forced the model to choose fewer features.

  • Regularized models with coefficients: We will study this further in Chapter 12, Monotonic...
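
Returning to the tree-based bullet above, here is a hedged sketch of capturing such an embedded selection with SelectFromModel (imported in the earlier code block). The XGBRFRegressor hyperparameters and the 0.001 importance threshold are illustrative rather than this chapter's exact settings, and X_train/y_train are placeholders:

from xgboost import XGBRFRegressor
from sklearn.feature_selection import SelectFromModel

# Fit XGBoost's random forest with constrained depth, then keep only
# the features whose (gain-based) importance exceeds the threshold
rf_mdl = XGBRFRegressor(max_depth=4, n_estimators=200, random_state=42)
embedded = SelectFromModel(rf_mdl, threshold=0.001)
embedded.fit(X_train, y_train)
mask = embedded.get_support()            # boolean mask of retained features
X_train_emb = embedded.transform(X_train)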

Discovering wrapper, hybrid, and advanced feature selection methods

The feature selection methods studied so far are computationally inexpensive because they either require no model fitting at all or fit only simpler white-box models. In this section, we will learn about other, more exhaustive methods with many possible tuning options. The categories of methods included here are as follows:

  • Wrapper: Exhaustively look for the best subset of features by fitting an ML model using a search strategy that measures improvement on a metric.
  • Hybrid: A method that combines embedded and filter methods with wrapper methods.
  • Advanced: A method that doesn't fall into any of the previously discussed categories. Examples include dimensionality reduction, model-agnostic feature importance, and genetic algorithms (GAs).

And now, let's get started with wrapper methods!

Wrapper methods

The concept behind wrapper methods is reasonably simple: evaluate different subsets of features on the ML model and choose the one that achieves...
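
As a hedged sketch of one such wrapper, mlxtend (which this chapter imports from) provides SequentialFeatureSelector; the LinearRegression estimator, k_features=10, and the scoring choice below are placeholders rather than the chapter's configuration, and X_train/y_train stand in for the training data:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# Sequential Forward Selection: greedily add the feature that improves
# cross-validated performance the most until k_features are selected;
# forward=False would turn this into Sequential Backward Selection (SBS)
sfs = SFS(LinearRegression(), k_features=10, forward=True, floating=False,
          scoring='neg_root_mean_squared_error', cv=3)
sfs = sfs.fit(X_train, y_train)
print(sfs.k_feature_idx_)   # indices of the selected feature subset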

Hybrid methods

Starting with 435 features, there are over 10^42 possible combinations of 27-feature subsets alone! So, you can see how EFS would be impractical on such a large feature space. Therefore, except for EFS on the entire dataset, wrapper methods will invariably take some shortcuts to select the features. Whether you are going forward, backward, or both, as long as you are not assessing every single combination of features, you could easily miss out on the best one.

However, we can combine the more rigorous, exhaustive search approach of wrapper methods with the efficiency of filter and embedded methods. The result is hybrid methods. For instance, you could employ filter or embedded methods to derive only the top 10 features and perform EFS or SBS on just those, as sketched below.
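
A hedged sketch of that hybrid idea, assuming X_train is a NumPy array, top10_cols is a placeholder list of the column indices chosen by a cheaper filter or embedded step, and the estimator and scoring are illustrative:

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LinearRegression

# EFS is only tractable on the small candidate pool that a cheaper
# filter or embedded step has already produced (top10_cols below)
efs = EFS(LinearRegression(), min_features=3, max_features=10,
          scoring='neg_root_mean_squared_error', cv=3)
efs = efs.fit(X_train[:, top10_cols], y_train)
print(efs.best_idx_)   # best subset, as indices within the reduced pool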

Recursive feature elimination

Another, more common approach is similar to SBS, but instead of removing features based solely on the improvement of a metric, it uses the model's intrinsic parameters to rank the features...
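
scikit-learn's RFE class implements this recursive feature elimination idea; the sketch below is illustrative (the estimator, n_features_to_select=20, and step=5 are placeholders, not this chapter's exact configuration):

from sklearn.feature_selection import RFE
from xgboost import XGBRFRegressor

# Recursively fit the estimator, rank features by importance, and prune
# the weakest `step` features per round until 20 remain
rfe = RFE(XGBRFRegressor(max_depth=4), n_features_to_select=20, step=5)
rfe.fit(X_train, y_train)
print(rfe.support_)    # boolean mask of the retained features
print(rfe.ranking_)    # 1 = selected; larger values were dropped earlier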

Considering feature engineering

Let's assume that the non-profit has chosen to use the model whose features were selected with Lasso LARS with AIC (e-llarsic) but would like to evaluate whether you can improve it further. Now that you have removed over 300 features that might have only marginally improved predictive performance but mostly added noise, you are left with more relevant features. However, you also know that the 8 features selected by e-llars produced roughly the same RMSE as the 111 features. This means that while there's something in those extra features that improves profitability, it does not improve the RMSE.

From a feature selection standpoint, many things can be done to approach this problem. For instance, you could examine the overlap and differences between the features selected by e-llarsic and e-llars, and run feature selection variations strictly on those features to see whether the RMSE dips for any combination while keeping or improving on current profitability. However, there...
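
For the first of those ideas, the overlap and differences between the two selections reduce to simple set operations; the column names below are hypothetical stand-ins for the lists each method actually selected:

# Hypothetical selections - replace with the actual lists of column
# names chosen by e-llarsic and e-llars, respectively
e_llarsic_cols = ['income', 'avg_gift', 'last_gift', 'months_since_gift']
e_llars_cols = ['income', 'avg_gift']

shared = set(e_llarsic_cols) & set(e_llars_cols)   # in both selections
extra = set(e_llarsic_cols) - set(e_llars_cols)    # only in e-llarsic
print(f'{len(shared)} shared, {len(extra)} unique to e-llarsic')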

Mission accomplished

To approach this mission, you have reduced overfitting primarily using the feature selection toolset. The non-profit is pleased with a profit lift of roughly 30%, costing a total of $35,601, which is $30,000 less than it would cost to send the mailer to everyone in the test dataset. However, they still want assurance that they can safely employ this model without worrying that it will produce losses.

In this chapter, we've examined how overfitting can cause the profitability curves not to align. Misalignment is critical because it could mean that a threshold chosen based on training data would not be reliable on out-of-sample data. So, you use compare_df_plots to compare profitability between the test and train sets as you've done before, but this time for the chosen model (rf_5_e-llarsic):

profits_test = reg_mdls['rf_5_e-llarsic']['profits_test']
profits_train = reg_mdls['rf_5_e-llarsic']['profits_train']

Summary

In this chapter, we have learned about how irrelevant features impact model outcomes and how feature selection provides a toolset to solve this problem. We then explored many different methods in this toolset, from the most basic filter methods to the most advanced ones. Lastly, we broached the subject of feature engineering for interpretability. Feature engineering can make for a more interpretable model that will perform better. We will cover this topic in more detail in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability. In the next chapter, we will discuss methods for bias mitigation and causal inference.
