Feature Selection and Engineering for Interpretability

In the first three chapters, we discussed how complexity hinders Machine Learning (ML) interpretability. There’s a trade-off because you may need some complexity to maximize predictive performance, yet not to the extent that you cannot rely on the model to satisfy the tenets of interpretability: fairness, accountability, and transparency. This chapter is the first of four focused on how to tune for interpretability. One of the easiest ways to improve interpretability is through feature selection. It has many benefits, such as faster training and models that are easier to interpret. But if these two reasons don’t convince you, perhaps another one will.

A common misunderstanding is that complex models can self-select features and perform well nonetheless, so why even bother to select features? Yes, many model classes have mechanisms that can take care of useless features, but they aren’t perfect. And the...

Technical requirements

This chapter’s example uses the mldatasets, pandas, numpy, scipy, mlxtend, sklearn-genetic-opt, xgboost, sklearn, matplotlib, and seaborn libraries. Instructions on how to install all these libraries are in the Preface.

The GitHub code for this chapter is located here: https://packt.link/1qP4P.

The mission

It has been estimated that there are over 10 million non-profits worldwide, and while a large portion of them receive public funding, most depend primarily on private donors, both corporate and individual, to continue operations. As such, fundraising is mission-critical and carried out throughout the year.

Year over year, donation revenue has grown, but non-profits face several problems: donor interests evolve, so a charity popular one year might be forgotten the next; competition between non-profits is fierce; and demographics are shifting. In the United States, the average donor gives only two charitable gifts per year and is over 64 years old. Identifying potential donors is challenging, and campaigns to reach them can be expensive.

A National Veterans Organization non-profit arm has a large mailing list of about 190,000 past donors and would like to send a special mailer to ask for donations. However, even with a special bulk discount rate, it costs...

The approach

You’ve decided to first fit a base model with all the features and assess it at different levels of complexity to understand the relationship between the increased number of features and the propensity for the predictive model to overfit to the training data. Then, you will employ a series of feature selection methods ranging from simple filter-based methods to the most advanced ones to determine which one achieves the profitability and reliability goals sought by the client. Lastly, once a list of final features has been selected, you can try feature engineering.

Given the cost-sensitive nature of the problem, thresholds are important for optimizing the profit lift. We will get into the role of thresholds later on, but one significant effect is that even though this is a classification problem, it is best to use regression models and then use the predictions to classify, so that there’s only one threshold to tune. That is, for classification models, you would...
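To make the single-threshold idea concrete, here is a minimal sketch of a profit-maximizing threshold sweep over a regression model’s predicted donation amounts. All values and names here (y_pred, y_true, mailer_cost) are hypothetical placeholders, not the chapter’s actual figures:

import numpy as np

# Hypothetical predicted donation amounts from a regression model,
# the actual donations, and an assumed cost per mailer (illustrative only)
y_pred = np.array([0.5, 2.1, 7.3, 0.0, 12.4, 3.2])
y_true = np.array([0.0, 3.0, 5.0, 0.0, 15.0, 0.0])
mailer_cost = 0.68

best_threshold, best_profit = None, -np.inf
for threshold in np.arange(0.0, y_pred.max(), 0.25):
    send = y_pred > threshold  # classify: mail only those predicted above the threshold
    profit = y_true[send].sum() - mailer_cost * send.sum()
    if profit > best_profit:
        best_threshold, best_profit = threshold, profit

print(f"best threshold: {best_threshold:.2f}, profit: ${best_profit:.2f}")

Because the predictions are donation amounts, the only decision left to tune is this single cutoff.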

The preparations

The code for this example can be found at https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python-2E/blob/main/10/Mailer.ipynb.

Loading the libraries

To run this example, we need to install the following libraries:

  • mldatasets to load the dataset
  • pandas, numpy, and scipy to manipulate it
  • mlxtend, sklearn-genetic-opt, xgboost, and sklearn (scikit-learn) to fit the models
  • matplotlib and seaborn to create and visualize the interpretations

To load the libraries, use the following code block:

import math
import os
import mldatasets
import pandas as pd
import numpy as np
import timeit
from tqdm.notebook import tqdm
from sklearn.feature_selection import VarianceThreshold,\
                                    mutual_info_classif, SelectKBest
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression,\
                                    LassoCV, LassoLarsCV...

Understanding the effect of irrelevant features

Feature selection is also known as variable or attribute selection. It is the method by which you automatically or manually select a subset of features that are useful for constructing ML models.

It’s not necessarily true that more features lead to better models. Irrelevant features can hinder the learning process and lead to overfitting. Therefore, we need some strategies to remove any features that might adversely affect learning; a quick illustration of this effect follows the list below. Some of the advantages of selecting a smaller subset of features include the following:

  • It’s easier to understand simpler models: For instance, feature importance for a model that uses 15 variables is much easier to grasp than one that uses 150 variables.
  • Shorter training time: Reducing the number of variables decreases the cost of computing, speeds up model training, and, perhaps most notably, yields quicker inference times.
  • Improved generalization...
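As the quick illustration promised above, here is a sketch on synthetic data (not the chapter’s dataset) showing how appending pure-noise features tends to widen the gap between training and test performance:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the first 10 columns are informative, the remaining 90 are pure noise
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for n_cols, label in [(10, "informative features only"), (100, "with 90 noise features")]:
    mdl = RandomForestClassifier(random_state=42).fit(X_train[:, :n_cols], y_train)
    print(label,
          "| train:", round(mdl.score(X_train[:, :n_cols], y_train), 3),
          "| test:", round(mdl.score(X_test[:, :n_cols], y_test), 3))

Typically, the training score stays near perfect in both cases while the test score drops once the noise columns are included, which is the overfitting pattern described above.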

Reviewing filter-based feature selection methods

Filter-based methods independently select features from a dataset without employing any ML. These methods depend only on the variables’ characteristics and are relatively effective, computationally inexpensive, and quick to perform. Therefore, being the low-hanging fruit of feature selection methods, they are usually the first step in any feature selection pipeline.

Filter-based methods can be categorized as:

  • Univariate: Individually and independently of the feature space, they evaluate and rate a single feature at a time. One problem that can occur with univariate methods is that they may filter out too much because they don’t take the relationships between features into account (see the sketch after this list).
  • Multivariate: These take into account the entire feature space and how features interact with each other.
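Here is a minimal sketch of univariate filtering using classes the chapter imports (VarianceThreshold and SelectKBest with mutual_info_classif). The data and column names are synthetic placeholders, not the mailer dataset:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

# Synthetic stand-in data with generic column names
X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(30)])

# Step 1: drop constant (zero-variance) features
vt = VarianceThreshold(threshold=0.0).fit(X)
X_vt = X.loc[:, vt.get_support()]

# Step 2: keep the 10 features with the highest mutual information with the target
skb = SelectKBest(mutual_info_classif, k=10).fit(X_vt, y)
print(list(X_vt.columns[skb.get_support()]))

Both steps rate each feature on its own, which is what makes them univariate: fast and inexpensive, but blind to feature interactions.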

Overall, for the removal of obsolete, redundant, constant, duplicated, and uncorrelated features, filter...

Exploring embedded feature selection methods

Embedded methods are built into the models themselves: they naturally select features during training. You can leverage these intrinsic properties, in any model that has them, to capture the selected features:

  • Tree-based models: For instance, we have used the following code many times to count the number of features used by the RF models, which is evidence of feature selection naturally occurring in the learning process:
    sum(reg_mdls[mdlname]['fitted'].feature_importances_ > 0)
    

    XGBoost’s RF uses gain by default to compute feature importance, which is the average decrease in error across all splits where the feature was used. We can raise the threshold above 0 to select even fewer features according to their relative contribution. However, by constraining the trees’ depth, we have already forced the model to choose fewer features.

  • Regularized models with coefficients: We will...
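As a generic sketch of the embedded approach for regularized models, using SelectFromModel and LassoCV (both imported earlier), on synthetic data rather than the mailer dataset:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic regression data where only a handful of features are truly informative
X, y = make_regression(n_samples=400, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# LassoCV's L1 penalty shrinks uninformative coefficients to (near) zero;
# SelectFromModel keeps only the features whose coefficients survive it
selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)
print("features kept:", selector.get_support().sum(), "out of", X.shape[1])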

Discovering wrapper, hybrid, and advanced feature selection methods

The feature selection methods studied so far are computationally inexpensive because they require either no model fitting or only the fitting of simpler white-box models. In this section, we will learn about other, more exhaustive methods with many possible tuning options. The categories of methods included here are as follows:

  • Wrapper: Exhaustively searches for the best subset of features by fitting an ML model using a search strategy that measures improvement on a metric.
  • Hybrid: A method that combines embedded and filter methods with wrapper methods.
  • Advanced: A method that doesn’t fall into any of the previously discussed categories. Examples include dimensionality reduction, model-agnostic feature importance, and Genetic Algorithms (GAs).

And now, let’s get started with wrapper methods!

Wrapper methods

The concept behind wrapper methods is reasonably simple: evaluate different...
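To illustrate the wrapper idea, here is a sketch of sequential forward selection with mlxtend (one of the chapter’s listed libraries). The estimator, scoring metric, and subset size are arbitrary choices for illustration, not the chapter’s configuration, and the data is synthetic:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Forward selection: greedily add the feature that most improves cross-validated accuracy
sfs = SFS(LogisticRegression(max_iter=1000),
          k_features=5,      # stop once 5 features have been selected
          forward=True,
          floating=False,
          scoring='accuracy',
          cv=3).fit(X, y)
print("selected feature indices:", sfs.k_feature_idx_)
print("cross-validated score:", round(sfs.k_score_, 3))

Every candidate subset requires fitting and cross-validating the model, which is why wrapper methods are far more expensive than filter or embedded methods.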

Considering feature engineering

Let’s assume that the non-profit has chosen to use the model whose features were selected with LASSO LARS with AIC (e-llarsic) but would like to evaluate whether you can improve it further. Now that you have removed over 300 features that might have only marginally improved predictive performance but mostly added noise, you are left with more relevant features. However, you also know that the 8 features selected by e-llars produced the same RMSE as the 111 features. This means that while there’s something in those extra features that improves profitability, it does not improve the RMSE.

From a feature selection standpoint, many things can be done to approach this problem. For instance, examine the overlap and difference of features between e-llarsic and e-llars, and do feature selection variations strictly on those features to see whether the RMSE dips on any combination while keeping or improving on current profitability. However...
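A quick way to examine that overlap and difference is with Python set operations. This sketch assumes the selected feature names for each method are available as lists; the names below are made-up placeholders:

# Hypothetical feature-name lists standing in for the two selections
llarsic_feats = ['feat_a', 'feat_b', 'feat_c', 'feat_d', 'feat_e']
llars_feats = ['feat_a', 'feat_c', 'feat_f']

common = set(llarsic_feats) & set(llars_feats)  # features both methods kept
extra = set(llarsic_feats) - set(llars_feats)   # features only e-llarsic kept
print("shared:", sorted(common))
print("only in e-llarsic:", sorted(extra))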

Mission accomplished

To approach this mission, you have reduced overfitting using primarily the toolset of feature selection. The non-profit is pleased with a profit lift of roughly 30%, costing a total of $35,601, which is $30,000 less than it would cost to send everyone in the test dataset the mailer. However, they still want assurance that they can safely employ this model without worries that they’ll experience losses.

In this chapter, we’ve examined how overfitting can cause the profitability curves not to align. Misalignment is critical because it could mean that choosing a threshold based on training data would not be reliable on out-of-sample data. So, you use compare_df_plots to compare profitability between the test and train sets as you’ve done before, but this time, for the chosen model (rf_5_e-llarsic):

profits_test = reg_mdls['rf_5_e-llarsic']['profits_test']
profits_train = reg_mdls['rf_5_e-llarsic']['profits_train...

Summary

In this chapter, we have learned about how irrelevant features impact model outcomes and how feature selection provides a toolset to solve this problem. We then explored many different methods in this toolset, from the most basic filter methods to the most advanced ones. Lastly, we broached the subject of feature engineering for interpretability. Feature engineering can make for a more interpretable model that will perform better. We will cover this topic in more detail in Chapter 12, Monotonic Constraints and Model Tuning for Interpretability.

In the next chapter, we will discuss methods for bias mitigation and causal inference.

Dataset sources

Further reading

Learn more on Discord

To join the Discord community for this book – where you...
