The used car market in the United States is a thriving and substantial industry with significant economic impact. In recent years, approximately 40 million used light vehicles have been sold yearly, representing over two-thirds of the overall yearly sales in the automotive sector. In addition, the market has witnessed consistent growth, driven by the rising cost of new vehicles, longer-lasting cars, and an increasing consumer preference for pre-owned vehicles due to the perception of value for money. As a result, this market segment has become increasingly important for businesses and consumers.
Given the market opportunity, a tech startup is currently working on a machine-learning-driven, two-sided marketplace for used car sales. It plans to work much like the e-commerce site eBay, except it’s focused on cars. For example, sellers can list their cars at a fixed price or auction them, and buyers can either pay the higher fixed price or participate in the auction...
You have decided to take the following steps:
The plots will help you communicate findings to the tech startup executives and your data science colleagues.
You will find the code for this example here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python-2E/tree/main/04/UsedCars.ipynb
To run this example, you need to install the following libraries:
- mldatasets to load the dataset
- pandas and numpy to manipulate it
- sklearn (scikit-learn) and catboost to load and configure the model
- matplotlib, seaborn, shap, pdpbox, and pyale to generate and visualize the model interpretations
You should load all of them first:
import math
import os, random
import numpy as np
import pandas as pd
import mldatasets
from sklearn import metrics, ensemble, tree, inspection,\
model_selection
import catboost as cb
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from pdpbox import pdp, info_plots
from PyALE import ale
from lime.lime_tabular import LimeTabularExplainer
The following snippet of code will load...
The following code snippet will train two regressors, CatBoost and Random Forest:
cb_mdl = cb.CatBoostRegressor(
    depth=7, learning_rate=0.2, random_state=rand, verbose=False
)
cb_mdl = cb_mdl.fit(X_train, y_train)
rf_mdl = ensemble.RandomForestRegressor(n_jobs=-1, random_state=rand)
rf_mdl = rf_mdl.fit(X_train.to_numpy(), y_train.to_numpy())
Next, we can evaluate the CatBoost model using a regression plot and a few metrics. Run the following code, which will output Figure 4.1:
mdl = cb_mdl
y_train_pred, y_test_pred = mldatasets.evaluate_reg_mdl(
mdl, X_train, X_test, y_train, y_test
)
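If you want to compute this kind of metric without the helper function, a minimal sketch with scikit-learn's metrics module follows. It uses a handful of made-up prices rather than the chapter's dataset, and it is not necessarily how evaluate_reg_mdl produces its output:

```python
import numpy as np
from sklearn import metrics

# Hypothetical prices and predictions, for illustration only.
y_true = np.array([10000.0, 20000.0, 30000.0, 40000.0])
y_pred = np.array([11000.0, 19000.0, 33000.0, 38000.0])

# RMSE: square root of the mean squared error, in the target's units.
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))
# R-squared: share of the target's variance explained by the predictions.
r2 = metrics.r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.2f}, R-squared: {r2:.2f}")
```

Because RMSE is expressed in the same units as the target, it can be read directly as a typical error in price.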
The CatBoost model produced a high R-squared of 0.94 and a test RMSE of nearly 3,100. The regression plot in Figure 4.1 tells us that although there are quite a few cases that have an extremely high error, the vast majority of the 64,000 test samples were predicted fairly well. You can confirm this by running the following code:
thresh = 4000...
Feature importance refers to the extent to which each feature contributes to the final output of a model. For linear models, it’s easier to determine the importance since coefficients clearly indicate the contributions of each feature. However, this isn’t always the case for non-linear models.
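To make the linear case concrete, here is a small sketch on synthetic data (the features, coefficients, and variable names are invented for illustration). It assumes the features are on comparable scales, since coefficient magnitudes are only comparable under that condition:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Three synthetic features drawn on the same scale.
X = rng.normal(size=(1000, 3))
# The third feature has no effect on the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=1000)

lin_mdl = LinearRegression().fit(X, y)

# With comparable feature scales, |coefficient| ranks the features.
ranking = np.argsort(-np.abs(lin_mdl.coef_))
print(lin_mdl.coef_.round(2), ranking)
```

The fitted coefficients recover the contributions used to generate the target, which is exactly the kind of direct reading that non-linear models do not offer.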
To simplify the concept, let’s compare model classes to various team sports. In some sports, it’s easy to identify the players who have the greatest impact on the outcome, while in others, it isn’t. Let’s consider two sports as examples:
Model-agnostic methods imply that we will not depend on intrinsic model parameters to compute feature importance. Instead, we will consider the model as a black box, with only the inputs and output visible. So, how can we determine which inputs made a difference?
What if we altered the inputs randomly? Indeed, one of the most effective methods for evaluating feature importance is through simulations designed to measure a feature’s impact or lack thereof. In other words, let’s remove a random player from the game and observe the outcome! In this section, we will discuss two ways to achieve this: permutation feature importance and SHAP.
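As a rough illustration of altering inputs randomly, the sketch below shuffles one feature at a time and measures how much the model's score drops, using scikit-learn's permutation_importance. The data and model here are synthetic stand-ins, not the chapter's used car models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Feature 0 drives the target, feature 1 helps a little, feature 2 is noise.
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Each feature is shuffled n_repeats times; the importance is the mean
# drop in the model's score (R-squared for regressors) after shuffling.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean
print(importances.round(3))
```

Scoring on the training data keeps the sketch short; computing the importances on a held-out set is usually a better indication of what the model relies on to generalize.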
Once we have a trained model, we cannot remove a feature to assess the impact of not using it. However, we can:
Previously, we covered the concept of global explanations and SHAP values. But we didn’t demonstrate the many ways we can visualize them. As you will learn, SHAP values are very versatile and can be used to examine much more than feature importance!
But first, we must initialize a SHAP explainer. In the previous chapter, we generated the SHAP values using shap.TreeExplainer and shap.KernelExplainer. This time, we will use SHAP’s newer interface, which simplifies the process by saving SHAP values and corresponding data in a single object and much more! Instead of explicitly defining the type of explainer, you initialize it with shap.Explainer(model), which returns the callable object. Then, you load your test dataset (X_test) into the callable Explainer, and it returns an Explanation object:
cb_explainer = shap.Explainer(cb_mdl)
cb_shap = cb_explainer(X_test)
In case you are wondering, how did it know what kind of explainer to...
This section will cover a number of methods used to visualize how an individual feature impacts the outcome.
Partial Dependence Plots (PDPs) display a feature’s relationship with the outcome according to the model. In essence, the PDP illustrates the marginal effect of a feature on the model’s predicted output across all possible values of that feature.
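The marginal effect described above can be approximated by brute force. The following sketch, on synthetic data with a stand-in random forest, forces a feature to each value on a grid for every observation, records the predictions, and averages them. The grid size and variable names are illustrative choices, not taken from the chapter's notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)
model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

feature = 0
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)

# ICE: one curve per observation, one column per simulated feature value.
ice = np.empty((X.shape[0], grid.size))
for j, value in enumerate(grid):
    X_sim = X.copy()
    X_sim[:, feature] = value  # force every row to the same feature value
    ice[:, j] = model.predict(X_sim)

# PDP: the average of the ICE curves at each grid value.
pdp_curve = ice.mean(axis=0)
print(pdp_curve.round(2))
```

Plotting each row of ice against grid gives the individual curves, and plotting pdp_curve against grid gives the average curve.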
The calculation involves two steps:
year varies between 1984 and 2022, create copies of each observation with year values ranging between these two numbers. Then, run the model using these values. This first step can be plotted as the Individual Conditional Expectation (ICE) plot, with simulated values for year on the X-axis and the model output on the Y-axis, and...
Features may not influence predictions independently. For example, as discussed in Chapter 2, Key Concepts of Interpretability, determining obesity based solely on weight isn’t possible. A person’s height or body fat, muscle, and other percentages are needed. Models understand data through correlations, and features are often correlated because they are naturally related, even if they are not linearly related. Interactions are what the model may do with correlated features. For instance, a decision tree may put them in the same branch, or a neural network may arrange its parameters in such a way that it creates interaction effects. This also occurs in our case. Let’s explore this through several feature interaction visualizations.
SHAP comes with a hierarchical clustering method (shap.utils.hclust) that allows for the grouping of training features based on the “redundancy” between any given...
After reading this chapter, you should understand what model-specific methods for computing feature importance are, along with their shortcomings. You should also have learned how two model-agnostic methods, permutation feature importance and SHAP values, are calculated and interpreted, as well as the most common ways to visualize model explanations. Finally, you should know your way around global explanation methods like global summaries, feature summaries, and feature interaction plots, including their advantages and disadvantages.
In the next chapter, we will delve into local explanations.