Artificial Intelligence with Python Cookbook

Advanced Topics in Supervised Machine Learning

Following the tasters with scikit-learn, Keras, and PyTorch in the previous chapter, in this chapter, we will move on to more end-to-end examples. These examples are more advanced in the sense that they include more complex transformations and model types.

We'll be predicting partner choices with sklearn, where we'll implement a lot of custom transformer steps and more complicated machine learning pipelines. We'll then predict house prices in PyTorch and visualize feature and neuron importance. After that, we will perform active learning to decide customer values together with online learning in sklearn. In the well-known case of repeat offender prediction, we'll build a model without racial bias. Last, but not least, we'll forecast time series of CO₂ levels.

Online learning in this context (as opposed to internet-based learning) refers to a model update strategy that incorporates training data that comes in sequentially. This can be useful in cases where the dataset is very big (often the case with images, videos, and texts) or where it's important to keep the model up to date given the changing nature of the data.

In many of these recipes, we've shortened the description to the most salient details in order to highlight particular concepts. For the full details, please refer to the notebooks on GitHub.

In this chapter, we'll be covering the following recipes:

Transforming data in scikit-learn
Predicting house prices in PyTorch
Live decisioning customer values
Battling algorithmic bias
Forecasting CO₂ time series

Transforming data in scikit-learn

In this recipe, we will be building more complex pipelines using mixed-type columnar data. We'll use a speed dating dataset that was published in 2006 by Fisman et al.: https://doi.org/10.1162/qjec.2006.121.2.673

Perhaps this recipe will be informative in more ways than one, and we'll learn something useful about the mechanics of human mating choices.

The dataset description on the OpenML website reads as follows:

This data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute first date with every other participant of the opposite sex. At the end of their 4 minutes, participants were asked whether they would like to see their date again. They were also asked to rate their date on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include demographics, dating habits, self-perception across key attributes, beliefs in terms of what others find valuable in a mate, and lifestyle information.

The problem is to predict mate choices from what we know about participants and their matches. This dataset presents some challenges that can serve an illustrative purpose:

It contains 123 different features, of different types:
- Categorical
- Numerical
- Range features

It also contains the following:

Some missing values
Target imbalance

On the way to solving this problem of predicting mate choices, we will build custom encoders in scikit-learn and a pipeline comprising all features and their preprocessing steps.

The primary focus in this recipe will be on pipelines and transformers. In particular, we will build a custom transformer for working with range features and another one for numerical features.

Getting ready

We'll need the following libraries for this recipe. They are as follows:

OpenML to download the dataset
openml_speed_dating_pipeline_steps to use our custom transformer
imbalanced-learn to work with imbalanced classes
shap to show us the importance of features

In order to install them, we can use pip again:

pip install -q openml openml_speed_dating_pipeline_steps==0.5.5 imbalanced_learn category_encoders shap

OpenML is an organization that intends to make data science and machine learning reproducible and therefore more conducive to research. The OpenML website not only hosts datasets, but also allows the uploading of machine learning results to public leaderboards under the condition that the implementation relies solely on open source. These results and how they were obtained can be inspected in complete detail by anyone who's interested.

In order to retrieve the data, we will use the OpenML Python API. The get_dataset() method will download the dataset; with get_data(), we can get pandas DataFrames for features and target, and we'll conveniently get the information on categorical and numerical feature types:

import openml
dataset = openml.datasets.get_dataset(40536)
X, y, categorical_indicator, _ = dataset.get_data(
  dataset_format='DataFrame',
  target=dataset.default_target_attribute
)
categorical_features = list(X.columns[categorical_indicator]) numeric_features = list(
  X.columns[[not(i) for i in categorical_indicator]]
)

In the original version of the dataset, as presented in the paper, there was a lot more work to do. However, the version of the dataset on OpenML already has missing values represented as numpy.nan, which lets us skip this conversion. You can see this preprocessor on GitHub if you are interested: https://github.com/benman1/OpenML-Speed-Dating

Alternatively, you can use a download link from the OpenML dataset web page at https://www.openml.org/data/get_csv/13153954/speeddating.arff.

With the dataset loaded, and the libraries installed, we are ready to start cracking.

How to do it...

Pipelines are a way of describing how machine learning algorithms, including preprocessing steps, can follow one another in a sequence of transformations on top of the raw dataset before applying a final predictor. We will see examples of these concepts in this recipe and throughout this book.

A few things stand out pretty quickly looking at this dataset. We have a lot of categorical features. So, for modeling, we will need to encode them numerically, as in the Modeling and predicting in Keras recipe in Chapter 1, Getting Started with Artificial Intelligence in Python.

Encoding ranges numerically

Some of these are actually encoded ranges. This means these are ordinal, in other words, they are categories that are ordered; for example, the d_interests_correlate feature contains strings like these:

[[0-0.33], [0.33-1], [-1-0]]

If we were to treat these ranges as categorical variables, we'd lose the information about the order, and we would lose information about how different two values are. However, if we convert them to numbers, we will keep this information and we would be able to apply other numerical transformations on top.

We are going to implement a transformer to plug into an sklearn pipeline in order to convert these range features to numerical features. The basic idea of the conversion is to extract the upper and lower bounds of these ranges as follows:

def encode_ranges(range_str):
  splits = range_str[1:-1].split('-')
  range_max = splits[-1]
  range_min = '-'.join(splits[:-1])
  return range_min, range_max

examples = X['d_interests_correlate'].unique()
[encode_ranges(r) for r in examples]

We'll see this for our example:

[('0', '0.33'), ('0.33', '1'), ('-1', '0')]

In order to get numerical features, we can then take the mean between the two bounds. As we've mentioned before, on OpenML, not only are results shown, but also the models are transparent. Therefore, if we want to submit our model, we can only use published modules. We created a module and published it in the pypi Python package repository, where you can find the package with the complete code: https://pypi.org/project/openml-speed-dating-pipeline-steps/.

Here is the simplified code for RangeTransformer:

from sklearn.base import BaseEstimator, TransformerMixin
import category_encoders.utils as util

class RangeTransformer(BaseEstimator, TransformerMixin):
  def __init__(self, range_features=None, suffix='_range/mean', n_jobs=-1):
    assert isinstance(range_features, list) or range_features is None
    self.range_features = range_features
    self.suffix = suffix
    self.n_jobs = n_jobs

  def fit(self, X, y=None):
    return self

  def transform(self, X, y=None):
    X = util.convert_input(X)
    if self.range_features is None:
      self.range_features = list(X.columns)

    range_data = pd.DataFrame(index=X.index)
    for col in self.range_features:
      range_data[str(col) + self.suffix] = pd.to_numeric(
        self._vectorize(X[col])
      )
    self.feature_names = list(range_data.columns)
    return range_data

    def _vectorize(self, s):
        return Parallel(n_jobs=self.n_jobs)(
            delayed(self._encode_range)(x) for x in s
        )

    @staticmethod
    @lru_cache(maxsize=32)
    def _encode_range(range_str):
        splits = range_str[1:-1].split('-')
        range_max = float(splits[-1])
        range_min = float('-'.join(splits[:-1]))
        return sum([range_min, range_max]) / 2.0

    def get_feature_names(self):
        return self.feature_names

This is a shortened snippet of the custom transformer for ranges. Please see the full implementation on GitHub at https://github.com/benman1/OpenML-Speed-Dating.

Please pay attention to how the fit() and transform() methods are used. We don't need to do anything in the fit() method, because we always apply the same static rule. The transfer() method applies this rule. We've seen the examples previously. What we do in the transfer() method is to iterate over columns. This transformer also shows the use of the parallelization pattern typical to scikit-learn. Additionally, since these ranges repeat a lot, and there aren't so many, we'll use a cache so that, instead of doing costly string transformations, the range value can be retrieved from memory once the range has been processed once.

An important thing about custom transformers in scikit-learn is that they should inherit from BaseEstimator and TransformerMixin, and implement the fit() and transform() methods. Later on, we will require get_feature_names() so we can find out the names of the features generated.

Deriving higher-order features

Let's implement another transformer. As you may have noticed, we have different types of features that seem to refer to the same personal attributes:

Personal preferences
Self-assessment
Assessment of the other person

It seems clear that differences between any of these features could be significant, such as the importance of sincerity versus how sincere someone assesses a potential partner. Therefore, our next transformer is going to calculate the differences between numerical features. This is supposed to help highlight these differences.

These features are derived from other features, and combine information from two (or potentially more features). Let's see what the NumericDifferenceTransformer feature looks like:

import operator

class NumericDifferenceTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, features=None,
                 suffix='_numdist', op=operator.sub, n_jobs=-1
                 ):
        assert isinstance(
            features, list
        ) or features is None
        self.features = features
        self.suffix = suffix
        self.op = op
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        X = util.convert_input(X)
        if self.features is None:
            self.features = list(
                X.select_dtypes(include='number').columns
            )
        return self

    def _col_name(self, col1, col2):
        return str(col1) + '_' + str(col2) + self.suffix

    def _feature_pairs(self):
        feature_pairs = []
        for i, col1 in enumerate(self.features[:-1]):
            for col2 in self.features[i+1:]:
                feature_pairs.append((col1, col2))
        return feature_pairs

    def transform(self, X, y=None):
        X = util.convert_input(X)

        feature_pairs = self._feature_pairs()
        columns = Parallel(n_jobs=self.n_jobs)(
            delayed(self._col_name)(col1, col2)
            for col1, col2 in feature_pairs
        )
        data_cols = Parallel(n_jobs=self.n_jobs)(
            delayed(self.op)(X[col1], X[col2])
            for col1, col2 in feature_pairs
        )
        data = pd.concat(data_cols, axis=1)
        data.rename(
            columns={i: col for i, col in enumerate(columns)},
            inplace=True, copy=False
        )
        data.index = X.index
        return data

    def get_feature_names(self):
        return self.feature_names

This is a custom transformer that calculates differences between numerical features. Please refer to the full implementation in the repository of the OpenML-Speed-Dating library at https://github.com/benman1/OpenML-Speed-Dating.

This transformer has a very similar structure to RangeTransformer. Please note the parallelization between columns. One of the arguments to the __init__() method is the function that is used to calculate the difference. This is operator.sub() by default. The operator library is part of the Python standard library and implements basic operators as functions. The sub() function does what it sounds like:

import operator
operator.sub(1, 2) == 1 - 2
# True

This gives us a prefix or functional syntax for standard operators. Since we can pass functions as arguments, this gives us the flexibility to specify different operators between columns.

The fit() method this time just collects the names of numerical columns, and we'll use these names in the transform() method.

Combining transformations

We will put these transformers together with ColumnTransformer and the pipeline. However, we'll need to make the association between columns and their transformations. We'll define different groups of columns:

range_cols = [
    col for col in X.select_dtypes(include='category')
    if X[col].apply(lambda x: x.startswith('[')
    if isinstance(x, str) else False).any()
]
cat_columns = list(
  set(X.select_dtypes(include='category').columns) - set(range_cols)
)
num_columns = list(
    X.select_dtypes(include='number').columns
)

Now we have columns that are ranges, columns that are categorical, and columns that are numerical, and we can assign pipeline steps to them.

In our case, we put this together as follows, first in a preprocessor:

from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
import category_encoders as ce
import openml_speed_dating_pipeline_steps as pipeline_steps

preprocessor = ColumnTransformer(
 transformers=[
 ('ranges', Pipeline(steps=[
 ('impute', pipeline_steps.SimpleImputerWithFeatureNames(strategy='constant', fill_value=-1)),
 ('encode', pipeline_steps.RangeTransformer())
 ]), range_cols),
 ('cat', Pipeline(steps=[
 ('impute', pipeline_steps.SimpleImputerWithFeatureNames(strategy='constant', fill_value='-1')),
 ('encode', ce.OneHotEncoder(
 cols=None, # all features that it given by ColumnTransformer
 handle_unknown='ignore',
 use_cat_names=True
 )
 )
 ]), cat_columns),
 ('num', pipeline_steps.SimpleImputerWithFeatureNames(strategy='median'), num_columns),
 ],
 remainder='drop', n_jobs=-1
)

And then we'll put the preprocessing in a pipeline, together with the estimator:

def create_model(n_estimators=100):
    return Pipeline(
        steps=[('preprocessor', preprocessor),
               ('numeric_differences', pipeline_steps.NumericDifferenceTransformer()),
               ('feature_selection', SelectKBest(f_classif, k=20)),
               ('rf', BalancedRandomForestClassifier(
                  n_estimators=n_estimators,
                  )
               )]
       )

Here is the performance in the test set:

from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  X, y,
  test_size=0.33,
  random_state=42,
  stratify=y
)
clf = create_model(50)
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
auc = roc_auc_score(y_test, y_predicted)
print('auc: {:.3f}'.format(auc))

We get the following performance as an output:

auc: 0.779

This is a very good performance, as you can see comparing it to the leaderboard on OpenML.

How it works...

It is time to explain basic scikit-learn terminology relevant to this recipe. Neither of these concepts corresponds to existing machine learning algorithms, but to composable modules:

Transformer (in scikit-learn): A class that is derived from sklearn.base.TransformerMixin; it has fit() and transform() methods. These involve preprocessing steps or feature selection.
Predictor: A class that is derived from either sklearn.base.ClassifierMixin or sklearn.base.RegressorMixin; it has fit() and predict() methods. These are machine learning estimators, in other words, classifiers or regressors.
Pipeline: An interface that wraps all steps together and gives you a single interface for all steps of the transformation and the resulting estimator. A pipeline again has fit() and predict() methods.

There are a few things to point out regarding our approach. As we said before, we have missing values, so we have to impute (meaning replace) missing values with other values. In this case, we replace missing values with -1. In the case of categorical variables, this will be a new category, while in the case of numerical variables, it will become a special value that the classifier will have to handle.

ColumnTransformer came with version 0.20 of scikit-learn and was a long-awaited feature. Since then, ColumnTransformer can often be seen like this, for example:

from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

feature_preprocessing = make_column_transformer(
  (StandardScaler(), ['column1', 'column2']),
  (OneHotEncoder(), ['column3', 'column4', 'column5']) 
)

feature_preprocessing can then be used as usual with the fit(), transform(), and fit_transform() methods:

processed_features = feature_preprocessing.fit_transform(X)

Here, X means our features.

Alternatively, we can put ColumnTransformer as a step into a pipeline, for example, like this:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

make_pipeline(
    feature_preprocessing,
    LogisticRegression()
)

Our classifier is a modified form of the random forest classifier. A random forest is a collection of decision trees, each trained on random subsets of the training data. The balanced random forest classifier (Chen et al.: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf) makes sure that each random subset is balanced between the two classes.

Since NumericDifferenceTransformer can provide lots of features, we will incorporate an extra step of model-based feature selection.

There's more...

You can see the complete example with the speed dating dataset, a few more custom transformers, and an extended imputation class in the GitHub repository of the openml_speed_dating_pipeline_steps library and notebook, on GitHub at https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-Cookbook/blob/master/chapter02/Transforming%20Data%20in%20Scikit-Learn.ipynb.

Both RangeTransformer and NumericDifferenceTransformer could also have been implemented using FunctionTransformer in scikit-learn.

ColumnTransformer is especially handy for pandas DataFrames or NumPy arrays since it allows the specification of different operations for different subsets of the features. However, another option is FeatureUnion, which allows concatenation of the results from different transformations. For yet another way to chain our operations together, please have a look at PandasPicker in our repository.

Predicting house prices in PyTorch

In this recipe, the aim of the problem is to predict house prices in Ames, Iowa, given 81 features describing the house, area, land, infrastructure, utilities, and much more. The Ames dataset has a nice combination of categorical and continuous features, a good size, and, perhaps most importantly, it doesn't suffer from problems of potential redlining or data entry like other, similar datasets, such as Boston Housing. We'll concentrate on the main aspects of PyTorch modeling here. We'll do online learning, analogous to Keras, in the Modeling and predicting in Keras recipe in Chapter 1, Getting Started with Artificial Intelligence in Python. If you want to see more details on some of the steps, please look at our notebook on GitHub.

As a little extra, we will also demonstrate neuron importance for the models developed in PyTorch. You can try out different network architectures in PyTorch or model types. The focus in this recipe is on the methodology, not an exhaustive search for the best solution.

Getting ready

In order to prepare for the recipe, we need to do a few things. We'll download the data as in the previous recipe, Transforming data in scikit-learn, and perform some preprocessing by following these steps:

from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=42165, as_frame=True)

You can see the full dataset description at OpenML: https://www.openml.org/d/42165.

Let's look at the features:

import pandas as pd
data_ames = pd.DataFrame(data.data, columns=data.feature_names)
data_ames['SalePrice'] = data.target
data_ames.info()

Here is the DataFrame information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null float64
MSSubClass       1460 non-null float64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null float64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null float64
OverallCond      1460 non-null float64
YearBuilt        1460 non-null float64
YearRemodAdd     1460 non-null float64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null float64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null float64
BsmtUnfSF        1460 non-null float64
TotalBsmtSF      1460 non-null float64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null float64
2ndFlrSF         1460 non-null float64
LowQualFinSF     1460 non-null float64
GrLivArea        1460 non-null float64
BsmtFullBath     1460 non-null float64
BsmtHalfBath     1460 non-null float64
FullBath         1460 non-null float64
HalfBath         1460 non-null float64
BedroomAbvGr     1460 non-null float64
KitchenAbvGr     1460 non-null float64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null float64
Functional       1460 non-null object
Fireplaces       1460 non-null float64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null float64
GarageArea       1460 non-null float64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null float64
OpenPorchSF      1460 non-null float64
EnclosedPorch    1460 non-null float64
3SsnPorch        1460 non-null float64
ScreenPorch      1460 non-null float64
PoolArea         1460 non-null float64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null float64
MoSold           1460 non-null float64
YrSold           1460 non-null float64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null float64
dtypes: float64(38), object(43)
memory usage: 924.0+ KB

PyTorch and seaborn are installed by default in Colab. We will assume, even if you are working with your self-hosted install by now, that you'll have the libraries installed.

We'll use one more library, however, captum, which allows the inspection of PyTorch models for feature and neuron importance:

!pip install captum

There is one more thing. We'll assume you have a GPU available. If you don't have a GPU in your computer, we'd recommend you try this recipe on Colab. In Colab, you'll have to choose a runtime type with GPU.

After all these preparations, let's see how we can predict house prices.

How to do it...

The Ames Housing dataset is a small- to mid-sized dataset (1,460 rows) with 81 features, both categorical and numerical. There are no missing values.

In the Keras recipe previously, we've seen how to scale the variables. Scaling is important here because all variables have different scales. Categorical variables need to be converted to numerical types in order to feed them into our model. We have the choice of one-hot encoding, where we create dummy variables for each categorical factor, or ordinal encoding, where we number all factors and replace the strings with these numbers. We could feed the dummy variables in like any other float variable, while ordinal encoding would require the use of embeddings, linear neural network projections that re-order the categories in a multi-dimensional space.

We take the embedding route here:

import numpy as np
from category_encoders.ordinal import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

num_cols = list(data_ames.select_dtypes(include='float'))
cat_cols = list(data_ames.select_dtypes(include='object'))

ordinal_encoder = OrdinalEncoder().fit(
    data_ames[cat_cols]
)
standard_scaler = StandardScaler().fit(
    data_ames[num_cols]
)

X = pd.DataFrame(
    data=np.column_stack([
        ordinal_encoder.transform(data_ames[cat_cols]),
        standard_scaler.transform(data_ames[num_cols])
    ]),
    columns=cat_cols + num_cols
)

We go through the data analysis, such as correlation and distribution plots, in a lot more detail in the notebook on GitHub.

Now we can split the data into training and test sets, as we did in previous recipes. Here, we add a stratification of the numerical variable. This makes sure that different sections (five of them) are included at equal measure in both training and test sets:

np.random.seed(12)  
from sklearn.model_selection import train_test_split

bins = 5
sale_price_bins = pd.qcut(
    X['SalePrice'], q=bins, labels=list(range(bins))
)
X_train, X_test, y_train, y_test = train_test_split(
    X.drop(columns='SalePrice'),
    X['SalePrice'],
    random_state=12,
    stratify=sale_price_bins
)

Before going ahead, let's look at the importance of the features using a model-independent technique.

Before we run anything, however, let's make sure we are running on the GPU:

device = torch.device('cuda')
torch.backends.cudnn.benchmark = True

Let's build our PyTorch model, similar to the Classifying in scikit-learn, Keras, and PyTorch recipe in Chapter 1, Getting Started with Artificial Intelligence in Python.

We'll implement a neural network regression with batch inputs using PyTorch. This will involve the following steps:

Converting data to torch tensors
Defining the model architecture
Defining the loss criterion and optimizer
Creating a data loader for batches
Running the training

Without further preamble, let's get to it:

Begin by converting the data to torch tensors:

from torch.autograd import Variable 

num_features = list(
    set(num_cols) - set(['SalePrice', 'Id'])
)
X_train_num_pt = Variable(
    torch.cuda.FloatTensor(
        X_train[num_features].values
    )
)
X_train_cat_pt = Variable(
    torch.cuda.LongTensor(
        X_train[cat_cols].values
    )
)
y_train_pt = Variable(
    torch.cuda.FloatTensor(y_train.values)
).view(-1, 1)
X_test_num_pt = Variable(
    torch.cuda.FloatTensor(
        X_test[num_features].values
    )
)
X_test_cat_pt = Variable(
   torch.cuda.LongTensor(
        X_test[cat_cols].values
    ).long()
)
y_test_pt = Variable(
    torch.cuda.FloatTensor(y_test.values)
).view(-1, 1)

This makes sure we load our numerical and categorical data into separate variables, similar to NumPy. If you mix data types in a single variable (array/matrix), they'll become objects. We want to get our numerical variables as floats, and the categorical variables as long (or int) indexing our categories. We also separate the training and test sets.

Clearly, an ID variable should not be important in a model. In the worst case, it could introduce a target leak if there's any correlation of the ID with the target. We've removed it from further processing.

Define the model architecture:

class RegressionModel(torch.nn.Module): 
  
    def __init__(self, X, num_cols, cat_cols, device=torch.device('cuda'), embed_dim=2, hidden_layer_dim=2, p=0.5): 
        super(RegressionModel, self).__init__() 
        self.num_cols = num_cols
        self.cat_cols = cat_cols
        self.embed_dim = embed_dim
        self.hidden_layer_dim = hidden_layer_dim
        
        self.embeddings = [
            torch.nn.Embedding(
                num_embeddings=len(X[col].unique()),
                embedding_dim=embed_dim
            ).to(device)
            for col in cat_cols
        ]
        hidden_dim = len(num_cols) + len(cat_cols) * embed_dim,
        
        # hidden layer
        self.hidden = torch.nn.Linear(torch.IntTensor(hidden_dim), hidden_layer_dim).to(device)
        self.dropout_layer = torch.nn.Dropout(p=p).to(device)
        self.hidden_act = torch.nn.ReLU().to(device)
        
        # output layer
        self.output = torch.nn.Linear(hidden_layer_dim, 1).to(device)
    
    def forward(self, num_inputs, cat_inputs):
        '''Forward method with two input variables -
        numeric and categorical.
        '''
        cat_x = [
            torch.squeeze(embed(cat_inputs[:, i] - 1))
            for i, embed in enumerate(self.embeddings)
        ]
        x = torch.cat(cat_x + [num_inputs], dim=1)
        x = self.hidden(x)
        x = self.dropout_layer(x)
        x = self.hidden_act(x)
        y_pred = self.output(x)
        return y_pred

house_model = RegressionModel(
    data_ames, num_features, cat_cols
)

Our activation function on the two linear layers (dense layers, in Keras terminology) is the rectified linear unit activation (ReLU) function. Please note that we couldn't have encapsulated the same architecture (easily) as a sequential model because of the different operations occurring on categorical and numerical types.

Next, define the loss criterion and optimizer. We take the mean square error (MSE) as the loss and stochastic gradient descent as our optimization algorithm:

criterion = torch.nn.MSELoss().to(device)
optimizer = torch.optim.SGD(house_model.parameters(), lr=0.001)

Now, create a data loader to input a batch of data at a time:

data_batch = torch.utils.data.TensorDataset(
    X_train_num_pt, X_train_cat_pt, y_train_pt
)
dataloader = torch.utils.data.DataLoader(
    data_batch, batch_size=10, shuffle=True
)

We set a batch size of 10. Now we can do our training.

Run the training!

Since this seems so much more verbose than what we saw in Keras in the Classifying in scikit-learn, Keras, and PyTorch recipe in Chapter 1, Getting Started with Artificial Intelligence in Python, we commented this code quite heavily. Basically, we have to loop over epochs, and within each epoch an inference is performance, an error is calculated, and the optimizer applies the adjustments according to the error.

This is the loop over epochs without the inner loop for training:

from tqdm.notebook import trange

train_losses, test_losses = [], []
n_epochs = 30
for epoch in trange(n_epochs):
    train_loss, test_loss = 0, 0
    # training code will go here:
    # <...>

    # print the errors in training and test:
    if epoch % 10 == 0 :
        print(
            'Epoch: {}/{}\t'.format(epoch, 1000),
            'Training Loss: {:.3f}\t'.format(
                train_loss / len(dataloader)
            ),
            'Test Loss: {:.3f}'.format(
                test_loss / len(dataloader)
            )
        )

The training is performed inside this loop over all the batches of the training data. This looks as follows:

    for (x_train_num_batch,
         x_train_cat_batch,
         y_train_batch) in dataloader:
        # predict y by passing x to the model 
        (x_train_num_batch,
         x_train_cat_batch, y_train_batch) = (
                x_train_num_batch.to(device),
                x_train_cat_batch.to(device),
                y_train_batch.to(device)
        )
        pred_ytrain = house_model.forward(
            x_train_num_batch, x_train_cat_batch
        )
        # calculate and print loss:
        loss = torch.sqrt(
            criterion(pred_ytrain, y_train_batch)
        ) 

        # zero gradients, perform a backward pass, 
        # and update the weights. 
        optimizer.zero_grad() 
        loss.backward() 
        optimizer.step()
        train_loss += loss.item()
        with torch.no_grad():
            house_model.eval()
            pred_ytest = house_model.forward(
                X_test_num_pt, X_test_cat_pt
            )
            test_loss += torch.sqrt(
                criterion(pred_ytest, y_test_pt)
            )

        train_losses.append(train_loss / len(dataloader))
        test_losses.append(test_loss / len(dataloader))

This is the output we get. TQDM provides us with a helpful progress bar. At every tenth epoch, we print an update to show training and validation performance:

Please note that we take the square root of nn.MSELoss because nn.MSELoss in PyTorch is defined as follows:

((input-target)**2).mean()

Let's plot how our model performs for training and validation datasets during training:

plt.plot(
    np.array(train_losses).reshape((n_epochs, -1)).mean(axis=1),
    label='Training loss'
)
plt.plot(
    np.array(test_losses).reshape((n_epochs, -1)).mean(axis=1),
    label='Validation loss'
)
plt.legend(frameon=False)
plt.xlabel('epochs')
plt.ylabel('MSE')

The following diagram shows the resulting plot:

We stopped our training just in time before our validation loss stopped decreasing.

We can also rank and bin our target variable and plot the predictions against it in order to see how the model is performing across the whole spectrum of house prices. This is to avoid the situation in regression, especially with MSE as the loss, that you only predict well for a mid-range of values, close to the mean, but don't do well for anything else. You can find the code for this in the notebook on GitHub. This is called a lift chart (here with 10 bins):

We can see that the model, in fact, predicts very closely across the whole range of house prices. In fact, we get a Spearman rank correlation of about 93% with very high significance, which confirms that this model performs with high accuracy.

How it works...

The deep learning neural network frameworks use different optimization algorithms. Popular among them are Stochastic Gradient Descent (SGD), Root Mean Square Propogation (RMSProp), and Adaptive Moment Estimation (ADAM).

We defined stochastic gradient descent as our optimization algorithm. Alternatively, we could have defined other optimizers:

opt_SGD = torch.optim.SGD(net_SGD.parameters(), lr=LR)
opt_Momentum = torch.optim.SGD(net_Momentum.parameters(), lr=LR, momentum=0.6)
opt_RMSprop = torch.optim.RMSprop(net_RMSprop.parameters(), lr=LR, alpha=0.1)
opt_Adam = torch.optim.Adam(net_Adam.parameters(), lr=LR, betas=(0.8, 0.98))

SGD works the same as gradient descent except that it works on a single example at a time. The interesting part is that the convergence is similar to the gradient descent and is easier on the computer memory.

RMSProp works by adapting the learning rates of the algorithm according to the gradient signs. The simplest of the variants checks the last two gradient signs and then adapts the learning rate by increasing it by a fraction if they are the same, or decreases it by a fraction if they are different.

ADAM is one of the most popular optimizers. It's an adaptive learning algorithm that changes the learning rate according to the first and second moments of the gradients.

Captum is a tool that can help us understand the ins and outs of the neural network model learned on the datasets. It can assist in learning the following:

Feature importance
Layer importance
Neuron importance

This is very important in learning interpretable neural networks. Here, integrated gradients have been applied to understand feature importance. Later, neuron importance is also demonstrated by using the layer conductance method.

There's more...

Given that we have our neural network defined and trained, let's find the important features and neurons using the captum library:

from captum.attr import (
    IntegratedGradients,
    LayerConductance,
    NeuronConductance
)
house_model.cpu()
for embedding in house_model.embeddings:
    embedding.cpu()

house_model.cpu()
ing_house = IntegratedGradients(forward_func=house_model.forward, )
#X_test_cat_pt.requires_grad_()
X_test_num_pt.requires_grad_()
attr, delta = ing_house.attribute(
 X_test_num_pt.cpu(),
 target=None,
 return_convergence_delta=True,
 additional_forward_args=X_test_cat_pt.cpu()
)
attr = attr.detach().numpy()

Now, we have a NumPy array of feature importances.

Layer and neuron importance can also be obtained using this tool. Let's look at the neuron importances of our first layer. We can pass on house_model.act1, which is the ReLU activation function on top of the first linear layer:

cond_layer1 = LayerConductance(house_model, house_model.act1)
cond_vals = cond_layer1.attribute(X_test, target=None)
cond_vals = cond_vals.detach().numpy()
df_neuron = pd.DataFrame(data = np.mean(cond_vals, axis=0), columns=['Neuron Importance'])
df_neuron['Neuron'] = range(10)

This is how it looks:

The diagram shows the neuron importances. Apparently, one neuron is just not important.

We can also see the most important variables by sorting the NumPy array we've obtained earlier:

df_feat = pd.DataFrame(np.mean(attr, axis=0), columns=['feature importance'] )
df_feat['features'] = num_features
df_feat.sort_values(
    by='feature importance', ascending=False
).head(10)

So here's a list of the 10 most important variables:

Often, feature importances can help us to both understand the model and prune our model to become less complex (and hopefully less overfitted).

Live decisioning customer values

Let's assume we have the following scenario: we have a list of customers to call in order to sell them our product. Each phone call costs money in call center personal salaries, so we want to reduce these costs as much as possible. We have certain information about each customer that could help us determine whether they are likely to buy. After every call, we can update our model. The main goal is to call only the most promising customers and to improve our insights into which customers are more likely to pay for our product.

In this recipe, we will approach this with active learning, a strategy where we actively decide what to explore (and learn) next. Our model will help decide whom to call. Because we will update our model after each query (phone call), we will use online learning models.

Getting ready

We'll prepare for our recipe by downloading our dataset and installing a few libraries.

Again, we will get the data from OpenML:

!pip install -q openml

import openml 
dataset = openml.datasets.get_dataset(1461)
X, y, categorical_indicator, _ = dataset.get_data(
  dataset_format='DataFrame',
  target=dataset.default_target_attribute
)
categorical_features = X.columns[categorical_indicator]
numeric_features = X.columns[
  [not(i) for i in categorical_indicator]
]

This dataset is called bank-marketing, and you can see a description on OpenML at https://www.openml.org/d/1461.

For each row, describing a single person, we have different features, numerical and categorical, that tell us about demographics and customer history.

To model the likelihood of customers signing up for our product, we will use the scikit-multiflow package that specializes in online models. We will also use the category_encoders package again:

!pip install scikit-multiflow category_encoders

With these two libraries in place, we can start the recipe.

How to do it...

We need to implement an exploration strategy and a model that is being continuously updated. We are using the online version of the random forest, the Hoeffding Tree, as our model. We are estimating the uncertainties at every step, and based on that we will return a candidate to call next.

As always, we will need to define a few preprocessing steps:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
import category_encoders as ce

ordinal_encoder = ce.OrdinalEncoder(
  cols=None,  # all features that it encounters
  handle_missing='return_nan',
  handle_unknown='ignore'
).fit(X)

preprocessor = ColumnTransformer(
  transformers=[
    ('cat', ordinal_encoder, categorical_features),
    ('num', FunctionTransformer(validate=False), numeric_features)
])
preprocessor = preprocessor.fit(X)

Then we come to our active learning approach itself. This is inspired by modAL.models.ActiveLearner:

import numpy as np
from skmultiflow.trees.hoeffding_tree import HoeffdingTreeClassifier 
from sklearn.metrics import roc_auc_score
import random

class ActivePipeline:
  def __init__(self, model, preprocessor, class_weights):
    self.model = model
    self.preprocessor = preprocessor
    self.class_weights = class_weights

  @staticmethod
  def values(X):
    if isinstance(X, (np.ndarray, np.int64)):
      return X
    else:
      return X.values

  def preprocess(self, X):
    X_ = pd.DataFrame(
      data=self.values(X),
      columns=[
        'V1', 'V2', 'V3', 'V4',
        'V5', 'V6', 'V7', 'V8',
        'V9', 'V10', 'V11', 'V12',
        'V13', 'V14', 'V15', 'V16'
      ])
    return self.preprocessor.transform(X_)

  def fit(self, X, ys):
    weights = [self.class_weights[y] for y in ys]
    self.model.fit(self.preprocess(X), self.values(ys))

  def update(self, X, ys):
    if isinstance(ys, (int, float)):
      weight = self.class_weights[y]
    else:
      weight = [self.class_weights[y] for y in ys]

    self.model.partial_fit(
      self.preprocess(X),
      self.values(ys),
      weight
    )

  def predict(self, X):
    return self.model.predict(
      self.preprocess(X)
    )

  def predict_proba(self, X):
    return self.model.predict_proba(
      self.preprocess(X)
    )

  @staticmethod
  def entropy(preds):
    return -np.sum(
      np.log((preds + 1e-15) * preds)
      / np.log(np.prod(preds.size))
    )

  def max_margin_uncertainty(self, X, method: str='entropy',
      exploitation: float=0.9, favor_class: int=1, k: int=1
  ):
    '''similar to modAL.uncertainty.margin_uncertainty
    ''' 
    probs = self.predict_proba(X)
    if method=='margin':
      uncertainties = np.abs(probs[:,2] - probs[:, 1]) / 2.0
    elif method=='entropy':
      uncertainties = np.apply_along_axis(self.entropy, 1, probs[:, (1,2)])
    else: raise(ValueError('method not implemented!'))

    if favor_class is None:
      weights = uncertainties 
    else: weights = (1.0 - exploitation) * uncertainties + exploitation * probs[:, favor_class]

    if self.sampling:
      ind = random.choices(
        range(len(uncertainties)), weights, k=k
      )
    else:
      ind = np.argsort(weights, axis=0)[::-1][:k]
    return ind, np.mean(uncertainties[ind])

  def score(self, X, y, scale=True):
    probs = self.predict_proba(X, probability=2)
    if scale:
      probs = np.clip(probs - np.mean(probs) + 0.5, 0, 1)
    return roc_auc_score(y, probs)

Again, we create a scikit-learn-compatible class. It basically holds a machine learning model and a data preprocessor. We implement fit() and predict(), but also score() to get a model performance. We also implement an update() method that calls partial_fit() of the machine learning model. Calling partial_fit() instead of fit() considerably speeds up the computations, because we don't have to start from scratch every time we get new data.

Here's how to create the active learning pipeline:

active_pipeline = ActivePipeline(
  HoeffdingTreeClassifier(),
  preprocessor,
  class_weights.to_dict()
)
active_pipeline.model.classes = [0, 1, 2]

We can run different simulations on our dataset with this setup. For example, we can compare a lot of experimentation (0.5 exploitation) against only exploitation (1.0), or no learning at all after the first batch. We basically go through a loop:

Via active_pipeline.max_margin_uncertainty(), we present data to the active pipeline, and get a number of data points that integrate uncertainty and target predictions according to our preferences of the integration method.
Once we get the actual results for these data points, we can update our model: active_pipeline.update().

You can see an example of this in the notebook on GitHub.

We can see that curious wins out after the first few examples. Exploitation is actually the least successful scheme. By not updating the model, performance deteriorates over time:

This is an ideal scenario for active learning or reinforcement learning, because, not unlike in reinforcement learning, uncertainty can be an additional criterion, apart from positive expectation, from a customer. Over time, this entropy reduction-seeking behavior reduces as the model's understanding of customers improves.

How it works...

It's worth delving a bit more into a few of the concepts and strategies employed in this recipe.

Active learning

Active learning means that we can actively query for more information; in other words, exploration is part of our strategy. This can be useful in scenarios where we have to actively decide what to learn, and where what we learn influences not only how much our model learns and how well, but also how much return on an investment we can get.

Hoeffding Tree

The Hoeffding Tree (also known as the Very Fast Decision Tree, VFDT for short) was introduced in 2001 by Geoff Hulten and others (Mining time-changing data streams). It is an incrementally growing decision tree for streamed data. Tree nodes are expanded based on the Hoeffding bound (or additive Chernoff bound). It was theoretically shown that, given sufficient training data, a model learned by the Hoeffding tree converges very closely to the one built by a non-incremental learner.

The Hoeffding bound is defined as follows:

It's important to note that the Hoeffding Tree doesn't deal with data distributions that change over time.

Class weighting

Since we are dealing with an imbalanced dataset, let's use class weights. This basically means that we are upsampling the minority (signing up) class and downsampling the majority class (not signing up).

The formula for the class weights is as follows:

Similarly, in Python, we can write the following:

class_weights = len(X) / (y.astype(int).value_counts() * 2)

We can then use these class weights for sampling.

We'll close the recipe with a few more pointers.

Battling algorithmic bias

Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a commercial algorithm that assigns risk scores to criminal defendants based on their case records. This risk score corresponds to the likelihood of reoffending (recidivism) and of committing violent crimes, and this score is used in court to help determine sentences. The ProPublica organization obtained scores and data in 1 county in Florida, containing data on about 7,000 people. After waiting for 2 years to see who reoffended, they audited the COMPAS model in 2016 and found very concerning issues with the model. Among the ProPublica findings was discrimination according to gender, race, and ethnicity, particularly in the case of over-predicting recidivism for ethnic minorities.

Discrimination presents a major problem for AI systems, and illustrates the importance of auditing your model and the data you feed into your model. Models built on human decisions will amplify human biases if this bias is ignored. Not just from a legal perspective, but also ethically, we want to build models that don't disadvantage certain groups. This poses an interesting challenge for model building.

Generally, we would think that justice should be blind to gender or race. This means that court decisions should not take these sensitive variables like race or gender into account. However, even if we omit them from our model training, these sensitive variables might be correlated to some of the other variables, and therefore they can still affect decisions, to the detriment of protected groups such as minorities or women.

In this section, we are going to work with the COMPAS modeling dataset as provided by ProPublica. We are going to check for racial bias, and then create a model to remove it. You can find the original analysis by ProPublica at https://github.com/propublica/compas-analysis.

Getting ready

Before we can start, we'll first download the data, mention issues in preprocessing, and install the libraries.

Let's get the data:

!wget https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv
import pandas as pd
date_cols = [
    'compas_screening_date', 'c_offense_date',
    'c_arrest_date', 'r_offense_date', 
    'vr_offense_date', 'screening_date',
    'v_screening_date', 'c_jail_in',
    'c_jail_out', 'dob', 'in_custody', 
    'out_custody'
]
data = pd.read_csv(
    'compas-scores-two-years.csv',
    parse_dates=date_cols
)

Each row represents the risk of violence and the risk of recidivism scores for an inmate. The final column, two_year_recid, indicates our target.

ProPublica compiled their dataset from different sources, which they matched up according to the names of offenders:

Criminal records from the Broward County Clerk's Office website
Public incarceration records from the Florida Department of Corrections website
COMPAS scores, which they obtained through a public record information request

We can highlight a few issues in the dataset:

The column race is a protected category. It should not be used as a feature for model training, but as a control.
There are full names in the dataset, which will not be useful, or might even give away the ethnicity of the inmates.
There are case numbers in the dataset. These will likely not be useful for training a model, although they might have some target leakage in the sense that increasing case numbers might give an indication of the time, and there could be a drift effect in the targets over time.
There are missing values. We will need to carry out imputation.
There are date stamps. These will probably not be useful and might even come with associated problems (see point 3). However, we can convert these features into UNIX epochs, which indicates the number of seconds that have elapsed since 1970, and then calculate time periods between date stamps, for example, by repurposing NumericDifferenceTransformer that we saw in an earlier recipe. We can then use these periods as model features rather than the date stamps.
We have several categorical variables.
The charge description (c_charge_desc) might need to be cleaned up.

Mathias Barenstein has pointed out (https://arxiv.org/abs/1906.04711) a data processing error in ProPublica's cutoff that resulted in keeping 40% more recidivists than they should have. We'll apply his correction to the two-year cutoff:

import datetime
indexes = data.compas_screening_date <= pd.Timestamp(datetime.date(2014, 4, 1))
assert indexes.sum() == 6216
data = data[indexes]

We will use a few libraries in this recipe, which can be installed as follows:

!pip install category-encoders

category-encoders is a library that provides functionality for categorical encoding beyond what scikit-learn provides.

How to do it...

Let's get some basic terminology out of the way first. We need to come up with metrics for fairness. But what does fairness (or, if we look at unfairness, bias) mean?

Fairness can refer to two very different concepts:

[equal opportunity]: There should be no difference in the relative ratios of predicted by the model versus actually true; or
[equal outcome]: There should be no difference between model outcomes at all.

The first is also called equal odds, while the latter refers to equal false positive rates. While equal opportunity means that each group should be given the same chance regardless of their group, the equal outcome strategy implies that the underperforming group should be given more lenience or chances relative to the other group(s).

We'll go with the idea of false positive rates, which intuitively appeals, and which is enshrined in law in many jurisdictions in the case of equal employment opportunities. We'll provide a few resources about these terms in the See also section.

Therefore, the logic for the impact calculation is based on values in the confusion matrix, most importantly, false positives, which we've just mentioned. These cases are predicted positive even though they are actually negative; in our case, people predicted as reoffenders, who are not reoffenders. Let's write a function for this:

def confusion_metrics(actual, scores, threshold):
    y_predicted = scores.apply(
        lambda x: x >= threshold
    ).values
    y_true = actual.values
    TP = (
        (y_true==y_predicted) & 
        (y_predicted==1)
    ).astype(int)
    FP = (
        (y_true!=y_predicted) &
        (y_predicted==1)
    ).astype(int)
    TN = (
        (y_true==y_predicted) &
        (y_predicted==0)
    ).astype(int)
    FN = (
        (y_true!=y_predicted) &
        (y_predicted==0)
    ).astype(int)
    return TP, FP, TN, FN

We can now use this function in order to summarize the impact on particular groups with this code:

def calculate_impacts(data, sensitive_column='race', recid_col='is_recid', score_col='decile_score.1', threshold=5.0):
    if sensitive_column == 'race':
      norm_group = 'Caucasian'
    elif sensitive_column == 'sex':
      norm_group = 'Male'
    else:
      raise ValueError('sensitive column not implemented')
    TP, FP, TN, FN = confusion_metrics(
        actual=data[recid_col],
        scores=data[score_col],
        threshold=threshold
    )
    impact = pd.DataFrame(
        data=np.column_stack([
              FP, TN, FN, TN,
              data[sensitive_column].values, 
              data[recid_col].values,
              data[score_col].values / 10.0
             ]),
        columns=['FP', 'TP', 'FN', 'TN', 'sensitive', 'reoffend', 'score']
    ).groupby(by='sensitive').agg({
        'reoffend': 'sum', 'score': 'sum',
        'sensitive': 'count', 
        'FP': 'sum', 'TP': 'sum', 'FN': 'sum', 'TN': 'sum'
    }).rename(
        columns={'sensitive': 'N'}
    )

    impact['FPR'] = impact['FP'] / (impact['FP'] + impact['TN'])
    impact['FNR'] = impact['FN'] / (impact['FN'] + impact['TP'])
    impact['reoffend'] = impact['reoffend'] / impact['N']
    impact['score'] = impact['score'] / impact['N']
    impact['DFP'] = impact['FPR'] / impact.loc[norm_group, 'FPR']
    impact['DFN'] = impact['FNR'] / impact.loc[norm_group, 'FNR']
    return impact.drop(columns=['FP', 'TP', 'FN', 'TN'])

This first calculates the confusion matrix with true positives and false negatives, and then encodes the adverse impact ratio (AIR), known in statistics also as the Relative Risk Ratio (RRR). Given any performance metric, we can write the following:

This expresses an expectation that the metric for the protected group (African-Americans) should be the same as the metric for the norm group (Caucasians). In this case, we'll get 1.0. If the metric of the protected group is more than 20 percentage points different to the norm group (that is, lower than 0.8 or higher than 1.2), we'll flag it as a significant discrimination.

Norm group: a norm group, also known as a standardization sample or norming group, is a sample of the dataset that represents the population to which the statistic is intended to be compared. In the context of bias, its legal definition is the group with the highest success, but in some contexts, the entire dataset or the most frequent group are taken as the baseline instead. Pragmatically, we take the white group, since they are the biggest group, and the group for which the model works best.

In the preceding function, we calculate the false positive rates by sensitive group. We can then check whether the false positive rates for African-Americans versus Caucasians are disproportionate, or rather whether the false positive rates for African-Americans are much higher. This would mean that African-Americans get flagged much more often as repeat offenders than they should be. We find that this is indeed the case:

A short explanation about this table follows:

reoffend: frequencies of reoffending
score: average score for the group
N: the total number of people in the group
FPR: false positive rates
FNR: false negative rates
DFP: disproportionate false positive
DFN: disproportionate false negative

The last FPR and FNR columns together can give an idea about the general quality of the model. If both are high, the model just doesn't perform well for the particular group. The last two columns express the adverse impact ratio of FPR and FNR ratios, respectively, which is what we'll mostly focus on. We need to reduce the racial bias in the model by reducing the FPR of African-Americans to a tolerable level.

Let's do some preprocessing and then we'll build the model:

from sklearn.feature_extraction.text import CountVectorizer
from category_encoders.one_hot import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

charge_desc = data['c_charge_desc'].apply(lambda x: x if isinstance(x, str) else '')
count_vectorizer = CountVectorizer(
    max_df=0.85, stop_words='english',
    max_features=100, decode_error='ignore'
)
charge_desc_features = count_vectorizer.fit_transform(charge_desc)

one_hot_encoder = OneHotEncoder()
charge_degree_features = one_hot_encoder.fit_transform(
    data['c_charge_degree']
)

data['race_black'] = data['race'].apply(lambda x: x == 'African-American').astype(int)
stratification = data['race_black'] + (data['is_recid']).astype(int) * 2

CountVectorizer counts a vocabulary of words, indicating how many times each word is used. This is called a bag-of-words representation, and we apply it to the charge description column. We exclude English stop words, which are very common words such as prepositions (such as on or at) and personal pronouns (for example I or me); we also limit the vocabulary to 100 words and words that don't appear in more than 85% of the fields.

We apply dummy encoding (one-hot encoding) to the charge degree.

Why do we use two different transformations? Basically, the description is a textual description of why someone was charged with a crime. Every field is different. If we used one-hot encoding, every field would get their own dummy variable, and we wouldn't be able to see any commonalities between fields.

In the end, we create a new variable for stratification in order to make sure that we have similar proportions in the training and test datasets for both recidivism (our target variable) and whether someone is African-American. This will help us to calculate metrics to check for discrimination:

y = data['is_recid']
X = pd.DataFrame(
    data=np.column_stack(
        [data[['juv_fel_count', 'juv_misd_count',
 'juv_other_count', 'priors_count', 'days_b_screening_arrest']], 
          charge_degree_features, 
          charge_desc_features.todense()
        ]
    ),
    columns=['juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest'] \
    + one_hot_encoder.get_feature_names() \
    + count_vectorizer.get_feature_names(),
    index=data.index
)
X['jailed_days'] = (data['c_jail_out'] - data['c_jail_in']).apply(lambda x: abs(x.days))
X['waiting_jail_days'] = (data['c_jail_in'] - data['c_offense_date']).apply(lambda x: abs(x.days))
X['waiting_arrest_days'] = (data['c_arrest_date'] - data['c_offense_date']).apply(lambda x: abs(x.days))
X.fillna(0, inplace=True)


columns = list(X.columns)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33,
    random_state=42,
    stratify=stratification
)  # we stratify by black and the target

We do some data engineering, deriving variables to record how many days someone has spent in jail, has waited for a trial, or has waited for an arrest.

We'll build a neural network model using jax similar to the one we've encountered in the Classifying in scikit-learn, Keras, and PyTorch recipe in Chapter 1, Getting Started with Artificial Intelligence in Python. This time, we'll do a fully fledged implementation:

import jax.numpy as jnp
from jax import grad, jit, vmap, ops, lax
import numpy.random as npr
import numpy as onp
import random
from tqdm import trange
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import StandardScaler


class JAXLearner(ClassifierMixin):
  def __init__(self, layer_sizes=[10, 5, 1], epochs=20, batch_size=500, lr=1e-2):
    self.params = self.construct_network(layer_sizes)
    self.perex_grads = jit(grad(self.error))
    self.epochs = epochs
    self.batch_size = batch_size
    self.lr = lr

  @staticmethod
  def construct_network(layer_sizes=[10, 5, 1]):
    '''Please make sure your final layer corresponds to targets in dimensions.
    '''
    def init_layer(n_in, n_out):
      W = npr.randn(n_in, n_out)
      b = npr.randn(n_out,)
      return W, b
      
    return list(map(init_layer, layer_sizes[:-1], layer_sizes[1:]))

  @staticmethod
  def sigmoid(X):  # or tanh
    return 1/(1+jnp.exp(-X))

  def _predict(self, inputs):
    for W, b in self.params:
      outputs = jnp.dot(inputs, W) + b
      inputs = self.sigmoid(outputs)
    return outputs

  def predict(self, inputs):
    inputs = self.standard_scaler.transform(inputs)
    return onp.asarray(self._predict(inputs))

  @staticmethod
  def mse(preds, targets, other=None):
    return jnp.sqrt(jnp.sum((preds - targets)**2))

  @staticmethod
  def penalized_mse(preds, targets, sensitive):
    err = jnp.sum((preds - targets)**2)
    err_s = jnp.sum((preds * sensitive - targets * sensitive)**2)
    penalty = jnp.clip(err_s / err, 1.0, 2.0)
    return err * penalty

  def error(self, params, inputs, targets, sensitive):
      preds = self._predict(inputs)
      return self.penalized_mse(preds, targets, sensitive)

  def fit(self, X, y, sensitive):
    self.standard_scaler = StandardScaler()
    X = self.standard_scaler.fit_transform(X)
    N = X.shape[0]
    indexes = list(range(N))
    steps_per_epoch = N // self.batch_size

    for epoch in trange(self.epochs, desc='training'):
        random.shuffle(indexes)
        index_offset = 0
        for step in trange(steps_per_epoch, desc='iteration'):
            grads = self.perex_grads(
                self.params, 
                X[indexes[index_offset:index_offset+self.batch_size], :], 
                y[indexes[index_offset:index_offset+self.batch_size]],
                sensitive[indexes[index_offset:index_offset+self.batch_size]]
            )
            # print(grads)
            self.params = [(W - self.lr * dW, b - self.lr * db)
                      for (W, b), (dW, db) in zip(self.params, grads)]
            index_offset += self.batch_size

This is a scikit-learn wrapper of a JAX neural network. For scikit-learn compatibility, we inherit from ClassifierMixin and implement fit() and predict(). The most important part here is the penalized MSE method, which, in addition to model predictions and targets, takes into account a sensitive variable.

Let's train it and check the performance. Please note that we feed in X, y, and sensitive_train, which we define as the indicator variable for African-American for the training dataset:

sensitive_train = X_train.join(
    data, rsuffix='_right'
)['race_black']
jax_learner = JAXLearner([X.values.shape[1], 100, 1])
jax_learner.fit(
    X_train.values,
    y_train.values,
    sensitive_train.values
)

We visualize the statistics as follows:

X_predicted = pd.DataFrame(
    data=jax_learner.predict(
        X_test.values
    ) * 10,
    columns=['score'], 
    index=X_test.index
).join(
    data[['sex', 'race', 'is_recid']], 
    rsuffix='_right'
)
calculate_impacts(X_predicted, score_col='score')

This is the table we get:

We can see that the disproportionate false positive rate for African-Americans is very close to (and even lower than) 1.0, which is what we wanted. The test set is small and doesn't contain enough samples for Asians and native Americans, so we can't calculate meaningful statistics for those groups. We could, however, extend our approach to encompass these two groups as well if we wanted to ensure that these two groups had equal false positive rates.

How it works...

The keys for this to work are custom objective functions or loss functions. This is far from straightforward in scikit-learn, although we will show an implementation in the following section.

Generally, there are different possibilities for implementing your own cost or loss functions.

LightGBM, Catboost, and XGBoost each provide an interface with many loss functions and the ability to define custom loss functions.
PyTorch and Keras (TensorFlow) provide an interface.
You can implement your model from scratch (this is what we've done in the main recipe).

Scikit-learn generally does not provide a public API for defining your own loss functions. For many algorithms, there is only a single choice, and sometimes there are a couple of alternatives. The rationale in the case of split criteria with trees is that loss functions have to be performant, and only Cython implementations will guarantee this. This is only available in a non-public API, which means it is more difficult to use.

Finally, when there's no (straightforward) way to implement a custom loss, you can wrap your algorithms in a general optimization scheme such as genetic algorithms.

In neural networks, as long as you provide a differentiable loss function, you can plug in anything you want.

Basically, we were able to encode the adverse impact as a penalty term with the Mean Squared Error (MSE) function. It is based on the MSE that we've mentioned before, but has a penalty term for adverse impact. Let's look again at the loss function:

  @staticmethod
  def penalized_mse(preds, targets, sensitive):
    err = jnp.sum((preds - targets)**2)
    err_s = jnp.sum((preds * sensitive - targets * sensitive)**2)
    penalty = jnp.clip(err_s / err, 1.0, 2.0)
    return err * penalty

The first thing to note is that instead of two variables, we pass three variables. sensitive is the variable relevant to the adverse impact, indicating if we have a person from a protected group.

The calculation works as follows:

We calculate the MSE overall, err, from model predictions and targets.
We calculate the MSE for the protected group, err_s.
We take the ratio of the MSE for the protected group over the MSE overall (AIR) and limit it to between 1.0 and 2.0. We don't want values lower than 1, because we are only interested in the AIR if it's negatively affecting the protected group.
We then multiply AIR by the overall MSE.

As for 2, the MSE can simply be calculated by multiplying the predictions and targets, each by sensitive. That would cancel out all points, where sensitive is equal to 0.

As for 4, it might seem that this would cancel out the overall error, but we see that it actually seems to work. We probably could have added the two terms as well to give both errors a similar importance.

We use the autograd functionality in Jax to differentiate this.

There's more...

In the following, we'll use the non-public scikit-learn API to implement a custom split criterion for decision trees. We'll use this to train a random forest model with the COMPAS dataset:

This extends the implementation of the Hellinger criterion by Evgeni Dubov (https://github.com/EvgeniDubov/hellinger-distance-criterion).

%%cython

from sklearn.tree._criterion cimport ClassificationCriterion
from sklearn.tree._criterion cimport SIZE_t

import numpy as np
cdef double INFINITY = np.inf

from libc.math cimport sqrt, pow
from libc.math cimport abs


cdef class PenalizedHellingerDistanceCriterion(ClassificationCriterion):    
    
    cdef double proxy_impurity_improvement(self) nogil:
        cdef double impurity_left
        cdef double impurity_right
        
        self.children_impurity(&impurity_left, &impurity_right)
       
        return impurity_right + impurity_left

    cdef double impurity_improvement(self, double impurity) nogil:
        cdef double impurity_left
        cdef double impurity_right

        self.children_impurity(&impurity_left, &impurity_right)
        return impurity_right + impurity_left
        
    cdef double node_impurity(self) nogil:
        cdef SIZE_t* n_classes = self.n_classes
        cdef double* sum_total = self.sum_total
        cdef double hellinger = 0.0
        cdef double sq_count
        cdef double count_k
        cdef SIZE_t k
        cdef SIZE_t c

        for k in range(self.n_outputs):
            for c in range(n_classes[k]):
                hellinger += 1.0

        return hellinger / self.n_outputs

    cdef void children_impurity(self, double* impurity_left,
                                double* impurity_right) nogil:
        cdef SIZE_t* n_classes = self.n_classes
        cdef double* sum_left = self.sum_left
        cdef double* sum_right = self.sum_right
        cdef double hellinger_left = 0.0
        cdef double hellinger_right = 0.0
        cdef double count_k1 = 0.0
        cdef double count_k2 = 0.0

        cdef SIZE_t k
        cdef SIZE_t c

        # stop splitting in case reached pure node with 0 samples of second class
        if sum_left[1] + sum_right[1] == 0:
            impurity_left[0] = -INFINITY
            impurity_right[0] = -INFINITY
            return
        
        for k in range(self.n_outputs):
            if(sum_left[0] + sum_right[0] > 0):
                count_k1 = sqrt(sum_left[0] / (sum_left[0] + sum_right[0]))
            if(sum_left[1] + sum_right[1] > 0):
                count_k2 = sqrt(sum_left[1] / (sum_left[1] + sum_right[1]))

            hellinger_left += pow((count_k1  - count_k2), 2)
            
            if(sum_left[0] + sum_right[0] > 0):    
                count_k1 = sqrt(sum_right[0] / (sum_left[0] + sum_right[0]))
            if(sum_left[1] + sum_right[1] > 0):
                count_k2 = sqrt(sum_right[1] / (sum_left[1] + sum_right[1]))

            if k==0:
              hellinger_right += pow((count_k1  - count_k2), 2)
            else:
              hellinger_right -= pow((count_k1  - count_k2), 2)
        
        impurity_left[0]  = hellinger_left  / self.n_outputs
        impurity_right[0] = hellinger_right / self.n_outputs

Let's use this for training and test it:

ensemble = [
    DecisionTreeClassifier(
      criterion=PenalizedHellingerDistanceCriterion(
        2, np.array([2, 2], dtype='int64')
      ),
      max_depth=100
    ) for i in range(100)
]
for model in ensemble:
    model.fit(
        X_train,
        X_train.join(
            data,
            rsuffix='_right'
        )[['is_recid', 'race_black']]
    )
Y_pred = np.array(
    [model.predict(X_test) for model in
     ensemble]
)
predictions2 = Y_pred.mean(axis=0)

This gives us an AUC of 0.62:

We can see that, although we came a long way, we didn't completely remove all bias. 30% (DFP for African-Americans) would still be considered unacceptable. We could try different refinements or sampling strategies to improve the result. Unfortunately, we wouldn't be able to use this model in practice.

As an example, a way to address this is to do model selection within the random forest. Since each of the trees would have their own way of classifying people, we could calculate the adverse impact statistics for each of the individual trees or combinations of trees. We could remove trees until we are left with a subset of trees that satisfy our adverse impact conditions. This is beyond the scope of this chapter.

Forecasting CO₂ time series

In this recipe, we will test out some well-known models (ARIMA, SARIMA) and signal decomposition by forecasting using Facebook's Prophet library on the time series data, in order to check their performance at forecasting our time series of CO₂ values.

Getting ready

In order to prepare for this recipe, we'll install libraries and download a dataset.

We'll use the statsmodels package and prophet:

pip install statsmodels fbprophet

We will analyze the CO₂ concentration data in this recipe. You can see the data loading in the notebook on GitHub accompanying this recipe, or in the scikit-learn Gaussian process regression (GPR) example regarding Mauna Loa CO₂ data: https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_co2.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-co2-py

This dataset is one of the earliest recordings available on atmospheric recordings of CO₂. As it will be later observed, this data follows a sinusoidal pattern, with the CO₂ concentration rising in winter and falling in the summer owing to the decreasing quantity of plants and vegetation in the winter season:

X,y = load_mauna_loa_atmospheric_co2()

The dataset contains the average CO₂ concentration measured at Mauna Loa Observatory in Hawaii from 1958 to 2001. We will model the CO₂ concentration with respect to that.

How to do it...

Now we'll get to forecasting our time series of CO₂ data. We'll first explore the dataset, and then we'll apply the ARIMA and SARIMA techniques.

Let's have a look at the time series:

df_CO2 = pd.DataFrame(data = X, columns = ['Year'])
df_CO2['CO2 in ppm'] = y
lm = sns.lmplot(x='Year', y='CO2 in ppm', data=df_CO2, height=4, aspect=4)
fig = lm.fig 
fig.suptitle('CO2 conc. mauna_loa 1958-2001', fontsize=12)

Here's the graph:

The script here shows the time series seasonal decomposition of the CO₂ data, showing a clear seasonal variation in the CO₂ concentration, which can be traced back to the biology:

import statsmodels.api as stmd
d = stm.datasets.co2.load_pandas()
co2 = d.data
co2.head()
y = co2['co2']
y = y.fillna(
    y.interpolate()
)  # Fill missing values by interpolation

Now that we have preprocessed data for decomposition, let's go ahead with it:

from pylab import rcParams
rcParams['figure.figsize'] = 11, 9
result = stm.tsa.seasonal_decompose(y, model='additive')
pd.plotting.register_matplotlib_converters()
result.plot()
plt.show()

Here, we see the decomposition: the observed time series, its trend, seasonal components, and what remains unexplained, the residual element:

Now, let's analyze the time series.

Analyzing time series using ARIMA and SARIMA

We will fit ARIMA and SARIMA models to the dataset.

We'll define our two models and apply them to each point in the test dataset. Here, we iteratively fit the model on all the points and predict the next point, as a one-step-ahead.

First, we split the data:

# taking a 90/10 split between training and testing:
future = int(len(y) * 0.9)
print('number of train samples: %d test samples %d' (future, len(y)-future)
)
train, test = y[:future], y[future:]

This leaves us with 468 samples for training and 53 for testing.

Next, we define the models:

from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

def get_arima_model(history, order=(5, 1, 0)):
    return ARIMA(history, order=order)

def get_sarima_model(
    history,
    order=(5, 1, 1),
    seasonal_order=(0, 1, 1, 4)
    ):
    return SARIMAX(
        history,
        order=order,
        enforce_stationarity=True,
        enforce_invertibility=False,
        seasonal_order=seasonal_order
    )

Then we train the models:

from sklearn.metrics import mean_squared_error

def apply_model(train, test, model_fun=get_arima_model):
  '''we just roll with the model and apply it to successive
  time steps
  '''
  history = list(train)
  predictions = []
  for t in test:
    model = model_fun(history).fit(disp=0)
    output = model.forecast()
    predictions.append(output[0])
    history.append(t)
  error = mean_squared_error(test, predictions)
  print('Test MSE: %.3f' % error)
  #print(model.summary().tables[1])
  return predictions, error

predictions_arima, error_arima = apply_model(train, test)
predictions_sarima, error_sarima = apply_model(
    train, test, get_sarima_model
)

We get an MSE in the test of 0.554 and 0.405 for ARIMA and SARIMA models, respectively. Let's see how the models fit graphically:

We could do a parameter exploration using the Akaike information criterion (AIC), which expresses the quality of the model relative to the number of parameters in the model. The model object returned by the fit function in statsmodels includes the AIC, so we could do a grid search over a range of parameters, and then select the model that minimizes the AIC.

How it works...

Time series data is a collection of observations x(t), where each data point is recorded at time t. In most cases, time is a discrete variable, that is, . We are looking at forecasting, which is the task of predicting future values based on the previous observations in the time series.

In order to explain the models that we've used, ARIMA and SARIMA, we'll have to go step by step, and explain each in turn:

Autoregression (AR)
Moving Average (MA)
Autoregressive Moving Average (ARMA)
Autoregressive Integrated Moving Average (ARIMA) and
Seasonal Autoregressive Integrated Moving Average (SARIMA)

ARIMA and SARIMA are based on the ARMA model, which is an autoregressive moving average model. Let's briefly go through some of the basics.

ARMA is a linear model, defined in two parts. First, the autoregressive linear model:

Here, are parameters and is a constant, is white noise, and is the order of the model (or the window size of the linear model). The second part of ARMA is the moving average, and this is again a linear regression, but of non-observable, lagged error terms, defined as follows:

Here, is the order of the moving average, are the parameters, and the expectation or the mean of the time series . The ARMA(p, q) model is then the composite of both of these models, AR(p) and MA(q):

The fitting procedure is a bit involved, particularly because of the MA part. You can read up on the Box-Jenkins method on Wikipedia if you are interested: https://en.wikipedia.org/wiki/Box%E2%80%93Jenkins_method

There are a few limitations to note, however. The time series has to be the following:

Stationary: Basically, mean and covariance across observations have to be constant over time.
Nonperiodic: Although bigger values for p and q could be used to model seasonality, it is not part of the model.
Linear: Each value for can be modeled as a linear combination of previous values and error terms.

There are different extensions of ARMA to address the first two limitations, and that's where ARIMA and SARIMA come in.

ARIMA (p, d, q) stands for autoregressive integrated moving average. It comes with three parameters:

p: The number of autoregressive terms (autoregression)
d: The number of non-seasonal differences needed for stationarity (integration)
q: The number of lagged forecast errors (moving average)

The integration refers to differencing. In order to stabilize the mean, we can take the difference between consecutive observations. This can also remove a trend or eliminate seasonality. It can be written as follows:

This can be repeated several times, and this is what the parameter d describes that ARIMA comes with. Please note that ARIMA can handle drifts and non-stationary time series. However, it is still unable to handle seasonality.

SARIMA stands for seasonal ARIMA, and is an extension of ARIMA in that it also takes into account the seasonality of the data.

SARIMA(p, d, q)(P, D, Q)m contains the non-seasonal parameters of ARIMA and additional seasonal parameters. Uppercase letters P, D, and Q annotate the seasonal moving average and autoregressive components, where m is the number of periods in each season. Often this is the number of periods in a year; for example m=4 would stand for a quarterly seasonal effect, meaning that D stands for seasonal differencing between observations Xt and Xt-m, and P and Q stand for linear models with backshifts of m.

In Python, the statsmodels library provides a method to perform the decomposition of the signal based on the seasonality of the data.

There's more...

Prophet is a library provided by Facebook for forecasting time series data. It works on an additive model and fits non-linear models. The library works best when the data has strong seasonal effects and has enough historic trends available.

Let's see how we can use it:

from fbprophet import Prophet

train_df = df_CO2_fb['1958':'1997']
test_df = df_CO2_fb['1998':'2001']
train_df = train_df.reset_index()
test_df = test_df.reset_index()Co2_model= Prophet(interval_width=0.95)

Co2_model.fit(train_df)
train_forecast = Co2_model.predict(train_df)
test_forecast = Co2_model.predict(test_df)
fut = Co2_model.make_future_DataFrame(periods=12, freq='M')
forecast_df = Co2_model.predict(fut)
Co2_model.plot(forecast_df)

Here are our model forecasts:

We get a similar decomposition as before with the ARIMA/SARIMA models, namely, the trend and the seasonal components:

The yearly variation nicely shows the rise and fall of the CO₂ concentration according to the seasons. The trend clearly goes up over time, which could be worrisome if you think about global warming.