Following the tasters with scikit-learn, Keras, and PyTorch in the previous chapter, in this chapter, we will move on to more end-to-end examples. These examples are more advanced in the sense that they include more complex transformations and model types.

We'll be predicting partner choices with sklearn, where we'll implement a lot of custom transformer steps and more complicated machine learning pipelines. We'll then predict house prices in PyTorch and visualize feature and neuron importance. After that, we will perform active learning to decide customer values together with online learning in sklearn. In the well-known case of repeat offender prediction, we'll build a model without racial bias. Last, but not least, we'll forecast time series of CO_{2} levels.

**Online learning**in this context (as opposed to internet-based learning) refers to a model update strategy that incorporates training data that comes in sequentially. This can be useful in cases where the dataset is very big (often the case with images, videos, and texts) or where it's important to keep the model up to date given the changing nature of the data.

In many of these recipes, we've shortened the description to the most salient details in order to highlight particular concepts. For the full details, please refer to the notebooks on GitHub.

In this chapter, we'll be covering the following recipes:

- Transforming data in scikit-learn
- Predicting house prices in PyTorch
- Live decisioning customer values
- Battling algorithmic bias
- Forecasting CO
_{2}time series

# Technical requirements

The code and notebooks for this chapter are available on GitHub at https://github.com/PackPublishing/Artificial-Intelligence-with-Python-Cookbook/tree/master/chapter02.

# Transforming data in scikit-learn

In this recipe, we will be building more complex pipelines using mixed-type columnar data. We'll use a speed dating dataset that was published in 2006 by Fisman *et al.*: https://doi.org/10.1162/qjec.2006.121.2.673

Perhaps this recipe will be informative in more ways than one, and we'll learn something useful about the mechanics of human mating choices.

The dataset description on the OpenML website reads as follows:

**first date**with every other participant of the opposite sex. At the end of their 4 minutes, participants were asked whether they would like to see their date again. They were also asked to rate their date on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include demographics, dating habits, self-perception across key attributes, beliefs in terms of what others find valuable in a mate, and lifestyle information.

The problem is to predict mate choices from what we know about participants and their matches. This dataset presents some challenges that can serve an illustrative purpose:

- It contains 123 different features, of different types:
- Categorical
- Numerical
- Range features

It also contains the following:

- Some missing values
- Target imbalance

On the way to solving this problem of predicting mate choices, we will build custom encoders in scikit-learn and a pipeline comprising all features and their preprocessing steps.

The primary focus in this recipe will be on pipelines and transformers. In particular, we will build a custom transformer for working with range features and another one for numerical features.

## Getting ready

We'll need the following libraries for this recipe. They are as follows:

- OpenML to download the dataset
`openml_speed_dating_pipeline_steps`to use our custom transformer`imbalanced-learn`to work with imbalanced classes`shap`to show us the importance of features

In order to install them, we can use `pip` again:

pip install -q openml openml_speed_dating_pipeline_steps==0.5.5 imbalanced_learn category_encoders shap

In order to retrieve the data, we will use the OpenML Python API. The `get_dataset()` method will download the dataset; with `get_data()`, we can get pandas DataFrames for features and target, and we'll conveniently get the information on categorical and numerical feature types:

import openml

dataset = openml.datasets.get_dataset(40536)

X, y, categorical_indicator, _ = dataset.get_data(

dataset_format='DataFrame',

target=dataset.default_target_attribute

)

categorical_features = list(X.columns[categorical_indicator]) numeric_features = list(

X.columns[[not(i) for i in categorical_indicator]]

)

`numpy.nan`, which lets us skip this conversion. You can see this preprocessor on GitHub if you are interested: https://github.com/benman1/OpenML-Speed-Dating

Alternatively, you can use a download link from the OpenML dataset web page at https://www.openml.org/data/get_csv/13153954/speeddating.arff.

With the dataset loaded, and the libraries installed, we are ready to start cracking.

## How to do it...

Pipelines are a way of describing how machine learning algorithms, including preprocessing steps, can follow one another in a sequence of transformations on top of the raw dataset before applying a final predictor. We will see examples of these concepts in this recipe and throughout this book.

A few things stand out pretty quickly looking at this dataset. We have a lot of categorical features. So, for modeling, we will need to encode them numerically, as in the *Modeling and predicting in Keras* recipe in Chapter 1, *Getting Started with Artificial Intelligence in Python*.

### Encoding ranges numerically

Some of these are actually encoded ranges. This means these are ordinal, in other words, they are categories that are ordered; for example, the `d_interests_correlate` feature contains strings like these:

[[0-0.33], [0.33-1], [-1-0]]

If we were to treat these ranges as categorical variables, we'd lose the information about the order, and we would lose information about how different two values are. However, if we convert them to numbers, we will keep this information and we would be able to apply other numerical transformations on top.

We are going to implement a transformer to plug into an sklearn pipeline in order to convert these range features to numerical features. The basic idea of the conversion is to extract the upper and lower bounds of these ranges as follows:

def encode_ranges(range_str):

splits = range_str[1:-1].split('-')

range_max = splits[-1]

range_min = '-'.join(splits[:-1])

return range_min, range_max

examples = X['d_interests_correlate'].unique()

[encode_ranges(r) for r in examples]

We'll see this for our example:

[('0', '0.33'), ('0.33', '1'), ('-1', '0')]

In order to get numerical features, we can then take the mean between the two bounds. As we've mentioned before, on OpenML, not only are results shown, but also the models are transparent. Therefore, if we want to submit our model, we can only use published modules. We created a module and published it in the `pypi` Python package repository, where you can find the package with the complete code: https://pypi.org/project/openml-speed-dating-pipeline-steps/.

Here is the simplified code for `RangeTransformer`:

from sklearn.base import BaseEstimator, TransformerMixin

import category_encoders.utils as util

class RangeTransformer(BaseEstimator, TransformerMixin):

def __init__(self, range_features=None, suffix='_range/mean', n_jobs=-1):

assert isinstance(range_features, list) or range_features is None

self.range_features = range_features

self.suffix = suffix

self.n_jobs = n_jobs

def fit(self, X, y=None):

return self

def transform(self, X, y=None):

X = util.convert_input(X)

if self.range_features is None:

self.range_features = list(X.columns)

range_data = pd.DataFrame(index=X.index)

for col in self.range_features:

range_data[str(col) + self.suffix] = pd.to_numeric(

self._vectorize(X[col])

)

self.feature_names = list(range_data.columns)

return range_data

def _vectorize(self, s):

return Parallel(n_jobs=self.n_jobs)(

delayed(self._encode_range)(x) for x in s

)

@staticmethod

@lru_cache(maxsize=32)

def _encode_range(range_str):

splits = range_str[1:-1].split('-')

range_max = float(splits[-1])

range_min = float('-'.join(splits[:-1]))

return sum([range_min, range_max]) / 2.0

def get_feature_names(self):

return self.feature_names

This is a shortened snippet of the custom transformer for ranges. Please see the full implementation on GitHub at https://github.com/benman1/OpenML-Speed-Dating.

Please pay attention to how the `fit()` and `transform()` methods are used. We don't need to do anything in the `fit()` method, because we always apply the same static rule. The `transfer()` method applies this rule. We've seen the examples previously. What we do in the `transfer()` method is to iterate over columns. This transformer also shows the use of the parallelization pattern typical to scikit-learn. Additionally, since these ranges repeat a lot, and there aren't so many, we'll use a cache so that, instead of doing costly string transformations, the range value can be retrieved from memory once the range has been processed once.

An important thing about custom transformers in scikit-learn is that they should inherit from `BaseEstimator` and `TransformerMixin`, and implement the `fit()` and `transform()`* *methods. Later on, we will require `get_feature_names()` so we can find out the names of the features generated.

### Deriving higher-order features

Let's implement another transformer. As you may have noticed, we have different types of features that seem to refer to the same personal attributes:

- Personal preferences
- Self-assessment
- Assessment of the other person

It seems clear that differences between any of these features could be significant, such as the importance of sincerity versus how sincere someone assesses a potential partner. Therefore, our next transformer is going to calculate the differences between numerical features. This is supposed to help highlight these differences.

These features are derived from other features, and combine information from two (or potentially more features). Let's see what the `NumericDifferenceTransformer` feature looks like:

import operator

class NumericDifferenceTransformer(BaseEstimator, TransformerMixin):

def __init__(self, features=None,

suffix='_numdist', op=operator.sub, n_jobs=-1

):

assert isinstance(

features, list

) or features is None

self.features = features

self.suffix = suffix

self.op = op

self.n_jobs = n_jobs

def fit(self, X, y=None):

X = util.convert_input(X)

if self.features is None:

self.features = list(

X.select_dtypes(include='number').columns

)

return self

def _col_name(self, col1, col2):

return str(col1) + '_' + str(col2) + self.suffix

def _feature_pairs(self):

feature_pairs = []

for i, col1 in enumerate(self.features[:-1]):

for col2 in self.features[i+1:]:

feature_pairs.append((col1, col2))

return feature_pairs

def transform(self, X, y=None):

X = util.convert_input(X)

feature_pairs = self._feature_pairs()

columns = Parallel(n_jobs=self.n_jobs)(

delayed(self._col_name)(col1, col2)

for col1, col2 in feature_pairs

)

data_cols = Parallel(n_jobs=self.n_jobs)(

delayed(self.op)(X[col1], X[col2])

for col1, col2 in feature_pairs

)

data = pd.concat(data_cols, axis=1)

data.rename(

columns={i: col for i, col in enumerate(columns)},

inplace=True, copy=False

)

data.index = X.index

return data

def get_feature_names(self):

return self.feature_names

This is a custom transformer that calculates differences between numerical features. Please refer to the full implementation in the repository of the OpenML-Speed-Dating library at https://github.com/benman1/OpenML-Speed-Dating.

This transformer has a very similar structure to `RangeTransformer`. Please note the parallelization between columns. One of the arguments to the `__init__()` method is the function that is used to calculate the difference. This is `operator.sub()` by default. The operator library is part of the Python standard library and implements basic operators as functions. The `sub()` function does what it sounds like:

import operator

operator.sub(1, 2) == 1 - 2

# True

This gives us a prefix or functional syntax for standard operators. Since we can pass functions as arguments, this gives us the flexibility to specify different operators between columns.

The `fit()` method this time just collects the names of numerical columns, and we'll use these names in the `transform()` method.

### Combining transformations

We will put these transformers together with `ColumnTransformer` and the pipeline. However, we'll need to make the association between columns and their transformations. We'll define different groups of columns:

range_cols = [

col for col in X.select_dtypes(include='category')

if X[col].apply(lambda x: x.startswith('[')

if isinstance(x, str) else False).any()

]

cat_columns = list(

set(X.select_dtypes(include='category').columns) - set(range_cols)

)

num_columns = list(

X.select_dtypes(include='number').columns

)

Now we have columns that are ranges, columns that are categorical, and columns that are numerical, and we can assign pipeline steps to them.

In our case, we put this together as follows, first in a preprocessor:

from imblearn.ensemble import BalancedRandomForestClassifier

from sklearn.feature_selection import SelectKBest, f_classif

from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import FunctionTransformer

import category_encoders as ce

import openml_speed_dating_pipeline_steps as pipeline_steps

preprocessor = ColumnTransformer(

transformers=[

('ranges', Pipeline(steps=[

('impute', pipeline_steps.SimpleImputerWithFeatureNames(strategy='constant', fill_value=-1)),

('encode', pipeline_steps.RangeTransformer())

]), range_cols),

('cat', Pipeline(steps=[

('impute', pipeline_steps.SimpleImputerWithFeatureNames(strategy='constant', fill_value='-1')),

('encode', ce.OneHotEncoder(

cols=None, # all features that it given by ColumnTransformer

handle_unknown='ignore',

use_cat_names=True

)

)

]), cat_columns),

('num', pipeline_steps.SimpleImputerWithFeatureNames(strategy='median'), num_columns),

],

remainder='drop', n_jobs=-1

)

And then we'll put the preprocessing in a pipeline, together with the estimator:

def create_model(n_estimators=100):

return Pipeline(

steps=[('preprocessor', preprocessor),

('numeric_differences', pipeline_steps.NumericDifferenceTransformer()),

('feature_selection', SelectKBest(f_classif, k=20)),

('rf', BalancedRandomForestClassifier(

n_estimators=n_estimators,

)

)]

)

Here is the performance in the test set:

from sklearn.metrics import roc_auc_score, confusion_matrix

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(

X, y,

test_size=0.33,

random_state=42,

stratify=y

)

clf = create_model(50)

clf.fit(X_train, y_train)

y_predicted = clf.predict(X_test)

auc = roc_auc_score(y_test, y_predicted)

print('auc: {:.3f}'.format(auc))

We get the following performance as an output:

auc: 0.779

This is a very good performance, as you can see comparing it to the leaderboard on OpenML.

## How it works...

It is time to explain basic scikit-learn terminology relevant to this recipe. Neither of these concepts corresponds to existing machine learning algorithms, but to composable modules:

- Transformer (in scikit-learn): A class that is derived from
`sklearn.base.TransformerMixin`; it has`fit()`and`transform()`methods. These involve preprocessing steps or feature selection. - Predictor: A class that is derived from either
`sklearn.base.ClassifierMixin`or`sklearn.base.RegressorMixin`; it has`fit()`and`predict()`methods. These are machine learning estimators, in other words, classifiers or regressors. - Pipeline: An interface that wraps all steps together and gives you a single interface for all steps of the transformation and the resulting estimator. A pipeline again has
`fit()`and`predict()`methods.

There are a few things to point out regarding our approach. As we said before, we have missing values, so we have to impute (meaning replace) missing values with other values. In this case, we replace missing values with -1. In the case of categorical variables, this will be a new category, while in the case of numerical variables, it will become a special value that the classifier will have to handle.

`ColumnTransformer` came with version 0.20 of scikit-learn and was a long-awaited feature. Since then, `ColumnTransformer` can often be seen like this, for example:

`from sklearn.compose import ColumnTransformer, make_column_transformer`

from sklearn.preprocessing import StandardScaler, OneHotEncoder

feature_preprocessing = make_column_transformer(

(StandardScaler(), ['column1', 'column2']),

(OneHotEncoder(), ['column3', 'column4', 'column5'])

)

`feature_preprocessing` can then be used as usual with the `fit()`, `transform()`, and `fit_transform()` methods:

`processed_features = feature_preprocessing.fit_transform(X)`

Here, `X` means our features.

Alternatively, we can put `ColumnTransformer` as a step into a pipeline, for example, like this:

from sklearn.pipeline import make_pipeline

from sklearn.linear_model import LogisticRegression

make_pipeline( feature_preprocessing, LogisticRegression()

)

Our classifier is a modified form of the random forest classifier. A random forest is a collection of decision trees, each trained on random subsets of the training data. The balanced random forest classifier (Chen *et al.*: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf) makes sure that each random subset is balanced between the two classes.

Since `NumericDifferenceTransformer` can provide lots of features, we will incorporate an extra step of model-based feature selection.

## There's more...

You can see the complete example with the speed dating dataset, a few more custom transformers, and an extended imputation class in the GitHub repository of the `openml_speed_dating_pipeline_steps` library and notebook, on GitHub at https://github.com/PacktPublishing/Artificial-Intelligence-with-Python-Cookbook/blob/master/chapter02/Transforming%20Data%20in%20Scikit-Learn.ipynb.

Both `RangeTransformer` and `NumericDifferenceTransformer` could also have been implemented using `FunctionTransformer` in scikit-learn.

`ColumnTransformer` is especially handy for pandas DataFrames or NumPy arrays since it allows the specification of different operations for different subsets of the features. However, another option is `FeatureUnion`, which allows concatenation of the results from different transformations. For yet another way to chain our operations together, please have a look at `PandasPicker` in our repository.

## See also

In this recipe, we used ANOVA f-values for univariate feature selection, which is relatively simple, yet effective. Univariate feature selection methods are usually simple filters or statistical tests that measure the relevance of a feature with regard to the target. There are, however, many different methods for feature selection, and scikit-learn implements a lot of them: https://scikit-learn.org/stable/modules/feature_selection.html.

# Predicting house prices in PyTorch

In this recipe, the aim of the problem is to predict house prices in Ames, Iowa, given 81 features describing the house, area, land, infrastructure, utilities, and much more. The Ames dataset has a nice combination of categorical and continuous features, a good size, and, perhaps most importantly, it doesn't suffer from problems of potential redlining or data entry like other, similar datasets, such as Boston Housing. We'll concentrate on the main aspects of PyTorch modeling here. We'll do online learning, analogous to Keras, in the *Modeling and predicting in Keras *recipe in Chapter 1, *Getting Started with Artificial Intelligence in Python*. If you want to see more details on some of the steps, please look at our notebook on GitHub.

As a little extra, we will also demonstrate neuron importance for the models developed in PyTorch. You can try out different network architectures in PyTorch or model types. The focus in this recipe is on the methodology, not an exhaustive search for the best solution.

## Getting ready

In order to prepare for the recipe, we need to do a few things. We'll download the data as in the previous recipe, *Transforming data in scikit-learn*, and perform some preprocessing by following these steps:

from sklearn.datasets import fetch_openml

data = fetch_openml(data_id=42165, as_frame=True)

You can see the full dataset description at OpenML: https://www.openml.org/d/42165.

Let's look at the features:

import pandas as pd

data_ames = pd.DataFrame(data.data, columns=data.feature_names)

data_ames['SalePrice'] = data.target

data_ames.info()

Here is the DataFrame information:

PyTorch and seaborn are installed by default in Colab. We will assume, even if you are working with your self-hosted install by now, that you'll have the libraries installed.

We'll use one more library, however, `captum`, which allows the inspection of PyTorch models for feature and neuron importance:

!pip install captum

There is one more thing. We'll assume you have a GPU available. If you don't have a GPU in your computer, we'd recommend you try this recipe on Colab. In Colab, you'll have to choose a runtime type with GPU.

After all these preparations, let's see how we can predict house prices.

## How to do it...

The Ames Housing dataset is a small- to mid-sized dataset (1,460 rows) with 81 features, both categorical and numerical. There are no missing values.

In the Keras recipe previously, we've seen how to scale the variables. Scaling is important here because all variables have different scales. Categorical variables need to be converted to numerical types in order to feed them into our model. We have the choice of one-hot encoding, where we create dummy variables for each categorical factor, or ordinal encoding, where we number all factors and replace the strings with these numbers. We could feed the dummy variables in like any other float variable, while ordinal encoding would require the use of embeddings, linear neural network projections that re-order the categories in a multi-dimensional space.

We take the embedding route here:

import numpy as np

from category_encoders.ordinal import OrdinalEncoder

from sklearn.preprocessing import StandardScaler

num_cols = list(data_ames.select_dtypes(include='float'))

cat_cols = list(data_ames.select_dtypes(include='object'))

ordinal_encoder = OrdinalEncoder().fit(

data_ames[cat_cols]

)

standard_scaler = StandardScaler().fit(

data_ames[num_cols]

)

X = pd.DataFrame(

data=np.column_stack([

ordinal_encoder.transform(data_ames[cat_cols]),

standard_scaler.transform(data_ames[num_cols])

]),

columns=cat_cols + num_cols

)

We go through the data analysis, such as correlation and distribution plots, in a lot more detail in the notebook on GitHub.

Now we can split the data into training and test sets, as we did in previous recipes. Here, we add a stratification of the numerical variable. This makes sure that different sections (five of them) are included at equal measure in both training and test sets:

np.random.seed(12)

from sklearn.model_selection import train_test_split

bins = 5

sale_price_bins = pd.qcut(

X['SalePrice'], q=bins, labels=list(range(bins))

)

X_train, X_test, y_train, y_test = train_test_split(

X.drop(columns='SalePrice'),

X['SalePrice'],

random_state=12,

stratify=sale_price_bins

)

Before going ahead, let's look at the importance of the features using a model-independent technique.

Before we run anything, however, let's make sure we are running on the GPU:

device = torch.device('cuda')

torch.backends.cudnn.benchmark = True

Let's build our PyTorch model, similar to the *Classifying in scikit-learn*, *Keras*, and *PyTorch* recipe in Chapter 1, *Getting Started with Artificial Intelligence in Python*.

We'll implement a neural network regression with batch inputs using PyTorch. This will involve the following steps:

- Converting data to torch tensors
- Defining the model architecture
- Defining the loss criterion and optimizer
- Creating a data loader for batches
- Running the training

Without further preamble, let's get to it:

- Begin by converting the data to torch tensors:

from torch.autograd import Variable

num_features = list(

set(num_cols) - set(['SalePrice', 'Id'])

)

X_train_num_pt = Variable(

torch.cuda.FloatTensor(

X_train[num_features].values

)

)

X_train_cat_pt = Variable(

torch.cuda.LongTensor(

X_train[cat_cols].values

)

)

y_train_pt = Variable(

torch.cuda.FloatTensor(y_train.values)

).view(-1, 1)

X_test_num_pt = Variable(

torch.cuda.FloatTensor(

X_test[num_features].values

)

)

X_test_cat_pt = Variable(

torch.cuda.LongTensor(

X_test[cat_cols].values

).long()

)

y_test_pt = Variable(

torch.cuda.FloatTensor(y_test.values)

).view(-1, 1)

This makes sure we load our numerical and categorical data into separate variables, similar to NumPy. If you mix data types in a single variable (array/matrix), they'll become objects. We want to get our numerical variables as floats, and the categorical variables as long (or int) indexing our categories. We also separate the training and test sets.

Clearly, an ID variable should not be important in a model. In the worst case, it could introduce a target leak if there's any correlation of the ID with the target. We've removed it from further processing.

- Define the model architecture:

class RegressionModel(torch.nn.Module):

def __init__(self, X, num_cols, cat_cols, device=torch.device('cuda'), embed_dim=2, hidden_layer_dim=2, p=0.5):

super(RegressionModel, self).__init__()

self.num_cols = num_cols

self.cat_cols = cat_cols

self.embed_dim = embed_dim

self.hidden_layer_dim = hidden_layer_dim

self.embeddings = [

torch.nn.Embedding(

num_embeddings=len(X[col].unique()),

embedding_dim=embed_dim

).to(device)

for col in cat_cols

]

hidden_dim = len(num_cols) + len(cat_cols) * embed_dim,

# hidden layer

self.hidden = torch.nn.Linear(torch.IntTensor(hidden_dim), hidden_layer_dim).to(device)

self.dropout_layer = torch.nn.Dropout(p=p).to(device)

self.hidden_act = torch.nn.ReLU().to(device)

# output layer

self.output = torch.nn.Linear(hidden_layer_dim, 1).to(device)

def forward(self, num_inputs, cat_inputs):

'''Forward method with two input variables -

numeric and categorical.

'''

cat_x = [

torch.squeeze(embed(cat_inputs[:, i] - 1))

for i, embed in enumerate(self.embeddings)

]

x = torch.cat(cat_x + [num_inputs], dim=1)

x = self.hidden(x)

x = self.dropout_layer(x)

x = self.hidden_act(x)

y_pred = self.output(x)

return y_pred

house_model = RegressionModel(

data_ames, num_features, cat_cols

)

Our activation function on the two linear layers (dense layers, in Keras terminology) is the **rectified linear unit activation** (**ReLU**) function. Please note that we couldn't have encapsulated the same architecture (easily) as a sequential model because of the different operations occurring on categorical and numerical types.

- Next, define the loss criterion and optimizer. We take the
**mean square error**(**MSE**) as the loss and stochastic gradient descent as our optimization algorithm:

criterion = torch.nn.MSELoss().to(device)

optimizer = torch.optim.SGD(house_model.parameters(), lr=0.001)

- Now, create a data loader to input a batch of data at a time:

data_batch = torch.utils.data.TensorDataset(

X_train_num_pt, X_train_cat_pt, y_train_pt

)

dataloader = torch.utils.data.DataLoader(

data_batch, batch_size=10, shuffle=True

)

We set a batch size of 10. Now we can do our training.

- Run the training!

Since this seems so much more verbose than what we saw in Keras in the *Classifying in scikit-learn, Keras, and PyTorch *recipe in Chapter 1, *Getting Started with Artificial Intelligence in Python*, we commented this code quite heavily. Basically, we have to loop over epochs, and within each epoch an inference is performance, an error is calculated, and the optimizer applies the adjustments according to the error.

This is the loop over epochs without the inner loop for training:

from tqdm.notebook import trange

train_losses, test_losses = [], []

n_epochs = 30

for epoch in trange(n_epochs):

train_loss, test_loss = 0, 0

# training code will go here:

# <...>

# print the errors in training and test:

if epoch % 10 == 0 :

print(

'Epoch: {}/{}\t'.format(epoch, 1000),

'Training Loss: {:.3f}\t'.format(

train_loss / len(dataloader)

),

'Test Loss: {:.3f}'.format(

test_loss / len(dataloader)

)

)

The training is performed inside this loop over all the batches of the training data. This looks as follows:

for (x_train_num_batch,

x_train_cat_batch,

y_train_batch) in dataloader:

# predict y by passing x to the model

(x_train_num_batch,

x_train_cat_batch, y_train_batch) = (

x_train_num_batch.to(device),

x_train_cat_batch.to(device),

y_train_batch.to(device)

)

pred_ytrain = house_model.forward(

x_train_num_batch, x_train_cat_batch

)

# calculate and print loss:

loss = torch.sqrt(

criterion(pred_ytrain, y_train_batch)

)

# zero gradients, perform a backward pass,

# and update the weights.

optimizer.zero_grad()

loss.backward()

optimizer.step()

train_loss += loss.item()

with torch.no_grad():

house_model.eval()

pred_ytest = house_model.forward(

X_test_num_pt, X_test_cat_pt

)

test_loss += torch.sqrt(

criterion(pred_ytest, y_test_pt)

)

train_losses.append(train_loss / len(dataloader))

test_losses.append(test_loss / len(dataloader))

This is the output we get. TQDM provides us with a helpful progress bar. At every tenth epoch, we print an update to show training and validation performance:

Please note that we take the square root of `nn.MSELoss` because `nn.MSELoss` in PyTorch is defined as follows:

`((input-target)**2).mean()`

Let's plot how our model performs for training and validation datasets during training:

plt.plot(

np.array(train_losses).reshape((n_epochs, -1)).mean(axis=1),

label='Training loss'

)

plt.plot(

np.array(test_losses).reshape((n_epochs, -1)).mean(axis=1),

label='Validation loss'

)

plt.legend(frameon=False)

plt.xlabel('epochs')

plt.ylabel('MSE')

The following diagram shows the resulting plot:

We stopped our training just in time before our validation loss stopped decreasing.

We can also rank and bin our target variable and plot the predictions against it in order to see how the model is performing across the whole spectrum of house prices. This is to avoid the situation in regression, especially with MSE as the loss, that you only predict well for a mid-range of values, close to the mean, but don't do well for anything else. You can find the code for this in the notebook on GitHub. This is called a lift chart (here with 10 bins):

We can see that the model, in fact, predicts very closely across the whole range of house prices. In fact, we get a Spearman rank correlation of about 93% with very high significance, which confirms that this model performs with high accuracy.

## How it works...

The deep learning neural network frameworks use different optimization algorithms. Popular among them are **Stochastic Gradient Descent **(**SGD**), **Root Mean Square Propogation** (**RMSProp**), and **Adaptive Moment Estimation** (**ADAM**).

We defined stochastic gradient descent as our optimization algorithm. Alternatively, we could have defined other optimizers:

opt_SGD = torch.optim.SGD(net_SGD.parameters(), lr=LR)

opt_Momentum = torch.optim.SGD(net_Momentum.parameters(), lr=LR, momentum=0.6)

opt_RMSprop = torch.optim.RMSprop(net_RMSprop.parameters(), lr=LR, alpha=0.1)

opt_Adam = torch.optim.Adam(net_Adam.parameters(), lr=LR, betas=(0.8, 0.98))

SGD works the same as gradient descent except that it works on a single example at a time. The interesting part is that the convergence is similar to the gradient descent and is easier on the computer memory.

RMSProp works by adapting the learning rates of the algorithm according to the gradient signs. The simplest of the variants checks the last two gradient signs and then adapts the learning rate by increasing it by a fraction if they are the same, or decreases it by a fraction if they are different.

ADAM is one of the most popular optimizers. It's an adaptive learning algorithm that changes the learning rate according to the first and second moments of the gradients.

Captum is a tool that can help us understand the ins and outs of the neural network model learned on the datasets. It can assist in learning the following:

- Feature importance
- Layer importance
- Neuron importance

This is very important in learning interpretable neural networks. Here, integrated gradients have been applied to understand feature importance. Later, neuron importance is also demonstrated by using the layer conductance method.

## There's more...

Given that we have our neural network defined and trained, let's find the important features and neurons using the captum library:

from captum.attr import (

IntegratedGradients,

LayerConductance,

NeuronConductance

)

house_model.cpu()

for embedding in house_model.embeddings:

embedding.cpu()

house_model.cpu()

ing_house = IntegratedGradients(forward_func=house_model.forward, )

#X_test_cat_pt.requires_grad_()

X_test_num_pt.requires_grad_()

attr, delta = ing_house.attribute(

X_test_num_pt.cpu(),

target=None,

return_convergence_delta=True,

additional_forward_args=X_test_cat_pt.cpu()

)

attr = attr.detach().numpy()

Now, we have a NumPy array of feature importances.

Layer and neuron importance can also be obtained using this tool. Let's look at the neuron importances of our first layer. We can pass on `house_model.act1`, which is the ReLU activation function on top of the first linear layer:

cond_layer1 = LayerConductance(house_model, house_model.act1)

cond_vals = cond_layer1.attribute(X_test, target=None)

cond_vals = cond_vals.detach().numpy()

df_neuron = pd.DataFrame(data = np.mean(cond_vals, axis=0), columns=['Neuron Importance'])

df_neuron['Neuron'] = range(10)

This is how it looks:

The diagram shows the neuron importances. Apparently, one neuron is just not important.

We can also see the most important variables by sorting the NumPy array we've obtained earlier:

df_feat = pd.DataFrame(np.mean(attr, axis=0), columns=['feature importance'] )

df_feat['features'] = num_features

df_feat.sort_values(

by='feature importance', ascending=False

).head(10)

So here's a list of the 10 most important variables:

Often, feature importances can help us to both understand the model and prune our model to become less complex (and hopefully less overfitted).

## See also

The PyTorch documentation includes everything you need to know about layer types, data loading, losses, metrics, and training: https://pytorch.org/docs/stable/nn.html

A detailed discussion about optimization algorithms can be found in the following article: https://imaddabbura.github.io/post/gradient-descent-algorithm/. Geoffrey Hinton and others explain mini-batch gradient descent in a presentation slide deck: https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Finally, you can find all the details on ADAM in the article that introduced it: https://arxiv.org/abs/1412.6980.

Captum provides a lot of functionality as regards the interpretability and model inspection of PyTorch models. It's worth having a look at its comprehensive documentation at https://captum.ai/. Details can be found in the original paper at https://arxiv.org/pdf/1703.01365.pdf.

# Live decisioning customer values

Let's assume we have the following scenario: we have a list of customers to call in order to sell them our product. Each phone call costs money in call center personal salaries, so we want to reduce these costs as much as possible. We have certain information about each customer that could help us determine whether they are likely to buy. After every call, we can update our model. The main goal is to call only the most promising customers and to improve our insights into which customers are more likely to pay for our product.

In this recipe, we will approach this with active learning, a strategy where we actively decide what to explore (and learn) next. Our model will help decide whom to call. Because we will update our model after each query (phone call), we will use online learning models.

## Getting ready

We'll prepare for our recipe by downloading our dataset and installing a few libraries.

Again, we will get the data from OpenML:

!pip install -q openml

import openml

dataset = openml.datasets.get_dataset(1461)

X, y, categorical_indicator, _ = dataset.get_data(

dataset_format='DataFrame',

target=dataset.default_target_attribute

)

categorical_features = X.columns[categorical_indicator]

numeric_features = X.columns[

[not(i) for i in categorical_indicator]

]

This dataset is called `bank-marketing`, and you can see a description on OpenML at https://www.openml.org/d/1461.

For each row, describing a single person, we have different features, numerical and categorical, that tell us about demographics and customer history.

To model the likelihood of customers signing up for our product, we will use the scikit-multiflow package that specializes in online models. We will also use the `category_encoders` package again:

!pip install scikit-multiflow category_encoders

With these two libraries in place, we can start the recipe.

## How to do it...

We need to implement an exploration strategy and a model that is being continuously updated. We are using the online version of the random forest, the Hoeffding Tree, as our model. We are estimating the uncertainties at every step, and based on that we will return a candidate to call next.

As always, we will need to define a few preprocessing steps:

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import FunctionTransformer

import category_encoders as ce

ordinal_encoder = ce.OrdinalEncoder(

cols=None, # all features that it encounters

handle_missing='return_nan',

handle_unknown='ignore'

).fit(X)

preprocessor = ColumnTransformer(

transformers=[

('cat', ordinal_encoder, categorical_features),

('num', FunctionTransformer(validate=False), numeric_features)

])

preprocessor = preprocessor.fit(X)

Then we come to our active learning approach itself. This is inspired by `modAL.models.ActiveLearner`:

import numpy as np

from skmultiflow.trees.hoeffding_tree import HoeffdingTreeClassifier

from sklearn.metrics import roc_auc_score

import random

class ActivePipeline:

def __init__(self, model, preprocessor, class_weights):

self.model = model

self.preprocessor = preprocessor

self.class_weights = class_weights

@staticmethod

def values(X):

if isinstance(X, (np.ndarray, np.int64)):

return X

else:

return X.values

def preprocess(self, X):

X_ = pd.DataFrame(

data=self.values(X),

columns=[

'V1', 'V2', 'V3', 'V4',

'V5', 'V6', 'V7', 'V8',

'V9', 'V10', 'V11', 'V12',

'V13', 'V14', 'V15', 'V16'

])

return self.preprocessor.transform(X_)

def fit(self, X, ys):

weights = [self.class_weights[y] for y in ys]

self.model.fit(self.preprocess(X), self.values(ys))

def update(self, X, ys):

if isinstance(ys, (int, float)):

weight = self.class_weights[y]

else:

weight = [self.class_weights[y] for y in ys]

self.model.partial_fit(

self.preprocess(X),

self.values(ys),

weight

)

def predict(self, X):

return self.model.predict(

self.preprocess(X)

)

def predict_proba(self, X):

return self.model.predict_proba(

self.preprocess(X)

)

@staticmethod

def entropy(preds):

return -np.sum(

np.log((preds + 1e-15) * preds)

/ np.log(np.prod(preds.size))

)

def max_margin_uncertainty(self, X, method: str='entropy',

exploitation: float=0.9, favor_class: int=1, k: int=1

):

'''similar to modAL.uncertainty.margin_uncertainty

'''

probs = self.predict_proba(X)

if method=='margin':

uncertainties = np.abs(probs[:,2] - probs[:, 1]) / 2.0

elif method=='entropy':

uncertainties = np.apply_along_axis(self.entropy, 1, probs[:, (1,2)])

else: raise(ValueError('method not implemented!'))

if favor_class is None:

weights = uncertainties

else: weights = (1.0 - exploitation) * uncertainties + exploitation * probs[:, favor_class]

if self.sampling:

ind = random.choices(

range(len(uncertainties)), weights, k=k

)

else:

ind = np.argsort(weights, axis=0)[::-1][:k]

return ind, np.mean(uncertainties[ind])

def score(self, X, y, scale=True):

probs = self.predict_proba(X, probability=2)

if scale:

probs = np.clip(probs - np.mean(probs) + 0.5, 0, 1)

return roc_auc_score(y, probs)

Again, we create a scikit-learn-compatible class. It basically holds a machine learning model and a data preprocessor. We implement `fit()` and `predict()`, but also `score()` to get a model performance. We also implement an `update()` method that calls `partial_fit()` of the machine learning model. Calling `partial_fit()` instead of `fit()` considerably speeds up the computations, because we don't have to start from scratch every time we get new data.

Here's how to create the active learning pipeline:

active_pipeline = ActivePipeline(

HoeffdingTreeClassifier(),

preprocessor,

class_weights.to_dict()

)

active_pipeline.model.classes = [0, 1, 2]

We can run different simulations on our dataset with this setup. For example, we can compare a lot of experimentation (0.5 exploitation) against only exploitation (1.0), or no learning at all after the first batch. We basically go through a loop:

- Via
`active_pipeline.``max_margin_uncertainty()`, we present data to the active pipeline, and get a number of data points that integrate uncertainty and target predictions according to our preferences of the integration method. - Once we get the actual results for these data points, we can update our model:
`active_pipeline.update()`.

You can see an example of this in the notebook on GitHub.

We can see that curious wins out after the first few examples. Exploitation is actually the least successful scheme. By not updating the model, performance deteriorates over time:

This is an ideal scenario for active learning or reinforcement learning, because, not unlike in reinforcement learning, uncertainty can be an additional criterion, apart from positive expectation, from a customer. Over time, this entropy reduction-seeking behavior reduces as the model's understanding of customers improves.

## How it works...

It's worth delving a bit more into a few of the concepts and strategies employed in this recipe.

### Active learning

Active learning means that we can actively query for more information; in other words, exploration is part of our strategy. This can be useful in scenarios where we have to actively decide what to learn, and where what we learn influences not only how much our model learns and how well, but also how much return on an investment we can get.

### Hoeffding Tree

The Hoeffding Tree (also known as the *Very Fast Decision Tree*, *VFDT* for short) was introduced in 2001 by Geoff Hulten and others (*Mining time-changing data streams*). It is an incrementally growing decision tree for streamed data. Tree nodes are expanded based on the Hoeffding bound (or additive Chernoff bound). It was theoretically shown that, given sufficient training data, a model learned by the Hoeffding tree converges very closely to the one built by a non-incremental learner.

The Hoeffding bound is defined as follows:

It's important to note that the Hoeffding Tree doesn't deal with data distributions that change over time.

### Class weighting

Since we are dealing with an imbalanced dataset, let's use class weights. This basically means that we are upsampling the minority (signing up) class and downsampling the majority class (not signing up).

The formula for the class weights is as follows:

Similarly, in Python, we can write the following:

class_weights = len(X) / (y.astype(int).value_counts() * 2)

We can then use these class weights for sampling.

We'll close the recipe with a few more pointers.

## See also

Only a few models in scikit-learn allow incremental or online learning. Refer to the list at https://scikit-learn.org/stable/modules/computing.html.

A few linear models include the `partial_fit()` method. The scikit-multiflow library specializes in incremental and online/streaming models: https://scikit-multiflow.github.io/

You can find more resources and ideas regarding active learning from a recent review that concentrates on biomedical image processing (Samuel Budd and others, *A Survey on Active Learning and Human-in-the-Loop Deep Learning for Medical Image Analysis*, 2019; https://arxiv.org/abs/1910.02923).

Our approach is inspired by the modalAI Python active learning package, which you can find at https://modal-python.readthedocs.io/. We recommend you check it out if you are interested in active learning approaches. A few more Python packages are available, as follows:

- Alipy: Active Learning in Python: http://parnec.nuaa.edu.cn/huangsj/alipy/
- Active Learning: A Google repo about active learning: https://github.com/google/active-learning

One of the main decisions in active learning is the trade-off between exploration and exploitation. You can find out more about this in a paper called *Exploration versus exploitation in active learning: a Bayesian approach*: http://www.vincentlemaire-labs.fr/publis/ijcnn_2_2010_camera_ready.pdf

# Battling algorithmic bias

**Correctional Offender Management Profiling for Alternative Sanctions** (**COMPAS**) is a commercial algorithm that assigns risk scores to criminal defendants based on their case records. This risk score corresponds to the likelihood of reoffending (recidivism) and of committing violent crimes, and this score is used in court to help determine sentences. The ProPublica organization obtained scores and data in 1 county in Florida, containing data on about 7,000 people. After waiting for 2 years to see who reoffended, they audited the COMPAS model in 2016 and found very concerning issues with the model. Among the ProPublica findings was discrimination according to gender, race, and ethnicity, particularly in the case of over-predicting recidivism for ethnic minorities.

Generally, we would think that justice should be blind to gender or race. This means that court decisions should not take these sensitive variables like race or gender into account. However, even if we omit them from our model training, these sensitive variables might be correlated to some of the other variables, and therefore they can still affect decisions, to the detriment of protected groups such as minorities or women.

In this section, we are going to work with the COMPAS modeling dataset as provided by ProPublica. We are going to check for racial bias, and then create a model to remove it. You can find the original analysis by ProPublica at https://github.com/propublica/compas-analysis.

## Getting ready

Before we can start, we'll first download the data, mention issues in preprocessing, and install the libraries.

Let's get the data:

!wget https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv

import pandas as pd

date_cols = [

'compas_screening_date', 'c_offense_date',

'c_arrest_date', 'r_offense_date',

'vr_offense_date', 'screening_date',

'v_screening_date', 'c_jail_in',

'c_jail_out', 'dob', 'in_custody',

'out_custody'

]

data = pd.read_csv(

'compas-scores-two-years.csv',

parse_dates=date_cols

)

Each row represents the risk of violence and the risk of recidivism scores for an inmate. The final column, `two_year_recid`, indicates our target.

ProPublica compiled their dataset from different sources, which they matched up according to the names of offenders:

- Criminal records from the Broward County Clerk's Office website
- Public incarceration records from the Florida Department of Corrections website
- COMPAS scores, which they obtained through a public record information request

We can highlight a few issues in the dataset:

- The column race is a protected category. It should not be used as a feature for model training, but as a control.
- There are full names in the dataset, which will not be useful, or might even give away the ethnicity of the inmates.
- There are case numbers in the dataset. These will likely not be useful for training a model, although they might have some target leakage in the sense that increasing case numbers might give an indication of the time, and there could be a drift effect in the targets over time.
- There are missing values. We will need to carry out imputation.
- There are date stamps. These will probably not be useful and might even come with associated problems (see point 3). However, we can convert these features into UNIX epochs, which indicates the number of seconds that have elapsed since 1970, and then calculate time periods between date stamps, for example, by repurposing
`NumericDifferenceTransformer`that we saw in an earlier recipe. We can then use these periods as model features rather than the date stamps. - We have several categorical variables.
- The charge description (
`c_charge_desc`) might need to be cleaned up.

Mathias Barenstein has pointed out (https://arxiv.org/abs/1906.04711) a data processing error in ProPublica's cutoff that resulted in keeping 40% more recidivists than they should have. We'll apply his correction to the two-year cutoff:

import datetime

indexes = data.compas_screening_date <= pd.Timestamp(datetime.date(2014, 4, 1))

assert indexes.sum() == 6216

data = data[indexes]

We will use a few libraries in this recipe, which can be installed as follows:

!pip install category-encoders

`category-encoders` is a library that provides functionality for categorical encoding beyond what scikit-learn provides.

## How to do it...

Let's get some basic terminology out of the way first. We need to come up with metrics for fairness. But what does fairness (or, if we look at unfairness, bias) mean?

Fairness can refer to two very different concepts:

- [
**equal opportunity**]: There should be no difference in the relative ratios of predicted by the model versus actually true; or - [
**equal outcome**]: There should be no difference between model outcomes at all.

The first is also called **equal odds**, while the latter refers to **equal false positive rates**. While equal opportunity means that each group should be given the same chance regardless of their group, the equal outcome strategy implies that the underperforming group should be given more lenience or chances relative to the other group(s).

We'll go with the idea of false positive rates, which intuitively appeals, and which is enshrined in law in many jurisdictions in the case of equal employment opportunities. We'll provide a few resources about these terms in the *See also* section.

Therefore, the logic for the impact calculation is based on values in the confusion matrix, most importantly, false positives, which we've just mentioned. These cases are predicted positive even though they are actually negative; in our case, people predicted as reoffenders, who are not reoffenders. Let's write a function for this:

def confusion_metrics(actual, scores, threshold):

y_predicted = scores.apply(

lambda x: x >= threshold

).values y_true = actual.values TP = (

(y_true==y_predicted) &

(y_predicted==1)

).astype(int) FP = (

(y_true!=y_predicted) &

(y_predicted==1)

).astype(int) TN = (

(y_true==y_predicted) &

(y_predicted==0)

).astype(int) FN = (

(y_true!=y_predicted) &

(y_predicted==0)

).astype(int)

return TP, FP, TN, FN

We can now use this function in order to summarize the impact on particular groups with this code:

def calculate_impacts(data, sensitive_column='race', recid_col='is_recid', score_col='decile_score.1', threshold=5.0): if sensitive_column == 'race': norm_group = 'Caucasian' elif sensitive_column == 'sex': norm_group = 'Male' else: raise ValueError('sensitive column not implemented') TP, FP, TN, FN = confusion_metrics(

actual=data[recid_col],

scores=data[score_col],

threshold=threshold

) impact = pd.DataFrame( data=np.column_stack([ FP, TN, FN, TN, data[sensitive_column].values, data[recid_col].values, data[score_col].values / 10.0 ]), columns=['FP', 'TP', 'FN', 'TN', 'sensitive', 'reoffend', 'score'] ).groupby(by='sensitive').agg({ 'reoffend': 'sum', 'score': 'sum', 'sensitive': 'count', 'FP': 'sum', 'TP': 'sum', 'FN': 'sum', 'TN': 'sum' }).rename( columns={'sensitive': 'N'} )

impact['FPR'] = impact['FP'] / (impact['FP'] + impact['TN']) impact['FNR'] = impact['FN'] / (impact['FN'] + impact['TP']) impact['reoffend'] = impact['reoffend'] / impact['N'] impact['score'] = impact['score'] / impact['N'] impact['DFP'] = impact['FPR'] / impact.loc[norm_group, 'FPR'] impact['DFN'] = impact['FNR'] / impact.loc[norm_group, 'FNR'] return impact.drop(columns=['FP', 'TP', 'FN', 'TN'])

This first calculates the confusion matrix with true positives and false negatives, and then encodes the **adverse impact ratio** (**AIR**), known in statistics also as the **Relative Risk Ratio** (**RRR**). Given any performance metric, we can write the following:

This expresses an expectation that the metric for the protected group (African-Americans) should be the same as the metric for the norm group (Caucasians). In this case, we'll get 1.0. If the metric of the protected group is more than 20 percentage points different to the norm group (that is, lower than 0.8 or higher than 1.2), we'll flag it as a significant discrimination.

**Norm group**: **a norm group**, also known as a **standardization sample** or **norming group**, is a sample of the dataset that represents the population to which the statistic is intended to be compared. In the context of bias, its legal definition is the group with the highest success, but in some contexts, the entire dataset or the most frequent group are taken as the baseline instead. Pragmatically, we take the white group, since they are the biggest group, and the group for which the model works best.

In the preceding function, we calculate the false positive rates by sensitive group. We can then check whether the false positive rates for African-Americans versus Caucasians are disproportionate, or rather whether the false positive rates for African-Americans are much higher. This would mean that African-Americans get flagged much more often as repeat offenders than they should be. We find that this is indeed the case:

A short explanation about this table follows:

- reoffend: frequencies of reoffending
- score: average score for the group
- N: the total number of people in the group
- FPR: false positive rates
- FNR: false negative rates
- DFP: disproportionate false positive
- DFN: disproportionate false negative

The last FPR and FNR columns together can give an idea about the general quality of the model. If both are high, the model just doesn't perform well for the particular group. The last two columns express the adverse impact ratio of FPR and FNR ratios, respectively, which is what we'll mostly focus on. We need to reduce the racial bias in the model by reducing the FPR of African-Americans to a tolerable level.

Let's do some preprocessing and then we'll build the model:

`from sklearn.feature_extraction.text import CountVectorizer`

from category_encoders.one_hot import OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

charge_desc = data['c_charge_desc'].apply(lambda x: x if isinstance(x, str) else '')

count_vectorizer = CountVectorizer(

max_df=0.85, stop_words='english',

max_features=100, decode_error='ignore'

)

charge_desc_features = count_vectorizer.fit_transform(charge_desc)

one_hot_encoder = OneHotEncoder()

charge_degree_features = one_hot_encoder.fit_transform(

data['c_charge_degree']

)

data['race_black'] = data['race'].apply(lambda x: x == 'African-American').astype(int)

stratification = data['race_black'] + (data['is_recid']).astype(int) * 2

`CountVectorizer` counts a vocabulary of words, indicating how many times each word is used. This is called a bag-of-words representation, and we apply it to the charge description column. We exclude English stop words, which are very common words such as prepositions (such as **on** or **at**) and personal pronouns (for example **I **or **me**); we also limit the vocabulary to 100 words and words that don't appear in more than 85% of the fields.

We apply dummy encoding (one-hot encoding) to the charge degree.

Why do we use two different transformations? Basically, the description is a textual description of why someone was charged with a crime. Every field is different. If we used one-hot encoding, every field would get their own dummy variable, and we wouldn't be able to see any commonalities between fields.

In the end, we create a new variable for stratification in order to make sure that we have similar proportions in the training and test datasets for both recidivism (our target variable) and whether someone is African-American. This will help us to calculate metrics to check for discrimination:

```
y = data['is_recid']
X = pd.DataFrame(
data=np.column_stack(
[data[['juv_fel_count', 'juv_misd_count',
```

'juv_other_count', 'priors_count', 'days_b_screening_arrest']],
charge_degree_features,
charge_desc_features.todense()
]
),
columns=['juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count', 'days_b_screening_arrest'] \
+ one_hot_encoder.get_feature_names() \
+ count_vectorizer.get_feature_names(),
index=data.index
)
X['jailed_days'] = (data['c_jail_out'] - data['c_jail_in']).apply(lambda x: abs(x.days))
X['waiting_jail_days'] = (data['c_jail_in'] - data['c_offense_date']).apply(lambda x: abs(x.days))
X['waiting_arrest_days'] = (data['c_arrest_date'] - data['c_offense_date']).apply(lambda x: abs(x.days))
X.fillna(0, inplace=True)

columns = list(X.columns)
X_train, X_test, y_train, y_test = train_test_split(

X, y, test_size=0.33,

random_state=42,

stratify=stratification

) # we stratify by black and the target

We do some data engineering, deriving variables to record how many days someone has spent in jail, has waited for a trial, or has waited for an arrest.

We'll build a neural network model using jax similar to the one we've encountered in the *Classifying in scikit-learn, Keras, and PyTorch *recipe in Chapter 1, *Getting Started with Artificial Intelligence in Python*. This time, we'll do a fully fledged implementation:

```
import jax.numpy as jnp
from jax import grad, jit, vmap, ops, lax
import numpy.random as npr
```

import numpy as onp
import random
from tqdm import trange
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import StandardScaler
class JAXLearner(ClassifierMixin):
def __init__(self, layer_sizes=[10, 5, 1], epochs=20, batch_size=500, lr=1e-2):
self.params = self.construct_network(layer_sizes)
self.perex_grads = jit(grad(self.error))
self.epochs = epochs
self.batch_size = batch_size
self.lr = lr
@staticmethod
def construct_network(layer_sizes=[10, 5, 1]):
'''Please make sure your final layer corresponds to targets in dimensions.
'''
def init_layer(n_in, n_out):
W = npr.randn(n_in, n_out)
b = npr.randn(n_out,)
return W, b
return list(map(init_layer, layer_sizes[:-1], layer_sizes[1:]))
@staticmethod
def sigmoid(X): # or tanh
return 1/(1+jnp.exp(-X))
def _predict(self, inputs):
for W, b in self.params:
outputs = jnp.dot(inputs, W) + b
inputs = self.sigmoid(outputs)
return outputs
def predict(self, inputs):
inputs = self.standard_scaler.transform(inputs)
return onp.asarray(self._predict(inputs))
@staticmethod
def mse(preds, targets, other=None):
return jnp.sqrt(jnp.sum((preds - targets)**2))
@staticmethod

def penalized_mse(preds, targets, sensitive):

err = jnp.sum((preds - targets)**2)

err_s = jnp.sum((preds * sensitive - targets * sensitive)**2)

penalty = jnp.clip(err_s / err, 1.0, 2.0)

return err * penalty
def error(self, params, inputs, targets, sensitive):
preds = self._predict(inputs)
return self.penalized_mse(preds, targets, sensitive)
def fit(self, X, y, sensitive):
self.standard_scaler = StandardScaler()
X = self.standard_scaler.fit_transform(X)
N = X.shape[0]
indexes = list(range(N))
steps_per_epoch = N // self.batch_size
for epoch in trange(self.epochs, desc='training'):
random.shuffle(indexes)
index_offset = 0
for step in trange(steps_per_epoch, desc='iteration'):
grads = self.perex_grads(
self.params,
X[indexes[index_offset:index_offset+self.batch_size], :],
y[indexes[index_offset:index_offset+self.batch_size]],
sensitive[indexes[index_offset:index_offset+self.batch_size]]
)
# print(grads)
self.params = [(W - self.lr * dW, b - self.lr * db)
for (W, b), (dW, db) in zip(self.params, grads)]
index_offset += self.batch_size

This is a scikit-learn wrapper of a JAX neural network. For scikit-learn compatibility, we inherit from `ClassifierMixin` and implement `fit()` and `predict()`. The most important part here is the penalized MSE method, which, in addition to model predictions and targets, takes into account a sensitive variable.

Let's train it and check the performance. Please note that we feed in `X`, `y`, and `sensitive_train`, which we define as the indicator variable for African-American for the training dataset:

`sensitive_train = X_train.join(`

data, rsuffix='_right'

)['race_black']

jax_learner = JAXLearner([X.values.shape[1], 100, 1])

jax_learner.fit(

X_train.values,

y_train.values,

sensitive_train.values

)

We visualize the statistics as follows:

X_predicted = pd.DataFrame(

data=jax_learner.predict(

X_test.values

) * 10,

columns=['score'],

index=X_test.index

).join(

data[['sex', 'race', 'is_recid']],

rsuffix='_right'

)

calculate_impacts(X_predicted, score_col='score')

This is the table we get:

We can see that the disproportionate false positive rate for African-Americans is very close to (and even lower than) 1.0, which is what we wanted. The test set is small and doesn't contain enough samples for Asians and native Americans, so we can't calculate meaningful statistics for those groups. We could, however, extend our approach to encompass these two groups as well if we wanted to ensure that these two groups had equal false positive rates.

## How it works...

The keys for this to work are custom objective functions or loss functions. This is far from straightforward in scikit-learn, although we will show an implementation in the following section.

Generally, there are different possibilities for implementing your own cost or loss functions.

- LightGBM, Catboost, and XGBoost each provide an interface with many loss functions and the ability to define custom loss functions.
- PyTorch and Keras (TensorFlow) provide an interface.
- You can implement your model from scratch (this is what we've done in the main recipe).

Scikit-learn generally does not provide a public API for defining your own loss functions. For many algorithms, there is only a single choice, and sometimes there are a couple of alternatives. The rationale in the case of split criteria with trees is that loss functions have to be performant, and only Cython implementations will guarantee this. This is only available in a non-public API, which means it is more difficult to use.

Finally, when there's no (straightforward) way to implement a custom loss, you can wrap your algorithms in a general optimization scheme such as genetic algorithms.

In neural networks, as long as you provide a differentiable loss function, you can plug in anything you want.

Basically, we were able to encode the adverse impact as a penalty term with the **Mean Squared Error** (**MSE**) function. It is based on the MSE that we've mentioned before, but has a penalty term for adverse impact. Let's look again at the loss function:

@staticmethod def penalized_mse(preds, targets, sensitive): err = jnp.sum((preds - targets)**2) err_s = jnp.sum((preds * sensitive - targets * sensitive)**2) penalty = jnp.clip(err_s / err, 1.0, 2.0) return err * penalty

The first thing to note is that instead of two variables, we pass three variables. `sensitive` is the variable relevant to the adverse impact, indicating if we have a person from a protected group.

The calculation works as follows:

- We calculate the MSE overall, err, from model predictions and targets.
- We calculate the MSE for the protected group,
`err_s`. - We take the ratio of the MSE for the protected group over the MSE overall (AIR) and limit it to between 1.0 and 2.0. We don't want values lower than 1, because we are only interested in the AIR if it's negatively affecting the protected group.
- We then multiply AIR by the overall MSE.

As for 2, the MSE can simply be calculated by multiplying the predictions and targets, each by `sensitive`. That would cancel out all points, where sensitive is equal to 0.

As for 4, it might seem that this would cancel out the overall error, but we see that it actually seems to work. We probably could have added the two terms as well to give both errors a similar importance.

We use the autograd functionality in Jax to differentiate this.

## There's more...

In the following, we'll use the non-public scikit-learn API to implement a custom split criterion for decision trees. We'll use this to train a random forest model with the COMPAS dataset:

```
%%cython
from sklearn.tree._criterion cimport ClassificationCriterion
from sklearn.tree._criterion cimport SIZE_t
import numpy as np
cdef double INFINITY = np.inf
from libc.math cimport sqrt, pow
from libc.math cimport abs
cdef class PenalizedHellingerDistanceCriterion(ClassificationCriterion):
cdef double proxy_impurity_improvement(self) nogil:
cdef double impurity_left
cdef double impurity_right
self.children_impurity(&impurity_left, &impurity_right)
return impurity_right + impurity_left
cdef double impurity_improvement(self, double impurity) nogil:
cdef double impurity_left
cdef double impurity_right
self.children_impurity(&impurity_left, &impurity_right)
return impurity_right + impurity_left
cdef double node_impurity(self) nogil:
cdef SIZE_t* n_classes = self.n_classes
cdef double* sum_total = self.sum_total
cdef double hellinger = 0.0
cdef double sq_count
cdef double count_k
cdef SIZE_t k
cdef SIZE_t c
for k in range(self.n_outputs):
for c in range(n_classes[k]):
hellinger += 1.0
return hellinger / self.n_outputs
cdef void children_impurity(self, double* impurity_left,
double* impurity_right) nogil:
cdef SIZE_t* n_classes = self.n_classes
cdef double* sum_left = self.sum_left
cdef double* sum_right = self.sum_right
cdef double hellinger_left = 0.0
cdef double hellinger_right = 0.0
cdef double count_k1 = 0.0
cdef double count_k2 = 0.0
cdef SIZE_t k
cdef SIZE_t c
# stop splitting in case reached pure node with 0 samples of second class
if sum_left[1] + sum_right[1] == 0:
impurity_left[0] = -INFINITY
impurity_right[0] = -INFINITY
return
for k in range(self.n_outputs):
if(sum_left[0] + sum_right[0] > 0):
count_k1 = sqrt(sum_left[0] / (sum_left[0] + sum_right[0]))
if(sum_left[1] + sum_right[1] > 0):
count_k2 = sqrt(sum_left[1] / (sum_left[1] + sum_right[1]))
hellinger_left += pow((count_k1 - count_k2), 2)
if(sum_left[0] + sum_right[0] > 0):
count_k1 = sqrt(sum_right[0] / (sum_left[0] + sum_right[0]))
if(sum_left[1] + sum_right[1] > 0):
count_k2 = sqrt(sum_right[1] / (sum_left[1] + sum_right[1]))
if k==0:
hellinger_right += pow((count_k1 - count_k2), 2)
else:
hellinger_right -= pow((count_k1 - count_k2), 2)
impurity_left[0] = hellinger_left / self.n_outputs
impurity_right[0] = hellinger_right / self.n_outputs
```

Let's use this for training and test it:

ensemble = [ DecisionTreeClassifier( criterion=PenalizedHellingerDistanceCriterion(

2, np.array([2, 2], dtype='int64')

), max_depth=100 ) for i in range(100) ] for model in ensemble: model.fit( X_train, X_train.join(

data,

rsuffix='_right'

)[['is_recid', 'race_black']] ) Y_pred = np.array(

[model.predict(X_test) for model in ensemble]

) predictions2 = Y_pred.mean(axis=0)

This gives us an AUC of 0.62:

We can see that, although we came a long way, we didn't completely remove all bias. 30% (DFP for African-Americans) would still be considered unacceptable. We could try different refinements or sampling strategies to improve the result. Unfortunately, we wouldn't be able to use this model in practice.

As an example, a way to address this is to do model selection within the random forest. Since each of the trees would have their own way of classifying people, we could calculate the adverse impact statistics for each of the individual trees or combinations of trees. We could remove trees until we are left with a subset of trees that satisfy our adverse impact conditions. This is beyond the scope of this chapter.

## See also

You can read up more on algorithmic fairness in different places. There's a wide variety of literature available on fairness:

- A Science Magazine article about the COMPAS model (Julia Dressel and Hany Farid, 2018,
*The accuracy, fairness, and limits of predicting recidivism*): https://advances.sciencemag.org/content/4/1/eaao5580 *A comparative study of fairness-enhancing interventions in machine learning*(Sorelle Friedler and others, 2018): https://arxiv.org/pdf/1802.04422.pdf*A Survey on Bias and Fairness in Machine Learning*(Mehrabi and others, 2019): https://arxiv.org/pdf/1908.09635.pdf- The effect of explaining fairness (Jonathan Dodge and others, 2019): https://arxiv.org/pdf/1901.07694.pdf

Different Python libraries are available for tackling bias (or, inversely, algorithmic fairness):

- fairlearn: https://github.com/fairlearn/fairlearn
- AIF360: https://github.com/IBM/AIF360
- FairML: https://github.com/adebayoj/fairml
- BlackBoxAuditing: https://github.com/algofairness/BlackBoxAuditing
- Balanced Committee Election: https://github.com/huanglx12/Balanced-Committee-Election

Finally, Scikit-Lego includes functionality for fairness: https://scikit-lego.readthedocs.io/en/latest/fairness.html

While you can find many datasets on recidivism by performing a Google dataset search (https://toolbox.google.com/datasetsearch), there are many more applications and corresponding datasets where fairness is important, such as credit scoring, face recognition, recruitment, or predictive policing, to name just a few.

There are different places to find out more about custom losses. The article *Custom loss versus custom scoring* (https://kiwidamien.github.io/custom-loss-vs-custom-scoring.html) affords a good overview. For implementations of custom loss functions in gradient boosting, towardsdatascience (https://towardsdatascience.com/custom-loss-functions-for-gradient-boosting-f79c1b40466d) is a good place to start.

# Forecasting CO_{2} time series

In this recipe, we will test out some well-known models (ARIMA, SARIMA) and signal decomposition by forecasting using Facebook's Prophet library on the time series data, in order to check their performance at forecasting our time series of CO_{2} values.

## Getting ready

In order to prepare for this recipe, we'll install libraries and download a dataset.

We'll use the `statsmodels` package and prophet:

pip install statsmodels fbprophet

We will analyze the CO_{2} concentration data in this recipe. You can see the data loading in the notebook on GitHub accompanying this recipe, or in the scikit-learn **Gaussian process regression** (**GPR**) example regarding Mauna Loa CO_{2} data: https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_co2.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-co2-py

This dataset is one of the earliest recordings available on atmospheric recordings of CO_{2}. As it will be later observed, this data follows a sinusoidal pattern, with the CO_{2} concentration rising in winter and falling in the summer owing to the decreasing quantity of plants and vegetation in the winter season:

`X,y = load_mauna_loa_atmospheric_co2()`

The dataset contains the average CO_{2} concentration measured at Mauna Loa Observatory in Hawaii from 1958 to 2001. We will model the CO_{2} concentration with respect to that.

## How to do it...

Now we'll get to forecasting our time series of CO_{2} data. We'll first explore the dataset, and then we'll apply the ARIMA and SARIMA techniques.

- Let's have a look at the time series:

df_CO2 = pd.DataFrame(data = X, columns = ['Year'])

df_CO2['CO2 in ppm'] = y

lm = sns.lmplot(x='Year', y='CO2 in ppm', data=df_CO2, height=4, aspect=4)

fig = lm.fig

fig.suptitle('CO2 conc. mauna_loa 1958-2001', fontsize=12)

Here's the graph:

The script here shows the time series seasonal decomposition of the CO_{2} data, showing a clear seasonal variation in the CO_{2} concentration, which can be traced back to the biology:

import statsmodels.api as stmd

d = stm.datasets.co2.load_pandas()

co2 = d.data

co2.head()

y = co2['co2']

y = y.fillna(

y.interpolate()

) # Fill missing values by interpolation

Now that we have preprocessed data for decomposition, let's go ahead with it:

from pylab import rcParams

rcParams['figure.figsize'] = 11, 9

result = stm.tsa.seasonal_decompose(y, model='additive')

pd.plotting.register_matplotlib_converters()

result.plot()

plt.show()

Here, we see the decomposition: the observed time series, its trend, seasonal components, and what remains unexplained, the residual element:

Now, let's analyze the time series.

### Analyzing time series using ARIMA and SARIMA

We will fit ARIMA and SARIMA models to the dataset.

We'll define our two models and apply them to each point in the test dataset. Here, we iteratively fit the model on all the points and predict the next point, as a one-step-ahead.

- First, we split the data:

# taking a 90/10 split between training and testing:

future = int(len(y) * 0.9)

print('number of train samples: %d test samples %d' (future, len(y)-future)

)

train, test = y[:future], y[future:]

This leaves us with 468 samples for training and 53 for testing.

- Next, we define the models:

from statsmodels.tsa.arima_model import ARIMA

from statsmodels.tsa.statespace.sarimax import SARIMAX

def get_arima_model(history, order=(5, 1, 0)):

return ARIMA(history, order=order)

def get_sarima_model(

history,

order=(5, 1, 1),

seasonal_order=(0, 1, 1, 4)

):

return SARIMAX(

history,

order=order,

enforce_stationarity=True,

enforce_invertibility=False,

seasonal_order=seasonal_order

)

- Then we train the models:

from sklearn.metrics import mean_squared_error

def apply_model(train, test, model_fun=get_arima_model):

'''we just roll with the model and apply it to successive

time steps

'''

history = list(train)

predictions = []

for t in test:

model = model_fun(history).fit(disp=0)

output = model.forecast()

predictions.append(output[0])

history.append(t)

error = mean_squared_error(test, predictions)

print('Test MSE: %.3f' % error)

#print(model.summary().tables[1])

return predictions, error

predictions_arima, error_arima = apply_model(train, test)

predictions_sarima, error_sarima = apply_model(

train, test, get_sarima_model

)

We get an MSE in the test of 0.554 and 0.405 for ARIMA and SARIMA models, respectively. Let's see how the models fit graphically:

We could do a parameter exploration using the **Akaike information criterion** (**AIC**), which expresses the quality of the model relative to the number of parameters in the model. The model object returned by the fit function in statsmodels includes the AIC, so we could do a grid search over a range of parameters, and then select the model that minimizes the AIC.

## How it works...

Time series data is a collection of observations *x(t)*, where each data point is recorded at time *t*. In most cases, time is a discrete variable, that is, . We are looking at forecasting, which is the task of predicting future values based on the previous observations in the time series.

In order to explain the models that we've used, ARIMA and SARIMA, we'll have to go step by step, and explain each in turn:

**Autoregression**(**AR**)**Moving Average**(**MA**)**Autoregressive Moving Average**(**ARMA**)**Autoregressive Integrated Moving Average**(**ARIMA**) and**Seasonal Autoregressive Integrated Moving Average**(**SARIMA**)

ARIMA and SARIMA are based on the ARMA model, which is an **autoregressive moving average** model. Let's briefly go through some of the basics.

ARMA is a linear model, defined in two parts. First, the autoregressive linear model:

Here, are parameters and is a constant, is white noise, and is the order of the model (or the window size of the linear model). The second part of ARMA is the moving average, and this is again a linear regression, but of non-observable, lagged error terms, defined as follows:

Here, is the order of the moving average, are the parameters, and the expectation or the mean of the time series . The ARMA(p, q) model is then the composite of both of these models, AR(p) and MA(q):

The fitting procedure is a bit involved, particularly because of the MA part. You can read up on the Box-Jenkins method on Wikipedia if you are interested: https://en.wikipedia.org/wiki/Box%E2%80%93Jenkins_method

There are a few limitations to note, however. The time series has to be the following:

- Stationary: Basically, mean and covariance across observations have to be constant over time.
- Nonperiodic: Although bigger values for p and q could be used to model seasonality, it is not part of the model.
- Linear: Each value for can be modeled as a linear combination of previous values and error terms.

There are different extensions of ARMA to address the first two limitations, and that's where ARIMA and SARIMA come in.

ARIMA (*p*, *d*, *q*) stands for **autoregressive integrated moving average**. It comes with three parameters:

**p**: The number of autoregressive terms (autoregression)**d**: The number of non-seasonal differences needed for stationarity (integration)**q**: The number of lagged forecast errors (moving average)

The *integration* refers to differencing. In order to stabilize the mean, we can take the difference between consecutive observations. This can also remove a trend or eliminate seasonality. It can be written as follows:

This can be repeated several times, and this is what the parameter d describes that ARIMA comes with. Please note that ARIMA can handle drifts and non-stationary time series. However, it is still unable to handle seasonality.

SARIMA stands for seasonal ARIMA, and is an extension of ARIMA in that it also takes into account the seasonality of the data.

*SARIMA*(*p*, *d*, *q*)(*P*, *D*, *Q*)*m* contains the non-seasonal parameters of ARIMA and additional seasonal parameters. Uppercase letters P, D, and Q annotate the seasonal moving average and autoregressive components, where *m* is the number of periods in each season. Often this is the number of periods in a year; for example *m=4* would stand for a quarterly seasonal effect, meaning that *D* stands for seasonal differencing between observations *Xt* and *Xt-m*, and *P* and *Q* stand for linear models with backshifts of *m*.

In Python, the statsmodels library provides a method to perform the decomposition of the signal based on the seasonality of the data.

## There's more...

Prophet is a library provided by Facebook for forecasting time series data. It works on an additive model and fits non-linear models. The library works best when the data has strong seasonal effects and has enough historic trends available.

Let's see how we can use it:

from fbprophet import Prophet

train_df = df_CO2_fb['1958':'1997']

test_df = df_CO2_fb['1998':'2001']

train_df = train_df.reset_index()

test_df = test_df.reset_index()Co2_model= Prophet(interval_width=0.95)

Co2_model.fit(train_df)

train_forecast = Co2_model.predict(train_df)

test_forecast = Co2_model.predict(test_df)

fut = Co2_model.make_future_DataFrame(periods=12, freq='M')

forecast_df = Co2_model.predict(fut)

Co2_model.plot(forecast_df)

Here are our model forecasts:

We get a similar decomposition as before with the ARIMA/SARIMA models, namely, the trend and the seasonal components:

The yearly variation nicely shows the rise and fall of the CO_{2} concentration according to the seasons. The trend clearly goes up over time, which could be worrisome if you think about global warming.

## See also

We've used the following libraries in this recipe:

- Statsmodels: http://statsmodels.sourceforge.net/stable/
- Prophet: https://facebook.github.io/prophet/

There are many more interesting libraries relating to time series, including the following:

- Time series modeling using state space models in statsmodels:

https://www.statsmodels.org/dev/statespace.html - GluonTS: Probabilistic Time Series Models in MXNet (Python): https://gluon-ts.mxnet.io/
- SkTime: A unified interface for time series modeling: https://github.com/alan-turing-institute/sktime