The Kaggle Workbook

By Konrad Banachewicz , Luca Massaron

About this book

More than 80,000 Kaggle novices currently participate in Kaggle competitions. To help them navigate the often-overwhelming world of Kaggle, two Grandmasters put their heads together to write The Kaggle Book, which made plenty of waves in the community. Now, they’ve come back with an even more practical approach based on hands-on exercises that can help you start thinking like an experienced data scientist. In this book, you’ll get up close and personal with four extensive case studies based on past Kaggle competitions. You’ll learn how bright minds predicted which drivers would likely avoid filing insurance claims in Brazil and see how expert Kagglers used gradient-boosting methods to model Walmart unit sales time-series data. Get into computer vision by discovering different solutions for identifying the type of disease present on cassava leaves. And see how the Kaggle community created predictive algorithms to solve the natural language processing problem of subjective question-answering. You can use this workbook as a supplement alongside The Kaggle Book or on its own alongside resources available on the Kaggle website and other online communities. Whatever path you choose, this workbook will help make you a formidable Kaggle competitor.

Publication date: February 2023
Publisher: Packt
Pages: 172
ISBN: 9781804611210
Download code from GitHub

The Most Renowned Tabular Competition – Porto Seguro’s Safe Driver Prediction

Learning how to reach the top of the leaderboard in any Kaggle competition requires patience, diligence, and many attempts to learn the best way to compete and achieve top results. For this reason, we have designed a workbook that can help you build those skills faster by trying some past Kaggle competitions and learning how to reach the top of the leaderboard by reading discussions, reusing notebooks, engineering features, and training various models.

We start with one of the most renowned tabular competitions, Porto Seguro’s Safe Driver Prediction. In this competition, you are asked to solve a common problem in insurance and figure out who is going to have a car insurance claim in the next year. Such information is useful to increase the insurance fee for drivers more likely to have a claim and to lower it for those less likely to.

In illustrating the key insights and technicalities necessary for cracking this competition, we will show you the necessary code and ask you to study topics and answer questions found in The Kaggle Book itself. Therefore, without further ado, let’s start this new learning path of yours.

In this chapter, you will learn:

How to tune and train a LightGBM model
How to build a denoising autoencoder and how to use it to feed a neural network
How to effectively blend models that are quite different from each other

All the code files for this chapter can be found at https://packt.link/kwbchp1

Understanding the competition and the data

Porto Seguro is the third-largest insurance company in Brazil (it operates in Brazil and Uruguay), offering car insurance coverage as well as many other insurance products. For the past 20 years, it has used analytical methods and machine learning to tailor its prices and make auto insurance coverage accessible to more drivers. To explore new ways of doing so, the company sponsored a competition (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction), expecting Kagglers to come up with new and better methods of solving some of its core analytical problems.

The competition asks Kagglers to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year, which is quite a common kind of task (the sponsor calls it a “classical challenge for insurance”). An estimate of the probability of filing a claim is valuable to an insurance company. Without such a model, an insurer can only charge customers a flat premium irrespective of their risk or, with a poorly performing model, charge them a mismatched premium. Inaccuracies in profiling customers’ risk can therefore result in charging good drivers more and bad drivers less. The impact on the company is two-fold: good drivers will look elsewhere for their insurance, and the company’s portfolio will become overweighted with bad ones (technically, the company would have a bad loss ratio: https://www.investopedia.com/terms/l/loss-ratio.asp). If, instead, the company can correctly estimate the claim likelihood, it can ask a fair price from its customers, thus increasing its market share, having more satisfied customers and a more balanced customer portfolio (a better loss ratio), and managing its reserves better (the money the company sets aside for paying claims).

To do so, the sponsor provided training and test datasets, and the competition was ideal for anyone since the dataset was not very large and was very well prepared.

As stated on the page of the competition devoted to presenting the data (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/data):

Features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc).

In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.

The data preparation for the competition was carefully conducted to avoid any leak of information, and although the meaning of the features has been kept secret, it is quite clear that the different tags refer to specific kinds of features commonly used in motor insurance modeling:

ind refers to “individual characteristics”
car refers to “car characteristics”
calc refers to “calculated features”
reg refers to “regional/geographic features”

As for the individual features, there was much speculation about their meaning during the competition. See for instance:

https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41489, where Raddar suggests that the feature ps_car_13 could represent the distance driven between bi-yearly mandatory car checkups.
https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41488, where Raddar suggests that the feature ps_car_12 instead represents engine car cylinder capacity.
https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41057, where you can read about the suggestion that some features derive from Porto Seguro’s online quote form.

Despite these and other efforts, the meaning of most of the features has remained a mystery up until now.

The interesting facts about this competition are that:

The data is real-world, though the features are anonymous.
The data is very well prepared, without leakages of any sort (no magic features here – a magic feature is a feature that by skillful processing can provide high predictive power to your models in a Kaggle competition).
The test dataset not only holds the same categorical levels as the training dataset; it also seems to be from the same distribution, although Yuya Yamamoto argues that preprocessing the data with t-SNE leads to a failing adversarial validation test (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44784).

Exercise 1

As a first exercise, referring to the contents and the code in The Kaggle Book related to adversarial validation (starting from page 179), prove that the training and test data most probably originated from the same data distribution.

Exercise Notes (write down any notes or workings that will help you):

An interesting post by Tilii (Mensur Dlakic, Associate Professor at Montana State University: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42197) demonstrates using t-SNE that “there are many people who are very similar in terms of their insurance parameters, yet some of them will file a claim and others will not.” What Tilii mentions is quite typical of what happens in insurance, where for certain priors (insurance parameters) there is the same probability of something happening, but that event will happen or not based on how long we observe the sequence of events.

Take, for instance, IoT and telematic data in insurance. It is quite common to analyze a driver’s behavior to predict if they will file a claim in the future. If your observation period is too short (for instance, one year, as in the case of this competition), it may happen that even very bad drivers won’t have a claim because there is a low probability that such an event will occur in a short period of time, even for a bad driver. Similar ideas are discussed by Andy Harless (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42735), who argues instead that the real task of the competition is to guess "the value of a latent continuous variable that determines which drivers are more likely to have accidents" because actually "making a claim is not a characteristic of a driver; it’s a result of chance."

Understanding the evaluation metric

The metric used in the competition is the normalized Gini coefficient (named after the similar Gini coefficient/index used in economics), which has been previously used in another competition, the Allstate Claim Prediction Challenge (https://www.kaggle.com/competitions/ClaimPredictionChallenge). From that competition, we can get a very clear explanation of what this metric is about:

When you submit an entry, the observations are sorted from “largest prediction” to “smallest prediction.” This is the only step where your predictions come into play, so only the order determined by your predictions matters. Visualize the observations arranged from left to right, with the largest predictions on the left. We then move from left to right, asking “In the leftmost x% of the data, how much of the actual observed loss have you accumulated?” With no model, you can expect to accumulate 10% of the loss in 10% of the predictions, so no model (or a “null” model) achieves a straight line. We call the area between your curve and this straight line the Gini coefficient.

There is a maximum achievable area for a “perfect” model. We will use the normalized Gini coefficient by dividing the Gini coefficient of your model by the Gini coefficient of the perfect model.

There is no formulation proposed by the organizers of the competition for the Normalized Gini apart from this verbose description, but by reading the notebook from Mohsin Hasan (https://www.kaggle.com/code/tezdhar/faster-gini-calculation/notebook), we can figure out that it is calculated in two steps and can obtain some easy to understand pseudocode that reveals its inner workings. First, you get the Gini coefficient for your predictions, then you normalize it by dividing it by another Gini coefficient computed by pretending you have perfect predictions. Here is the pseudocode for the Gini coefficient:

order = indexes of sorted predictions (expressed as probabilities from lowest to highest)

sorted_actual = actual[order] = ground truth values sorted based on indexes of sorted predictions

cumsum_sorted_actual = cumulated sum of the sorted ground truth values

n = number of predictions

gini_coef = (sum(cumsum_sorted_actual ) / sum(sorted_actual ) - (n + 1) / 2) / n

Once you have the Gini coefficient for your predictions, you need to divide it by the Gini coefficient you compute using the ground truth values as if they were your predictions (the case of having perfect predictions):

norm_gini_coef = gini_coef(predictions) / gini_coef(ground truth)
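The pseudocode above can be turned into a few lines of NumPy. What follows is a minimal sketch rather than the organizers' official implementation: the function names simply mirror the pseudocode, and the use of a stable sort to break ties is our own assumption.

```python
import numpy as np

def gini_coef(actual, pred):
    """Gini coefficient following the pseudocode (ascending sort by prediction)."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    n = len(actual)
    # Indexes of predictions sorted from lowest to highest
    # (stable sort: tie handling is an assumption of this sketch)
    order = np.argsort(pred, kind="stable")
    sorted_actual = actual[order]
    cumsum_sorted_actual = np.cumsum(sorted_actual)
    return (cumsum_sorted_actual.sum() / sorted_actual.sum() - (n + 1) / 2) / n

def norm_gini_coef(actual, pred):
    """Normalize by the Gini coefficient of a perfect model."""
    return gini_coef(actual, pred) / gini_coef(actual, actual)

# Toy data, invented for illustration only
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.8, 0.3, 0.1])
print(norm_gini_coef(y_true, y_pred))
```

Because of the ascending sort, each single Gini coefficient comes out negative for a good model; the sign cancels in the ratio, so the normalized value is positive when the predictions rank the claims correctly.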

Another good explanation is provided in the notebook by Kilian Batzner: https://www.kaggle.com/code/batzner/gini-coefficient-an-intuitive-explanation. Using clear plots and some toy examples, Kilian makes sense of a metric that is not so common in general, yet is routinely used by the actuarial departments of insurance companies.

The metric can be approximated by the ROC-AUC score or the Mann–Whitney U non-parametric statistical test (since the U statistic is equivalent to the area under the receiver operating characteristic curve – AUC) because it approximately corresponds to 2 * ROC-AUC - 1. Hence, maximizing the ROC-AUC is the same as maximizing the normalized Gini coefficient (for a reference see the Relation to other statistical measures section in the Wikipedia entry: https://en.wikipedia.org/wiki/Gini_coefficient).
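To make the link with the Mann–Whitney U statistic concrete, here is a small sketch that computes the ROC-AUC from ranks and then applies the 2 * ROC-AUC - 1 relation. The helper names are hypothetical, and the sketch assumes a binary 0/1 target and predictions without ties.

```python
import numpy as np

def roc_auc_from_ranks(y_true, y_score):
    """ROC-AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    # Rank 1 goes to the lowest score, rank n to the highest
    ranks = np.empty(len(y_score))
    ranks[np.argsort(y_score)] = np.arange(1, len(y_score) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    # U counts the (positive, negative) pairs where the positive scores higher
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def gini_from_auc(y_true, y_score):
    """Normalized Gini obtained through the 2 * ROC-AUC - 1 relation."""
    return 2 * roc_auc_from_ranks(y_true, y_score) - 1

# Toy data, invented for illustration only
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.8, 0.3, 0.1])
print(roc_auc_from_ranks(y_true, y_pred), gini_from_auc(y_true, y_pred))
```

In this toy case, three of the four positive-negative pairs are ordered correctly, so the ROC-AUC is 0.75 and the corresponding Gini value is 0.5.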

The metric can also be approximately expressed as the covariance of scaled prediction rank and scaled target value, resulting in a more understandable rank association measure (see Dmitriy Guller: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/40576).

From the point of view of the objective function, you can optimize for the binary log-loss (as you would do in a classification problem). Neither ROC-AUC nor the normalized Gini coefficient is directly differentiable, so they can be used only as evaluation metrics on the validation set (for instance, for early stopping or for reducing the learning rate in a neural network). Bear in mind, however, that optimizing for the log-loss does not always improve the ROC-AUC or the normalized Gini coefficient.

There is actually a differentiable ROC-AUC approximation. You can read about how it works in Toon Calders and Szymon Jaroszewicz, Efficient AUC Optimization for Classification, European Conference on Principles of Data Mining and Knowledge Discovery, Springer, Berlin, Heidelberg, 2007: https://link.springer.com/content/pdf/10.1007/978-3-540-74976-9_8.pdf.

However, it seems that it is not necessary to use anything different from log-loss as an objective function and ROC-AUC or normalized Gini coefficient as an evaluation metric in the competition.

There are actually a few Python implementations for computing the normalized Gini coefficient among the Kaggle Notebooks. We have used here and suggest the work by CPMP (https://www.kaggle.com/code/cpmpml/extremely-fast-gini-computation/notebook) that uses Numba for speeding up computations: it is both exact and fast.

Exercise 2

In Chapter 5 of The Kaggle Book (page 95 onward), we explained how to deal with competition metrics, especially if they are new and generally unknown.

As an exercise, can you find out how many competitions on Kaggle have used the normalized Gini coefficient as an evaluation metric?

Exercise Notes (write down any notes or workings that will help you):

Examining the top solution ideas from Michael Jahrer

Michael Jahrer (https://www.kaggle.com/mjahrer, competition Grandmaster and one of the winners of the Netflix Prize in the team “BellKor’s Pragmatic Chaos”) led the public leaderboard for a long time and by a fair margin during the competition and was declared the winner when the private solutions were finally disclosed.

Shortly after, in the discussion forum, he published a short summary of his solution that has become a reference for many Kagglers because of his smart usage of denoising autoencoders and neural networks (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44629). Although Michael hasn’t accompanied his post with any Python code regarding his solution (he described his coding work as an “old-school” and “low-level” one, being directly written in C++/CUDA with no Python), his writing is quite rich in references to what models he has used as well as their hyperparameters and architectures.

First, Michael explains that his solution is a blend of six models (one LightGBM model and five neural networks). Moreover, since neither weighting the contribution of each model to the blend nor linear or non-linear stacking brought any advantage, likely because of overfitting, he states that he resorted to a plain blend in which all the models, built from different seeds, had equal weight.

Such insight makes it much easier for us to replicate his approach, also because he mentions that just blending the LightGBM’s results with those of one of the neural networks he built would have been enough to guarantee first place in the competition.

This insight will limit our exercise work to two good single models instead of a host of them. In addition, he mentioned having done little data processing, besides dropping some columns and one-hot encoding categorical features.
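An equal-weight blend of this kind amounts to a plain average of the models' predictions. The following is a minimal sketch under that assumption; the function name and the toy prediction arrays are hypothetical and are not taken from Michael Jahrer's post.

```python
import numpy as np

def blend_equal_weights(predictions):
    """Average the predictions of several models, giving each the same weight."""
    stacked = np.vstack([np.asarray(p, dtype=float) for p in predictions])
    return stacked.mean(axis=0)

# Hypothetical predicted probabilities from two models on three drivers
lgb_preds = np.array([0.02, 0.10, 0.05])
nn_preds = np.array([0.04, 0.08, 0.03])

blend = blend_equal_weights([lgb_preds, nn_preds])
print(blend)
```

Since the evaluation metric depends only on the order of the predictions, some competitors average ranks instead of raw probabilities; which variant works better is something to verify on your own cross-validation.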

Building a LightGBM submission

Our exercise starts by working out a solution based on LightGBM. You can find the code already set for execution using Kaggle Notebooks at this address: https://www.kaggle.com/code/lucamassaron/workbook-lgb. Although we made the code readily available, we instead suggest you type or copy the code directly from the book and execute it cell by cell; understanding what each line of code does and personalizing the solution can make it perform even better.

When using LightGBM you don’t have to, and should not, turn on any of the GPU or TPU accelerators. GPU acceleration could be helpful only if you have installed the GPU version of LightGBM. You can find working hints on how to install such a GPU-accelerated version on Kaggle Notebooks using this example: https://www.kaggle.com/code/lucamassaron/gpu-accelerated-lightgbm.

We start by importing key packages (NumPy, pandas, Optuna for hyperparameter optimization, LightGBM, and some utility functions). We also define a configuration class and instantiate it. We will discuss the parameters defined in the configuration class as we progress through the code. What is important to remark here is that, by using a class containing all your parameters, it will be easier for you to modify them consistently throughout the code. In the heat of competition, it is easy to forget to update a parameter that is referred to in multiple places in the code, and it is always difficult to set the parameters when they are dispersed among cells and functions. A configuration class can save you a lot of effort and spare you mistakes along the way:

import numpy as np
import pandas as pd
import optuna
import lightgbm as lgb
from path import Path
from sklearn.model_selection import StratifiedKFold
class Config:
    input_path = Path('../input/porto-seguro-safe-driver-prediction')
    optuna_lgb = False
    n_estimators = 1500
    early_stopping_round = 150
    cv_folds = 5
    random_state = 0
    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'learning_rate': 0.01,
              'max_bin': 25,
              'num_leaves': 31,
              'min_child_samples': 1500,
              'colsample_bytree': 0.7,
              'subsample_freq': 1,
              'subsample': 0.7,
              'reg_alpha': 1.0,
              'reg_lambda': 1.0,
              'verbosity': 0,
              'random_state': 0}
    
config = Config()

The next step requires importing the training, test, and sample submission datasets. We do this using the pandas read_csv function. We also set the index of the uploaded DataFrames to the identifier (the id column) of each data example.

Since features that belong to similar groupings are tagged (using ind, reg, car, and calc tags in their labels) and also binary and categorical features are easy to locate (they use the bin and cat tags, respectively, in their labels), we can enumerate them and record them in lists:

train = pd.read_csv(config.input_path / 'train.csv', index_col='id')
test = pd.read_csv(config.input_path / 'test.csv', index_col='id')
submission = pd.read_csv(config.input_path / 'sample_submission.csv', index_col='id')
calc_features = [feat for feat in train.columns if "_calc" in feat]
cat_features = [feat for feat in train.columns if "_cat" in feat]

Then, we just extract the target (a binary target of 0s and 1s) and remove it from the training dataset:

target = train["target"]
train = train.drop("target", axis="columns")

At this point, as pointed out by Michael Jahrer, we can drop the calc features. This idea has recurred a lot during the competition (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41970), especially in notebooks, because it could be empirically verified that dropping them improved both the local cross-validation score and the public leaderboard score (as a general rule, it’s important to keep track of both during feature selection). In addition, they also performed poorly in gradient boosting models (their importance is always below the average).

We can argue that, since they are engineered features, they contain no new information with respect to the original features and just add noise to any model trained on them:

train = train.drop(calc_features, axis="columns")
test = test.drop(calc_features, axis="columns")

Exercise 3

Based on the suggestions provided in The Kaggle Book on page 220 (Using feature importance to evaluate your work), as an exercise:

Code your own feature selection notebook for this competition.
Check what features should be kept and what should be discarded.

Exercise Notes (write down any notes or workings that will help you):

Categorical features are instead one-hot encoded. Because the same labels are present in the training and test datasets (the result of a careful train/test split arranged by the Porto Seguro team), instead of the usual scikit-learn OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) we are going to use the pandas get_dummies function (https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html). Since the pandas function may produce different encodings if the features and their levels differ between the training and test sets, we assert that the one-hot encoding results in the same columns for both:

train = pd.get_dummies(train, columns=cat_features)
test = pd.get_dummies(test, columns=cat_features)
assert((train.columns==test.columns).all())
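To see why the assertion matters, consider what happens when a categorical level appears in only one of the two datasets. The sketch below uses two invented toy DataFrames (the column name ps_toy_cat is hypothetical) and shows one possible remedy, reindexing the test columns on the training ones; this is not needed for the Porto Seguro data, where the levels match.

```python
import pandas as pd

# Toy data: level 2 of the categorical feature is missing from the test set
toy_train = pd.DataFrame({"ps_toy_cat": [0, 1, 2], "x": [0.1, 0.2, 0.3]})
toy_test = pd.DataFrame({"ps_toy_cat": [0, 1, 1], "x": [0.4, 0.5, 0.6]})

enc_train = pd.get_dummies(toy_train, columns=["ps_toy_cat"])
enc_test = pd.get_dummies(toy_test, columns=["ps_toy_cat"])

# The encodings differ: the test set lacks the ps_toy_cat_2 column
same_columns = list(enc_train.columns) == list(enc_test.columns)
print(same_columns)

# One possible fix: align the test columns to the training ones,
# filling the missing dummy column with zeros
enc_test_aligned = enc_test.reindex(columns=enc_train.columns, fill_value=0)
print(list(enc_test_aligned.columns))
```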

One-hot encoding the categorical features completes the data processing stage. We proceed to define our evaluation metric, the normalized Gini coefficient, as previously discussed. We will use the extremely fast Gini computation code proposed by CPMP, as mentioned before.

Since we are going to use a LightGBM model, we have to add a suitable wrapper (gini_lgb) to return to the GBM algorithm the evaluation of the training and the validation datasets in a form that can work with it (see https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html?highlight=higher_better#lightgbm.Booster.eval – Each evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples):

from numba import jit
@jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini
def gini_lgb(y_true, y_pred):
    eval_name = 'normalized_gini_coef'
    eval_result = eval_gini(y_true, y_pred)
    is_higher_better = True
    return eval_name, eval_result, is_higher_better
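As a sanity check on the metric, the following sketch repeats the same loop without the numba decorator and compares it, on a tiny hand-made example, with the relationship Gini = 2 * AUC - 1 that holds for binary targets when predictions have no ties. The names eval_gini_plain and pairwise_auc are hypothetical helpers introduced only for this illustration:

```python
import numpy as np

def eval_gini_plain(y_true, y_pred):
    """Same loop as eval_gini above, without the numba decorator."""
    y_true = np.asarray(y_true)[np.argsort(y_pred)]
    ntrue, gini, delta = 0, 0, 0
    n = len(y_true)
    for i in range(n - 1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    return 1 - 2 * gini / (ntrue * (n - ntrue))

def pairwise_auc(y_true, y_pred):
    """AUC as the share of (positive, negative) pairs ranked correctly (no ties assumed)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = sum(p > q for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 1, 0, 1]
y_pred = [0.1, 0.9, 0.4, 0.35]
gini = eval_gini_plain(y_true, y_pred)
auc = pairwise_auc(y_true, y_pred)
print(gini, 2 * auc - 1)
```

Three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75 and both expressions evaluate to 0.5.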

As for the training parameters, we found that the parameters suggested by Michael Jahrer in his post (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44629) work perfectly.

You may also try to come up with the same parameters, or similarly performing ones, by running a search with Optuna (https://optuna.org/) if you set the optuna_lgb flag to True in the Config class. Here the optimization tries to find the best values for key parameters, such as the learning rate and the regularization parameters, based on a five-fold cross-validation test on training data. To speed things up, early stopping on the validation fold itself is used (which, we are aware, could favor parameters that overfit the validation fold; a good alternative could be to remove the early stopping callback and keep a fixed number of training rounds):

if config.optuna_lgb:
        
    def objective(trial):
        params = {
    'learning_rate': trial.suggest_float("learning_rate", 0.01, 1.0),
    'num_leaves': trial.suggest_int("num_leaves", 3, 255),
    'min_child_samples': trial.suggest_int("min_child_samples", 
                                           3, 3000),
    'colsample_bytree': trial.suggest_float("colsample_bytree", 
                                            0.1, 1.0),
    'subsample_freq': trial.suggest_int("subsample_freq", 0, 10),
    'subsample': trial.suggest_float("subsample", 0.1, 1.0),
    'reg_alpha': trial.suggest_loguniform("reg_alpha", 1e-9, 10.0),
    'reg_lambda': trial.suggest_loguniform("reg_lambda", 1e-9, 10.0),
        }
        
        score = list()
        skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, 
                              random_state=config.random_state)
        for train_idx, valid_idx in skf.split(train, target):
            X_train = train.iloc[train_idx]
            y_train = target.iloc[train_idx]
            X_valid = train.iloc[valid_idx] 
            y_valid = target.iloc[valid_idx]
            model = lgb.LGBMClassifier(**params,
                                    n_estimators=1500,
                                    early_stopping_round=150,
                                    force_row_wise=True)
            callbacks=[lgb.early_stopping(stopping_rounds=150, 
                                          verbose=False)]
            model.fit(X_train, y_train, 
                      eval_set=[(X_valid, y_valid)],  
                      eval_metric=gini_lgb, callbacks=callbacks)
              
            score.append(
                model.best_score_['valid_0']['normalized_gini_coef'])
        return np.mean(score)
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=300)
    print("Best Gini Normalized Score", study.best_value)
    print("Best parameters", study.best_params)
    
    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'verbosity': 0,
              'random_state': 0}
    
    params.update(study.best_params)
    
else:
    params = config.params
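The regularization terms in the objective above are sampled on a logarithmic scale, which spreads the trials evenly across orders of magnitude rather than concentrating them near the upper bound. The following sketch reproduces the same ranges with the standard library only. It is a plain random sampler, not Optuna's own search strategy, and sample_log_uniform and sample_params are hypothetical helpers:

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_params(rng):
    """Hypothetical random sampler mirroring the ranges of the Optuna objective."""
    return {
        "learning_rate": rng.uniform(0.01, 1.0),
        "num_leaves": rng.randint(3, 255),          # both bounds inclusive
        "min_child_samples": rng.randint(3, 3000),
        "colsample_bytree": rng.uniform(0.1, 1.0),
        "subsample_freq": rng.randint(0, 10),
        "subsample": rng.uniform(0.1, 1.0),
        "reg_alpha": sample_log_uniform(rng, 1e-9, 10.0),
        "reg_lambda": sample_log_uniform(rng, 1e-9, 10.0),
    }

rng = random.Random(0)
candidates = [sample_params(rng) for _ in range(200)]

# About half of the draws fall below 1e-4, the midpoint of the range in log scale
share_small = sum(c["reg_alpha"] < 1e-4 for c in candidates) / len(candidates)
print(f"Share of reg_alpha draws below 1e-4: {share_small:.2f}")
```

Each sampled dictionary could be scored with the same cross-validation loop used in the objective function, which gives a simple baseline against which to compare the Optuna search.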

During the competition, Tilii tested feature elimination using Boruta (https://github.com/scikit-learn-contrib/boruta_py). You can find his kernel here: https://www.kaggle.com/code/tilii7/boruta-feature-elimination/notebook. As you can check, none of the calc features is confirmed as relevant by Boruta.

Exercise 4

In The Kaggle Book, we explain hyperparameter optimization (page 241 onward) and provide some key hyperparameters for the LightGBM model.

As an exercise:

Try to improve the hyperparameter search by Optuna by reducing or increasing the explored parameters where you deem it necessary, and also try alternative optimization methods, such as the random search or the halving search from scikit-learn (pages 245–246).

Exercise Notes (write down any notes or workings that will help you):

Once we have got our best parameters (or we simply try Jahrer’s ones), we are ready to train and predict. Our strategy, as suggested by the best solution, is to train a model on each cross-validation fold and use that fold to contribute to an average of test predictions. The snippet of code will produce both the test predictions and the out-of-fold predictions on the training dataset, which will be useful for figuring out how to ensemble the results:

preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
for idx, (train_idx, valid_idx) in enumerate(skf.split(train, 
                                                       target)):
    print(f"CV fold {idx}")
    X_train, y_train = train.iloc[train_idx], target.iloc[train_idx]
    X_valid, y_valid = train.iloc[valid_idx], target.iloc[valid_idx]
    
    model = lgb.LGBMClassifier(**params,
                               n_estimators=config.n_estimators,
                    early_stopping_round=config.early_stopping_round,
                               force_row_wise=True)
    
    callbacks=[lgb.early_stopping(stopping_rounds=150), 
               lgb.log_evaluation(period=100, show_stdv=False)]
                                                                                           
    model.fit(X_train, y_train, 
              eval_set=[(X_valid, y_valid)], 
              eval_metric=gini_lgb, callbacks=callbacks)
    metric_evaluations.append(
                model.best_score_['valid_0']['normalized_gini_coef'])
    preds += (model.predict_proba(test,  
              num_iteration=model.best_iteration_)[:,1] 
              / skf.n_splits)
    oof[valid_idx] = model.predict_proba(X_valid, 
                    num_iteration=model.best_iteration_)[:,1]
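The bookkeeping of the loop above can be isolated in a small sketch, where a trivial stand-in replaces LightGBM (it predicts the positive rate of its training fold) and a simple modulo split replaces StratifiedKFold. Both simplifications are assumptions made only to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_folds = 20, 6, 5
y = rng.integers(0, 2, size=n_train)

# Simple non-stratified fold assignment, for illustration only
fold_ids = np.arange(n_train) % n_folds

oof = np.zeros(n_train)            # one out-of-fold prediction per training row
preds = np.zeros(n_test)           # running average of the test predictions
times_filled = np.zeros(n_train, dtype=int)

for fold in range(n_folds):
    valid_idx = np.where(fold_ids == fold)[0]
    train_idx = np.where(fold_ids != fold)[0]
    # Stand-in "model": the positive rate observed on the training part of the fold
    fold_rate = y[train_idx].mean()
    oof[valid_idx] = fold_rate
    times_filled[valid_idx] += 1
    # Each fold contributes 1 / n_folds of the final test prediction
    preds += np.full(n_test, fold_rate) / n_folds

print(times_filled)
print(preds)
```

Every training row receives exactly one out-of-fold prediction, made by a model that never saw it, while each test row receives the average of the predictions of all the fold models.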

The model training shouldn’t take too long. In the end you can get the reported Normalized Gini Coefficient obtained during the cross-validation procedure:

print(f"LightGBM CV Gini Normalized Score: "
      f"{np.mean(metric_evaluations):0.3f} "
      f"({np.std(metric_evaluations):0.3f})")

The results are quite encouraging because the average score is 0.289 and the standard deviation of the values is quite small:

LightGBM CV Gini Normalized Score: 0.289 (0.015)

All that is left is to save the out-of-fold and test predictions as a submission and to verify the results on the public and private leaderboards:

submission['target'] = preds
submission.to_csv('lgb_submission.csv')
oofs = pd.DataFrame({'id':train_index, 'target':oof})
oofs.to_csv('lgb_oof.csv', index=False)

The obtained public score should be around 0.28442. The associated private score is about 0.29121, placing you in 29th position on the final leaderboard. This is quite a good result, but we still have to blend it with a different model, a neural network.

Bagging the training set (i.e., taking multiple bootstraps of the training data and training multiple models based on the bootstraps) should increase the performance, although, as Michael Jahrer himself noted in his post, not all that much.
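As a rough sketch of what taking bootstraps means in practice, the snippet below draws row indices with replacement and measures how many rows are left out of each bootstrap. The sizes are arbitrary and no model is trained; the comment marks where the training step would go:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_bags = 10_000, 5

oob_fractions = []
for bag in range(n_bags):
    # Sample row positions with replacement: some rows repeat, others are left out
    boot_idx = rng.integers(0, n_rows, size=n_rows)
    oob_fractions.append(1 - len(np.unique(boot_idx)) / n_rows)
    # A model would be trained on the rows selected by boot_idx here,
    # and its test predictions averaged with those of the other bags.

print([round(f, 3) for f in oob_fractions])
```

Each bootstrap leaves out roughly 37% of the rows, so every model in the bag sees a different view of the training data.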

Setting up a denoising autoencoder and a DNN

The next step is to set up a denoising autoencoder (DAE) and a neural network that can learn and predict from it. You can find the running code in this notebook: https://www.kaggle.com/code/lucamassaron/workbook-dae. The notebook can be run in GPU mode (it will therefore be speedier if you turn on the accelerators in the Kaggle Notebook), but it can also run in CPU mode with some slight modifications.

You can read more about denoising autoencoders being used in Kaggle competitions in The Kaggle Book, from page 230 onward.

Actually there are no examples that reproduce Michael Jahrer’s approach in the competition using DAEs, so we took an example from a TensorFlow implementation in another competition coded by OsciiArt (https://www.kaggle.com/code/osciiart/denoising-autoencoder).

Here we start by importing all the necessary packages, especially TensorFlow and Keras. Since we are going to create multiple neural networks, we instruct TensorFlow not to allocate all the available GPU memory up front by using the experimental set_memory_growth command. This will help us avoid memory overflow problems along the way. We also register the Leaky ReLU activation as a custom object, so we can refer to it by a string in the Keras layers:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from path import Path
import gc
import optuna
from sklearn.model_selection import StratifiedKFold
from scipy.special import erfinv
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.regularizers import l2
from tensorflow.keras.metrics import AUC
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.layers import Activation, LeakyReLU
get_custom_objects().update({'leaky-relu': Activation(LeakyReLU(alpha=0.2))})

Related to our intention of creating multiple neural networks without running out of memory, we also define a simple function for cleaning the memory in GPU and removing models that are no longer needed:

def gpu_cleanup(objects):
    if objects:
        del(objects)
    K.clear_session()
    gc.collect()

We also reconfigure the Config class to take into account multiple parameters related to the denoising autoencoder and the neural network. As previously stated for LightGBM, having all the parameters in a single place simplifies the process when you have to change them in a consistent way:

class Config:
    input_path = Path('../input/porto-seguro-safe-driver-prediction')
    dae_batch_size = 128
    dae_num_epoch = 50
    dae_architecture = [1500, 1500, 1500]
    reuse_autoencoder = False
    batch_size = 128
    num_epoch = 150
    units = [64, 32]
    input_dropout=0.06
    dropout=0.08
    regL2=0.09
    activation='selu'
    
    cv_folds = 5
    nas = False
    random_state = 0
    
config = Config()

As shown previously, we load the datasets and proceed to process the features by removing the calc features and one-hot encoding the categorical ones. We leave missing cases valued at -1, as Michael Jahrer pointed out in his solution:

train = pd.read_csv(config.input_path / 'train.csv', index_col='id')
test = pd.read_csv(config.input_path / 'test.csv', index_col='id')
submission = pd.read_csv(config.input_path / 'sample_submission.csv', index_col='id')
calc_features = [feat for feat in train.columns if "_calc" in feat]
cat_features = [feat for feat in train.columns if "_cat" in feat]
target = train["target"]
train = train.drop("target", axis="columns")
train = train.drop(calc_features, axis="columns")
test = test.drop(calc_features, axis="columns")
train = pd.get_dummies(train, columns=cat_features)
test = pd.get_dummies(test, columns=cat_features)
assert((train.columns==test.columns).all())

However, since we are dealing with neural networks, we have to normalize all the features that are not binary or one-hot-encoded categorical. Normalization implies rescaling (setting a limited range of values) and centering (your distribution will be centered on a certain value, usually zero).

Normalization will allow the optimization algorithm of both the autoencoder and the neural network to converge to a good solution faster because it reduces the danger of oscillations of the loss function during the optimization. In addition, normalization facilitates the propagation of the input through the activation functions.

Instead of using statistical normalization (bringing your distribution of values to have zero mean and unit standard deviation), we use GaussRank, a procedure that also reshapes the distribution of the variables into a Gaussian one. As also stated in some papers, such as Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (https://arxiv.org/pdf/1502.03167.pdf), neural networks perform even better if you provide them with a Gaussian input. According to this NVIDIA blog post, https://developer.nvidia.com/blog/gauss-rank-transformation-is-100x-faster-with-rapids-and-cupy/, GaussRank works most of the time, except when features are already normally distributed or are extremely asymmetrical (in such cases applying the transformation may lead to worsened performance):

print("Applying GaussRank to columns: ", end='')
to_normalize = list()
for k, col in enumerate(train.columns):
    if '_bin' not in col and '_cat' not in col and '_missing' not in col:
        to_normalize.append(col)
print(to_normalize)
def to_gauss(x): return np.sqrt(2) * erfinv(x) 
def normalize(data, norm_cols):
    n = data.shape[0]
    for col in norm_cols:
        sorted_idx = data[col].sort_values().index.tolist()
        uniform = np.linspace(start=-0.99, stop=0.99, num=n)
        normal = to_gauss(uniform)
        normalized_col = pd.Series(index=sorted_idx, data=normal)
        data[col] = normalized_col
    return data
train = normalize(train, to_normalize)
test = normalize(test, to_normalize)

The GaussRank transformation is applied separately to the train and test sets, on all the numeric features of our dataset:

Applying GaussRank to columns: ['ps_ind_01', 'ps_ind_03', 'ps_ind_14', 'ps_ind_15', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']
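To make the transformation concrete, here is a minimal sketch on a single, heavily skewed toy column. It relies on the identity sqrt(2) * erfinv(x) = Φ⁻¹((x + 1) / 2), where Φ⁻¹ is the inverse of the standard normal cumulative distribution, so the standard library NormalDist can stand in for scipy's erfinv. The function name gauss_rank and the toy values are assumptions made for the illustration:

```python
import numpy as np
from statistics import NormalDist

def gauss_rank(values):
    """Map values to Gaussian quantiles according to their rank."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    order = np.argsort(values, kind="stable")
    # Evenly spaced points in (-1, 1), as in the normalize function above
    uniform = np.linspace(-0.99, 0.99, num=n)
    # sqrt(2) * erfinv(u) is the standard normal quantile at (u + 1) / 2
    normal = np.array([NormalDist().inv_cdf((u + 1) / 2) for u in uniform])
    out = np.empty(n)
    out[order] = normal   # the smallest value gets the lowest quantile
    return out

skewed = np.array([1.0, 2.0, 4.0, 8.0, 1000.0])
transformed = gauss_rank(skewed)
print(np.round(transformed, 3))
```

The outlier at 1000 ends up at about 2.58, no farther from the center than the smallest value on the opposite side: only the ranks of the values matter, not their distances.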

After normalizing the features, we simply turn our data into a NumPy array of float32 values, the ideal input for a GPU:

features = train.columns
train_index = train.index
test_index = test.index
train = train.values.astype(np.float32)
test = test.values.astype(np.float32)

Next, we just prepare some useful functions, such as the evaluation function, the normalized Gini coefficient (based on the code described before), and a plotting function that helpfully represents a Keras model history of fitting on both training and validation sets:

def plot_keras_history(history, measures):
    rows = len(measures) // 2 + len(measures) % 2
    fig, panels = plt.subplots(rows, 2, figsize=(15, 5))
    plt.subplots_adjust(top = 0.99, bottom=0.01, 
                        hspace=0.4, wspace=0.2)
    try:
        panels = [item for sublist in panels for item in sublist]
    except:
        pass
    for k, measure in enumerate(measures):
        panel = panels[k]
        panel.set_title(measure + ' history')
        panel.plot(history.epoch, history.history[measure],  
                   label="Train "+measure)
        try:
            panel.plot(history.epoch,  
                       history.history["val_"+measure], 
                       label="Validation "+measure)
        except:
            pass
        panel.set(xlabel='epochs', ylabel=measure)
        panel.legend()
        
    plt.show()
from numba import jit
@jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini

The next functions are actually a bit more complex and more related to the functioning of both the denoising autoencoder and the supervised neural network. The batch_generator is a function that will create a generator that provides shuffled chunks of the data based on batch size. It isn’t actually used as a standalone generator but as part of a more complex batch generator that we will soon describe, the mixup_generator:

def batch_generator(x, batch_size, shuffle=True, random_state=None):
    batch_index = 0
    n = x.shape[0]
    while True:
        if batch_index == 0:
            index_array = np.arange(n)
            if shuffle:
                np.random.seed(seed=random_state)
                index_array = np.random.permutation(n)
        current_index = (batch_index * batch_size) % n
        if n >= current_index + batch_size:
            current_batch_size = batch_size
            batch_index += 1
        else:
            current_batch_size = n - current_index
            batch_index = 0
        batch = x[index_array[current_index: current_index + current_batch_size]]
        yield batch

The mixup_generator is a generator that returns batches of data whose values have been partially swapped to create some noise and augment the data, so that the DAE does not overfit to the training dataset. You can look at this generator as a way to inject random values into the dataset and create many more examples to be used for training. It works based on a swap rate of features, fixed at 15% as suggested by Michael Jahrer, meaning that in every batch 15% of the feature values of each example are replaced. It is also important to point out that, since the replacing values are picked from the very same features, they are not completely random: they come from the same distribution as the original features.

The function generates two distinct batches of data, one to be released to the model and another to be used as a source for the value to be swapped in the batch to be released. Based on a random choice, whose base probability is the swap rate, at each batch, a certain number of features will be swapped between the two batches.

This means that the DAE cannot always rely on the same features (since they can be randomly swapped from time to time) but instead has to concentrate on the whole of the features (something similar to dropout in a certain sense) to find relationships between them and correctly reconstruct the data at the end of the process:

def mixup_generator(X, batch_size, swaprate=0.15, shuffle=True, random_state=None):
    if random_state is None:
        random_state = np.random.randint(0, 999)
    num_features = X.shape[1]
    num_swaps = int(num_features * swaprate)    
    generator_a = batch_generator(X, batch_size, shuffle, 
                                  random_state)
    generator_b = batch_generator(X, batch_size, shuffle, 
                                  random_state + 1)
    while True:
        batch = next(generator_a)
        mixed_batch = batch.copy()
        effective_batch_size = batch.shape[0]
        alternative_batch = next(generator_b)
        assert((batch != alternative_batch).any())
        for i in range(effective_batch_size):
            swap_idx = np.random.choice(num_features, num_swaps, 
                                        replace=False)
            mixed_batch[i, swap_idx] = alternative_batch[i, swap_idx]
        yield (mixed_batch, batch)
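The swapping step can be checked in isolation with a small sketch. The helper swap_noise is hypothetical and reproduces only the inner loop of the generator. The toy columns are built so that each value reveals the column it came from, which makes it easy to verify that replacements always come from the same feature:

```python
import numpy as np

def swap_noise(batch, donor, swaprate=0.15, rng=None):
    """Replace a fraction of each row's features with the donor row's values."""
    rng = np.random.default_rng() if rng is None else rng
    num_features = batch.shape[1]
    num_swaps = int(num_features * swaprate)
    noisy = batch.copy()
    for i in range(batch.shape[0]):
        swap_idx = rng.choice(num_features, num_swaps, replace=False)
        noisy[i, swap_idx] = donor[i, swap_idx]
    return noisy

rng = np.random.default_rng(0)
# Column j holds values in [100 * j, 100 * j + 1), so its origin stays identifiable
base = np.arange(20) * 100.0
batch = base + rng.random((8, 20))
donor = base + rng.random((8, 20))

noisy = swap_noise(batch, donor, swaprate=0.15, rng=rng)
changed_per_row = (noisy != batch).sum(axis=1)
print(changed_per_row)
```

Every row has the same number of altered features, and each altered value still falls in the range of its own column, which is what distinguishes swap noise from adding arbitrary random values.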

The get_DAE is the function that builds the denoising autoencoder. It accepts a parameter for defining the architecture, which in our case has been set to three layers of 1,500 nodes each (as suggested by Michael Jahrer’s solution). The first layer should act as an encoder, the second is a bottleneck layer ideally containing the latent features capable of expressing the information in the data, and the last layer is a decoding layer capable of reconstructing the initial input data. The three layers have a relu activation function, no bias, and each one is followed by a batch normalization layer. The final output with the reconstructed input data has a linear activation. The training is optimized using an adam optimizer with standard settings (the optimized cost function is the mean squared error – mse):

def get_DAE(X, architecture=[1500, 1500, 1500]):
    features = X.shape[1]
    inputs = Input((features,))
    for i, nodes in enumerate(architecture):
        layer = Dense(nodes, activation='relu', 
                      use_bias=False, name=f"code_{i+1}")
        if i==0:
            x = layer(inputs)
        else:
            x = layer(x)
        x = BatchNormalization()(x)
    outputs = Dense(features, activation='linear')(x)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='mse', 
                  metrics=['mse', 'mae'])
    return model

The extract_dae_features function is reported here only for educational purposes. The function helps in the extraction of the values of specific layers of the trained denoising autoencoder. The extraction works by building a new model, combining the DAE input layer and the desired output one. A simple predict will then extract the values we need (the predict also allows us to fix the preferred batch size in order to fit any memory requirement).

In the case of the competition, given the number of observations and the number of features to be taken out from the autoencoder, if we were to use this function, the resulting dense matrix would be too large to be handled by the memory of a Kaggle Notebook. For this reason, our strategy won’t be to transform the original data into the autoencoder node values of the bottleneck layer but to instead fuse the autoencoder with its frozen layers up to the bottleneck with the supervised neural network, as we will be discussing soon:

def extract_dae_features(autoencoder, X, layers=[3], batch_size=128):
    data = []
    for layer in layers:
        if layer==0:
            data.append(X)
        else:
            get_layer_output = Model([autoencoder.layers[0].input], 
                                  [autoencoder.layers[layer].output])
            layer_output = get_layer_output.predict(X, 
                                              batch_size= batch_size)
            data.append(layer_output)
    data = np.hstack(data)
    return data
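A back-of-the-envelope calculation shows why the extracted matrix would be a problem. The row counts below are assumptions (approximate sizes of the competition's train and test files, which are not stated in this section), and dense_matrix_bytes is a hypothetical helper:

```python
def dense_matrix_bytes(n_rows, n_columns, bytes_per_value=4):
    """Memory needed by a dense float32 matrix, in bytes."""
    return n_rows * n_columns * bytes_per_value

n_train_rows = 595_212   # assumed number of rows in train.csv
n_test_rows = 892_816    # assumed number of rows in test.csv
layer_width = 1500       # nodes in each DAE layer

train_gb = dense_matrix_bytes(n_train_rows, layer_width) / 1e9
test_gb = dense_matrix_bytes(n_test_rows, layer_width) / 1e9
print(f"train: {train_gb:.2f} GB, test: {test_gb:.2f} GB per extracted layer")
```

Under these assumptions a single extracted layer takes several gigabytes for each dataset, before counting the copies made during training.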

To complete the work with the DAE, we have a final function wrapping all the previous ones into an unsupervised training procedure (at least partially unsupervised since there is an early stop monitor set on a validation set). The function sets up the mix-up generator, creates the denoising autoencoder architecture, and then trains it, monitoring its fit on a validation set for an early stop if there are signs of overfitting. Finally, before returning the trained DAE, it plots a graph of the training and validation fit and stores the model on disk.

Even if we try to fix a seed on this model, unlike the LightGBM model, the results are extremely variable, and they may influence the final ensemble results. Though the result will be a high-scoring one, it may land higher or lower on the public and private leaderboards. Since public results are strongly correlated with the private leaderboard, it will be easy for you to pick the best final submission based on its public results:

def autoencoder_fitting(X_train, X_valid, filename='dae',  
                        random_state=None, suppress_output=False):
    if suppress_output:
        verbose = 0
    else:
        verbose = 2
        print("Fitting a denoising autoencoder")
    tf.random.set_seed(seed=random_state)
    generator = mixup_generator(X_train, 
                                batch_size=config.dae_batch_size, 
                                swaprate=0.15, 
                                random_state=config.random_state)
                                
    dae = get_DAE(X_train, architecture=config.dae_architecture)
    # Keras expects an integer number of steps, so cast the result of np.ceil
    steps_per_epoch = int(np.ceil(X_train.shape[0] / 
                                  config.dae_batch_size))
    early_stopping = EarlyStopping(monitor='val_mse', 
                                mode='min', 
                                patience=5, 
                                restore_best_weights=True,
                                verbose=0)
    history = dae.fit(generator,
                    steps_per_epoch=steps_per_epoch,
                    epochs=config.dae_num_epoch,
                    validation_data=(X_valid, X_valid),
                    callbacks=[early_stopping],
                    verbose=verbose)
    if not suppress_output: plot_keras_history(history, 
                                           measures=['mse', 'mae'])
    dae.save(filename)
    return dae

Having dealt with the DAE, we also take the chance to define the supervised neural model further down the pipeline, which should predict our claim expectations. As a first step, we define a function that builds a single dense block of the network, characterized by:

Random normal initialization, since empirically it has been found to converge to better results in this problem.
A dense layer with L2 regularization and a customizable activation function.
A tunable dropout layer, which can be easily included or excluded from the architecture.

Here is the code for creating the dense blocks:

def dense_blocks(x, units, activation, regL2, dropout):
    kernel_initializer = keras.initializers.RandomNormal(mean=0.0, 
                                stddev=0.1, seed=config.random_state)
    for k, layer_units in enumerate(units):
        if regL2 > 0:
            x = Dense(layer_units, activation=activation, 
                      kernel_initializer=kernel_initializer, 
                      kernel_regularizer=l2(regL2))(x)
        else:
            x = Dense(layer_units, 
                      kernel_initializer=kernel_initializer, 
                      activation=activation)(x)
        if dropout > 0:
            x = Dropout(dropout)(x)
    return x

As you may have already noticed, the function defining the single layer is quite customizable. The same goes for the wrapper architecture function, taking inputs for the number of layers and units in them, dropout probabilities, regularization, and activation type. The idea is to be able to run a neural architecture search (NAS) and figure out what configuration should perform better in our problem.

As a final note on the function, among the inputs, it is required to provide the trained DAE because its inputs are used as the neural network model inputs while its first layers are connected to the DAE’s bottleneck layer (the middle layer in the DAE architecture). In such a way we are de facto concatenating the two models into one (although the DAE weights are frozen anyway and not trainable).

This solution has been devised to avoid transforming all your training data at once: only the single batches that the neural network is processing are transformed, thus saving memory in the system:

def dnn_model(dae, units=[4500, 1000, 1000], 
            input_dropout=0.1, dropout=0.5,
            regL2=0.05,
            activation='relu'):
    
    inputs = dae.get_layer("code_2").output
    if input_dropout > 0:
        x = Dropout(input_dropout)(inputs)
    else:
        x = tf.keras.layers.Layer()(inputs)
    x = dense_blocks(x, units, activation, regL2, dropout)
    outputs = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=dae.input, outputs=outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                loss=keras.losses.binary_crossentropy,
                metrics=[AUC(name='auc')])
    return model

We conclude with a wrapper for the training process, including all the steps in order to train the entire pipeline on a cross-validation fold:

def model_fitting(X_train, y_train, X_valid, y_valid, autoencoder, 
                 filename, random_state=None, suppress_output=False):
        if suppress_output:
            verbose = 0
        else:
            verbose = 2
            print("Fitting model")
        early_stopping = EarlyStopping(monitor='val_auc', 
                                    mode='max', 
                                    patience=10, 
                                    restore_best_weights=True,
                                    verbose=0)
        rlrop = ReduceLROnPlateau(monitor='val_auc', 
                                mode='max',
                                patience=2,
                                factor=0.75,
                                verbose=0)
        
        tf.random.set_seed(seed=random_state)
        model = dnn_model(autoencoder,
                    units=config.units,
                    input_dropout=config.input_dropout,
                    dropout=config.dropout,
                    regL2=config.regL2,
                    activation=config.activation)
        
        history = model.fit(X_train, y_train, 
                            epochs=config.num_epoch, 
                            batch_size=config.batch_size, 
                            validation_data=(X_valid, y_valid),
                            callbacks=[early_stopping, rlrop],
                            shuffle=True,
                            verbose=verbose)
        model.save(filename)
        
        if not suppress_output:  
            plot_keras_history(history, measures=['loss', 'auc'])
        return model, history

Since our DAE implementation is surely different from Jahrer’s, although the idea behind it is the same, we cannot rely completely on his observations on the architecture of the supervised neural network, and we have to search for the best configuration, just as we searched for the best hyperparameters of the LightGBM model. Using Optuna and leveraging the multiple parameters that we set to configure the network’s architecture, we can run this code snippet for some hours and get an idea about what could work better.

In our experiments we found that:

We should use a two-layer network with fewer nodes, 64 and 32 respectively.
Input dropout, dropout between layers, and some L2 regularization do help.
It is better to use the SELU activation function.

Here is the code snippet for running the entire optimization experiments:

if config.nas is True:
    def evaluate():
        metric_evaluations = list()
        skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
        for k, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
            
            # train is a NumPy array here, so index the target by position as well
            X_train, y_train = train[train_idx, :], np.asarray(target)[train_idx]
            X_valid, y_valid = train[valid_idx, :], np.asarray(target)[valid_idx]
            if config.reuse_autoencoder:
                autoencoder = load_model(f"./dae_fold_{k}")
            else:
                autoencoder = autoencoder_fitting(X_train, X_valid,
                                                filename=f'./dae_fold_{k}', 
                                                random_state=config.random_state,
                                                suppress_output=True)
            
            model, _ = model_fitting(X_train, y_train, X_valid, y_valid,
                                        autoencoder=autoencoder,
                                        filename=f"dnn_model_fold_{k}", 
                                        random_state=config.random_state,
                                        suppress_output=True)
            
            val_preds = model.predict(X_valid, batch_size=128, verbose=0)
            best_score = eval_gini(y_true=y_valid, y_pred=np.ravel(val_preds))
            metric_evaluations.append(best_score)
            
            gpu_cleanup([autoencoder, model])
        
        return np.mean(metric_evaluations)
    def objective(trial):
        params = {
                'first_layer': trial.suggest_categorical("first_layer", [8, 16, 32, 64, 128, 256, 512]),
                'second_layer': trial.suggest_categorical("second_layer", [0, 8, 16, 32, 64, 128, 256]),
                'third_layer': trial.suggest_categorical("third_layer", [0, 8, 16, 32, 64, 128, 256]),
                'input_dropout': trial.suggest_float("input_dropout", 0.0, 0.5),
                'dropout': trial.suggest_float("dropout", 0.0, 0.5),
                'regL2': trial.suggest_float("regL2", 0.0, 0.1),
                'activation': trial.suggest_categorical("activation", ['relu', 'leaky-relu', 'selu'])
        }
        config.units = [nodes for nodes in [params['first_layer'], params['second_layer'], params['third_layer']] if nodes > 0]
        config.input_dropout = params['input_dropout']
        config.dropout = params['dropout']
        config.regL2 = params['regL2']
        config.activation = params['activation']
        
        return evaluate()
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=60)
    print("Best Gini Normalized Score", study.best_value)
    print("Best parameters", study.best_params)
    config.units = [nodes for nodes in [study.best_params['first_layer'], study.best_params['second_layer'], study.best_params['third_layer']] if nodes > 0]
    config.input_dropout = study.best_params['input_dropout']
    config.dropout = study.best_params['dropout']
    config.regL2 = study.best_params['regL2']
    config.activation = study.best_params['activation']
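
The list comprehension that builds config.units is what lets a single search space cover networks with one, two, or three layers: a sampled value of 0 simply drops that layer. Here is a minimal sketch of that step in isolation; the helper name build_units is ours for illustration and is not part of the original notebook:

```python
def build_units(first_layer, second_layer, third_layer):
    """Keep only the layers that were assigned a positive number of nodes."""
    return [nodes for nodes in (first_layer, second_layer, third_layer) if nodes > 0]

# A trial that samples 64, 32 and 0 describes a two-layer network
two_layers = build_units(64, 32, 0)
# A zero in the middle is skipped as well, so this is also a two-layer network
skipped_middle = build_units(128, 0, 16)
# With both optional layers set to 0, only the first layer survives
one_layer = build_units(256, 0, 0)

print(two_layers, skipped_middle, one_layer)
```

One consequence of this design is that different trials can map to the same architecture (for instance, 64, 0, 32 and 64, 32, 0), so part of the search budget may go to re-evaluating equivalent networks.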

Exercise 5

If you are looking for more information about neural architecture search (NAS), have a look at The Kaggle Book, from page 276 onward. In the case of the DAE and the supervised neural network, it is critical to look for the best architecture since we are implementing something surely different from Michael Jahrer’s solution.

As an exercise, try to improve the hyperparameter search by using KerasTuner (to be found on page 285 onward in The Kaggle Book), a fast solution for optimizing neural networks that includes the contribution of François Chollet, the creator of Keras.

Exercise Notes (write down any notes or workings that will help you):

With everything finally in place, we are ready to start the training. In about one hour, on a Kaggle Notebook with GPU, you can obtain complete test and out-of-fold predictions:

preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
for k, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
    print(f"CV fold {k}")
    
    X_train, y_train = train[train_idx, :], target[train_idx]
    X_valid, y_valid = train[valid_idx, :], target[valid_idx]
    if config.reuse_autoencoder:
        print("restoring previously trained dae")
        autoencoder = load_model(f"./dae_fold_{k}")
    else:
        autoencoder = autoencoder_fitting(X_train, X_valid,
                                        filename=f'./dae_fold_{k}', 
                                        random_state=config.random_state)
    
    model, history = model_fitting(X_train, y_train, X_valid, y_valid,
                                autoencoder=autoencoder,
                                filename=f"dnn_model_fold_{k}", 
                                random_state=config.random_state)
    
    val_preds = model.predict(X_valid, batch_size=128)
    best_score = eval_gini(y_true=y_valid, 
                           y_pred=np.ravel(val_preds))
    best_epoch = np.argmax(history.history['val_auc']) + 1
    print(f"[best epoch is {best_epoch}]\tvalidation_0-gini_dnn: {best_score:0.5f}\n")
    
    metric_evaluations.append(best_score)
    preds += (model.predict(test, batch_size=128).ravel() / 
              skf.n_splits)
    oof[valid_idx] = np.ravel(val_preds)  # reuse the validation predictions computed above
    gpu_cleanup([autoencoder, model])

As we did with the LightGBM model, we can get an idea of the results by looking at the average fold normalized Gini coefficient:

print(f"DNN CV Gini Normalized Score: {np.mean(metric_evaluations):0.3f} ({np.std(metric_evaluations):0.3f})")

The results fall somewhat short of what we previously obtained with LightGBM:

DNN CV Gini Normalized Score: 0.276 (0.015)

Producing the submission and submitting it will result in a public score of about 0.27737 and a private score of about 0.28471 (results may vary wildly as we previously mentioned) – not quite a high score:

submission['target'] = preds
submission.to_csv('dnn_submission.csv')
oofs = pd.DataFrame({'id':train_index, 'target':oof})
oofs.to_csv('dnn_oof.csv', index=False)

The modest results from the neural network seem to confirm the idea that neural networks underperform on tabular problems. As Kagglers, however, we know that every model can contribute to a successful placing on the leaderboard; we just need to figure out how best to use it. Surely, a neural network fed by an autoencoder has worked out a solution that is less affected by noise in the data and that processes the information differently from a GBM.

Ensembling the results

Now that we have two models, what’s left is to blend them and see if we can improve the results. As suggested by Jahrer, we go straight for a blend, but we do not limit ourselves to a plain average of the two (since our approach has ended up differing slightly from Jahrer’s): we also try to find optimal weights for the blend. We start by importing the out-of-fold predictions and getting our evaluation function ready:

import pandas as pd
import numpy as np
from numba import jit
@jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini
lgb_oof = pd.read_csv("../input/workbook-lgb/lgb_oof.csv")
dnn_oof = pd.read_csv("../input/workbook-dae/dnn_oof.csv")
target = pd.read_csv("../input/porto-seguro-safe-driver-prediction/train.csv", usecols=['id','target'])

Once done, we convert the out-of-fold predictions of the LightGBM model and of the neural network into ranks. We do so because the normalized Gini coefficient is based on rankings (as a ROC-AUC evaluation would be), and consequently blending rankings works better than blending the predicted probabilities:

lgb_oof_ranks = (lgb_oof.target.rank() / len(lgb_oof))
dnn_oof_ranks = (dnn_oof.target.rank() / len(dnn_oof))
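
The claim that only the ordering matters can be checked directly: for a binary target, the normalized Gini coefficient equals 2 * AUC - 1, so any strictly increasing transformation of the predictions leaves it unchanged. Below is a minimal sketch on synthetic data; the helper names and the simulated scores are our own assumptions and do not come from the competition notebooks:

```python
import numpy as np

def ordinal_ranks(x):
    """Return ranks 1..n for the values in x (ties broken by position)."""
    x = np.asarray(x)
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(1, len(x) + 1)
    return ranks

def gini_normalized(y_true, y_pred):
    """Normalized Gini for a binary target, computed as 2 * AUC - 1."""
    y_true = np.asarray(y_true)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    # Mann-Whitney U statistic: rank sum of the positives minus its minimum value
    u = ordinal_ranks(y_pred)[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return 2 * u / (n_pos * n_neg) - 1

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)             # synthetic binary target
scores = y + rng.normal(size=1000)            # simulated raw model scores
probs = 1 / (1 + np.exp(-scores))             # strictly increasing rescaling
ranks = ordinal_ranks(scores) / len(scores)   # rank transform, as in rank() / len()

g_scores = gini_normalized(y, scores)
g_probs = gini_normalized(y, probs)
g_ranks = gini_normalized(y, ranks)
print(g_scores, g_probs, g_ranks)
```

The three values should coincide, since the three inputs sort the observations in the same order. Converting to ranks therefore costs nothing for each single model, while putting both models on the same uniform scale before blending, so that the blend weights are not distorted by differences in how each model spreads its predicted probabilities.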

Now we just test if, by combining the two models using different weights, we can get a better evaluation of the out-of-fold data:

baseline = eval_gini(y_true=target.target, y_pred=lgb_oof_ranks)
print(f"starting from a oof lgb baseline {baseline:0.5f}\n")
best_alpha = 1.0
for alpha in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    ensemble = alpha * lgb_oof_ranks + (1.0 - alpha) * dnn_oof_ranks
    score = eval_gini(y_true=target.target, y_pred=ensemble)
    print(f"lgd={alpha:0.1f} dnn={(1.0 - alpha):0.1f} -> {score:0.5f}")
    
    if score > baseline:
        baseline = score
        best_alpha = alpha
        
print(f"\nBest alpha is {best_alpha:0.1f}")

Running the snippet gives some interesting results:

starting from a oof lgb baseline 0.28850
lgd=0.1 dnn=0.9 -> 0.27352
lgd=0.2 dnn=0.8 -> 0.27744
lgd=0.3 dnn=0.7 -> 0.28084
lgd=0.4 dnn=0.6 -> 0.28368
lgd=0.5 dnn=0.5 -> 0.28595
lgd=0.6 dnn=0.4 -> 0.28763
lgd=0.7 dnn=0.3 -> 0.28873
lgd=0.8 dnn=0.2 -> 0.28923
lgd=0.9 dnn=0.1 -> 0.28916
Best alpha is 0.8

It seems that a blend with a strong weight (0.8) on the LightGBM model and a weaker one (0.2) on the neural network produces an even better-performing model. We immediately test this hypothesis by preparing two blends: one with equal weights for the two models and one with the ideal weights that we have found:

lgb_submission = pd.read_csv("../input/workbook-lgb/lgb_submission.csv")
dnn_submission = pd.read_csv("../input/workbook-dae/dnn_submission.csv")
submission = pd.read_csv(
"../input/porto-seguro-safe-driver-prediction/sample_submission.csv")

First, we try the equal weights solution, which was the strategy used by Michael Jahrer:

lgb_ranks = (lgb_submission.target.rank() / len(lgb_submission))
dnn_ranks = (dnn_submission.target.rank() / len(dnn_submission))
submission.target = lgb_ranks * 0.5 + dnn_ranks * 0.5
submission.to_csv("equal_blend_rank.csv", index=False)

It leads to a public score of 0.28393 and a private score of 0.29093, which is around 50th position on the final leaderboard, somewhat short of our expectations. Now let’s try using the weights that the out-of-fold predictions helped us to find:

lgb_ranks = (lgb_submission.target.rank() / len(lgb_submission))
dnn_ranks = (dnn_submission.target.rank() / len(dnn_submission))
submission.target = lgb_ranks * best_alpha +  dnn_ranks * (1.0 - best_alpha)
submission.to_csv("blend_rank.csv", index=False)

Here the results lead to a public score of 0.28502 and a private score of 0.29192, which turns out to be around seventh position on the final leaderboard. This is a much better result indeed: LightGBM is a good model, but it probably misses some nuances in the data that the neural network trained on the denoised data can supply.

Exercise 6

As pointed out by CPMP in their solution (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44614), depending on how you build your cross-validation, you can experience a “huge variation of Gini scores among folds.” For this reason, CPMP suggests decreasing the variance of the estimates by using many different seeds for multiple cross-validations and averaging the results.

As an exercise, try to modify the code we used so that it produces more stable predictions, especially for the denoising autoencoder.
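
As a possible starting point, the seed-averaging idea can be sketched as an outer loop over seeds wrapped around the usual fold loop. The sketch below is self-contained and runs on synthetic data: stratified_folds is a simplified stand-in for StratifiedKFold, and fit_predict is a hypothetical placeholder (a small linear scorer) for the autoencoder_fitting and model_fitting calls, so none of it should be read as the notebook's actual code:

```python
import numpy as np

def stratified_folds(y, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs that preserve class proportions."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    for k in range(n_splits):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)

def fit_predict(X_tr, y_tr, X_va, X_te, seed):
    """Hypothetical model: least squares fitted on a seed-dependent bootstrap."""
    rng = np.random.default_rng(seed)
    boot = rng.integers(0, len(X_tr), size=len(X_tr))
    coef, *_ = np.linalg.lstsq(X_tr[boot], y_tr[boot], rcond=None)
    return X_va @ coef, X_te @ coef

def multi_seed_cv(X, y, X_test, seeds, n_splits=5):
    """Average out-of-fold and test predictions over several seeded CV runs."""
    oof = np.zeros(len(X))
    preds = np.zeros(len(X_test))
    for seed in seeds:
        for train_idx, valid_idx in stratified_folds(y, n_splits, seed):
            val_preds, test_preds = fit_predict(
                X[train_idx], y[train_idx], X[valid_idx], X_test, seed)
            # Each row is validated once per seed, so divide by the number of seeds
            oof[valid_idx] += val_preds / len(seeds)
            # Test rows are predicted once per fold and per seed
            preds += test_preds / (len(seeds) * n_splits)
    return oof, preds

# Synthetic data, used only to show that the loop runs end to end
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
X_test = rng.normal(size=(200, 5))

oof, preds = multi_seed_cv(X, y, X_test, seeds=[0, 1, 2])
print(oof.shape, preds.shape)
```

In the actual notebook, the seed would presumably be passed as random_state to StratifiedKFold, autoencoder_fitting, and model_fitting, and the saved file names would need to include the seed so that runs do not overwrite each other.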

Exercise Notes (write down any notes or workings that will help you):

Summary

In this first chapter, you have dealt with a classical tabular competition. By reading the notebooks and discussions of the competition, we have come up with a simple solution involving just two models that can be easily blended. In particular, we have offered an example of how to use a denoising autoencoder in order to produce alternative data processing, particularly useful when operating with neural networks for tabular data. By understanding and replicating solutions from past competitions, you can quickly build up your core competencies on Kaggle competitions and quickly become able to perform consistently higher in more recent competitions and challenges.

In the next chapter, we will explore another tabular competition from Kaggle, this time revolving around a complex prediction problem with time series.

Join our book’s Discord space

Join our Discord community to meet like-minded people and learn alongside more than 2000 members at:

https://packt.link/KaggleDiscord

About the Authors

Konrad Banachewicz

Konrad Banachewicz is the author of the bestselling The Kaggle Book and The Kaggle Workbook. He is a data science manager with experience stretching longer than he likes to ponder on. He holds a PhD in statistics from Vrije Universiteit Amsterdam, where he focused on problems of extreme dependency modeling in credit risk. He slowly moved from classic statistics towards machine learning and into the business applications world.
Luca Massaron

Having joined Kaggle over 10 years ago, Luca Massaron is a Kaggle Grandmaster in discussions and a Kaggle Master in competitions and notebooks. In Kaggle competitions he reached no. 7 in the worldwide rankings. On the professional side, Luca is a data scientist with more than a decade of experience in transforming data into smarter artifacts, solving real-world problems, and generating value for businesses and stakeholders. He is a Google Developer Expert (GDE) in machine learning and the author of best-selling books on AI, machine learning, and algorithms.