
The Regularization Cookbook

By Vincent Vandenbussche
  1. Free Chapter
    Chapter 2: Machine Learning Refresher
About this book
Regularization is a reliable way to improve a model's results on unseen data. Applying it is challenging, however, because it comes in multiple forms and the appropriate technique must be chosen for every model. The Regularization Cookbook provides you with the appropriate tools and methods to handle a wide range of cases, with ready-to-use working code as well as theoretical explanations. After an introduction to regularization and methods to diagnose when to use it, you’ll start implementing regularization techniques on linear models, such as linear and logistic regression, and tree-based models, such as random forest and gradient boosting. You’ll then be introduced to specific regularization methods based on data, high cardinality features, and imbalanced datasets. In the last five chapters, you’ll discover regularization for deep learning models. After reviewing general methods that apply to any type of neural network, you’ll dive into more NLP-specific methods for RNNs and transformers, as well as using BERT or GPT-3. Finally, you’ll explore regularization for computer vision, covering CNN specifics, along with the use of generative models such as Stable Diffusion and DALL-E. By the end of this book, you’ll be armed with different regularization techniques to apply to your ML and DL models.
Publication date:
July 2023
Publisher
Packt
Pages
424
ISBN
9781837634088

 

Machine Learning Refresher

Machine learning (ML) is much more than just models. It is about following a certain process and best practices. This chapter will provide a refresher on these: from loading data and model evaluation to model training and optimization, the main steps and methods will be explained here.

In this chapter, we are going to cover the following main topics:

  • Loading data
  • Splitting data
  • Preparing quantitative data
  • Preparing qualitative data
  • Training a model
  • Evaluating a model
  • Performing hyperparameter optimization

Even though the recipes in this chapter are independent from a methodological standpoint, they build upon each other and are meant to be executed sequentially.

 

Technical requirements

In this chapter, you will need to be able to run code to load datasets, prepare data, and train, optimize, and evaluate ML models. To do so, you will need the following libraries:

  • numpy
  • pandas
  • scikit-learn

They can be installed using pip with the following command line:

pip install numpy pandas scikit-learn

Note

In this book, some best practices such as using virtual environments won’t be explicitly mentioned. However, it is highly recommended that you use a virtual environment before installing any library using pip or any other package manager.

 

Loading data

The primary focus of this recipe is to load data from a CSV file. However, this is not the only thing that this recipe covers. Since the data is usually the first step in any ML project, this recipe is also a good opportunity to give a quick recap of the ML workflow, as well as the different types of data.

Getting ready

Before loading the data, we should keep in mind that an ML model follows a two-step process:

  1. Train a model on a given dataset to create a new model.
  2. Reuse the previously trained model to infer predictions on new data.

These two steps are summarized in the following figure:

Figure 2.1 – A simple view of the two-step ML process


Of course, in most cases, this is a rather simplistic view. A more detailed view can be seen in Figure 2.2:

Figure 2.2 – A more complete view of the ML process


Let’s take a closer look at the training part of the ML process shown in Figure 2.2:

  1. First, training data is queried from a data source (this can be a database, a data lake, an open dataset, and so on).
  2. The data is preprocessed, such as via feature engineering, rescaling, and so on.
  3. A model is trained and stored (on a data lake, locally, on the edge, and so on).
  4. Optionally, the output of this model is post-processed – for example, via formatting, heuristics, business rules, and more.
  5. Optionally again, this model (with or without postprocessing) is stored in a database for later reference or evaluation if needed.

Now, let’s take a look at the inference part of the ML process:

  1. The data is queried from a data source (a database, an API query, and so on).
  2. The data goes through the same preprocessing step as the training data.
  3. The trained model is fetched if it doesn’t already exist locally.
  4. The model is used to infer output.
  5. Optionally, the output of the model is post-processed via the same post-processing step as the training data.
  6. Optionally, the output is stored in a database for monitoring and later reference.
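The training and inference steps listed above can be condensed into a minimal sketch. The toy data, the in-memory storage with pickle, and the variable names are all assumptions made for illustration; a real project would query a data source and store the model elsewhere.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# --- Training side ---
# Hypothetical toy data: one feature, label is 1 when the feature is large
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Preprocessing is fitted on the training data only
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# "Store" the trained artifacts (here, simply as bytes in memory)
stored = pickle.dumps((scaler, model))

# --- Inference side ---
# Fetch the trained artifacts
scaler_loaded, model_loaded = pickle.loads(stored)

# New data goes through the same preprocessing before prediction
X_new = np.array([[0.5], [11.5]])
y_new = model_loaded.predict(scaler_loaded.transform(X_new))
print(y_new)
```

Note that the scaler is stored together with the model: the inference side needs both to reproduce the preprocessing used during training.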

Even in this schema, many steps were not mentioned: splitting data for training purposes, using evaluation metrics, cross-validation, hyperparameter optimization, and others. This chapter will dive into the more training-specific steps and apply them to the very common but practical Titanic dataset, a binary classification problem. But first, we need to load the data.

To do so, you must download the Titanic dataset training set locally. This can be performed with the following command line:

wget https://raw.githubusercontent.com/PacktPublishing/The-Regularization-Cookbook/main/chapter_02/train.csv

How to do it…

This recipe is about loading a CSV file and displaying its first few rows so that we can have a first glance at what the data looks like:

  1. The first step is to import the required libraries. Here, the only library we need is pandas:
    import pandas as pd
  2. Now, we can load the data using the read_csv function provided by pandas. The first argument is the path to the file. Assuming the file is named train.csv and located in the current folder, we only have to provide train.csv as an argument:
    # Load the data as a DataFrame
    df = pd.read_csv('train.csv')

The returned object is a DataFrame, which provides many useful methods for data processing.

  3. Now, we can display the first five lines of the loaded file using the .head() method:
    # Display the first 5 rows of the dataset
    df.head()

This code will output the following:

   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S

Here is a description of the data types in each column:

  • PassengerId (qualitative): A unique, arbitrary ID for each passenger.
  • Survived (qualitative): 1 for yes, 0 for no. This is our label, so this is a binary classification problem.
  • Pclass (quantitative, discrete): The class, which is arguably quantitative. Is class 1 better than class 2? Most likely yes.
  • Name (unstructured): The name and title of the passenger.
  • Sex (qualitative): The registered sex of the passenger, either male or female.
  • Age (quantitative, discrete): The age of the passenger.
  • SibSp (quantitative, discrete): The number of siblings and spouses on board.
  • Parch (quantitative, discrete): The number of parents and children on board.
  • Ticket (unstructured): The ticket reference.
  • Fare (quantitative, continuous): The ticket price.
  • Cabin (unstructured): The cabin number, which is arguably unstructured. It can be seen as a qualitative feature with high cardinality.
  • Embarked (qualitative): The embarked city, either Southampton (S), Cherbourg (C), or Queenstown (Q).
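As a quick check on the descriptions above, the following sketch separates columns by their pandas dtype. It uses four rows copied from the displayed output and loaded from an in-memory string (an assumption made so that the snippet runs without the file). The variable names are hypothetical.

```python
import io

import pandas as pd

# Four rows in the same layout as train.csv (values copied from the output above)
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Columns stored as numbers versus everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("other:  ", other_cols)
```

PassengerId and Survived land in the numeric group even though they are qualitative. The storage type of a column is only a hint, so the quantitative or qualitative nature of each feature still has to be decided by looking at what it means.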

There’s more…

Let’s talk about the different types of data that are available. Data is a very generic word and can describe many things. We are surrounded by data all the time. One way to characterize data is through pairs of opposites.

Data can be structured or unstructured:

  • Structured data comes in the form of tables, databases, Excel files, CSV files, and JSON files.
  • Unstructured data does not fit in a table: it can be text, sound, image, videos, and so on. Even though we tend to force it into a tabular representation, this kind of data does not naturally fit in an Excel table.

Data can be quantitative or qualitative.

Quantitative data is ordered. Here are some examples:

  • €100 is greater than €10
  • 1.8 meters is taller than 1.6 meters
  • 18 years old is younger than 80 years old

Qualitative data has no intrinsic order, as shown here:

  • Blue is not intrinsically better than red
  • A dog is not intrinsically greater than a cat
  • A kitchen is not intrinsically more useful than a bathroom

These are not mutually exclusive. An object can have both quantitative and qualitative features, as can be seen in the case of the car in the following figure:

Figure 2.3 – A single object depicted by both quantitative (left) and qualitative (right) features


Finally, data can be continuous or discrete.

Some data is continuous, as follows:

  • A weight
  • A volume
  • A price

On the other hand, some data is discrete:

  • A color
  • A football score
  • A nationality

Note

Discrete != qualitative.

For example, a football score is discrete, but there is an intrinsic order: 3 points is more than 2.

See also

The pandas read_csv function has a lot of flexibility as it can use other separators, handle headers, and much more. This is described in the official documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.
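As a small illustration of that flexibility, here is a sketch that reads semicolon-separated data without a header row. The data is a made-up in-memory string, and the column names are supplied by hand; both are assumptions for the example.

```python
import io

import pandas as pd

# Hypothetical file content: semicolon-separated, no header row
raw = "1;male;22.0\n2;female;38.0\n"

df_custom = pd.read_csv(
    io.StringIO(raw),   # a file path such as 'train.csv' works the same way
    sep=";",            # non-default separator
    header=None,        # the data has no header line
    names=["PassengerId", "Sex", "Age"],  # so column names are given explicitly
)
print(df_custom)
```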

The pandas library allows I/O operations that have different types of inputs. For more information, have a look at the official documentation: https://pandas.pydata.org/docs/reference/io.html.

 

Splitting data

After loading data, splitting it is a crucial step. This recipe will explain why we need to split data, as well as how to do it.

Getting ready

Why do we need to split data? An ML model is quite like a student.

You provide a student with many lectures and exercises, with or without the answers. But more often than not, students are evaluated on a completely new problem. To succeed, they cannot simply memorize the exercises and their solutions: they also have to understand the underlying concepts and methods.

An ML model is no different: you train the model on training data and then evaluate it on test data. This way, you check whether the model has actually learned the task and generalizes well to new, unseen data.

So, the dataset is usually split into train and test sets:

  • The train set must be as large as possible to give as many samples as possible to the model
  • The test set must be large enough to be statistically significant in evaluating the model

Typical train/test splits range from 80%/20% for rather small datasets (for example, hundreds of samples) to 99%/1% for very large datasets (for example, millions of samples and more).

For this recipe and the others in this chapter, it is assumed that the code has been executed in the same notebook as the previous recipe since each recipe reuses the code from the previous ones.

How to do it…

Here are the steps to try out this recipe:

  1. You can split the data rather easily with scikit-learn and the train_test_split() function:
    # Import the train_test_split function
    from sklearn.model_selection import train_test_split
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns=['Survived']), df['Survived'],
        test_size=0.2, stratify=df['Survived'],
        random_state=0)

This function uses the following parameters as input:

  • X: All columns but the 'Survived' label
  • y: The 'Survived' label column
  • test_size: Set to 0.2 here, which means 20% of the data goes to the test set and the remaining 80% to the train set
  • stratify: Set to the 'Survived' column here, to ensure the same label balance in both splits
  • random_state: Set to 0 here; any fixed integer ensures reproducibility

It returns the following outputs:

  • X_train: The train split of X
  • X_test: The test split of X
  • y_train: The training split of y, associated with X_train
  • y_test: The test split of y, associated with X_test

Note

The stratify option is not mandatory, but it can be critical with imbalanced data. It can be used to preserve the balance of any qualitative feature across the splits, not just the labels.

This split should be done as early as possible when performing data processing so that you avoid any potential data leakage. From now on, all the preprocessing will be computed on the train set, and only then applied to the test set, in agreement with Figure 2.2.
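To see what stratification does in practice, here is a minimal sketch on made-up imbalanced labels (30 positives out of 100 samples, an assumption chosen so that the proportions are easy to read). The variable names are hypothetical and kept separate from the recipe's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30 positives out of 100 samples
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Positive rate in each split
print(y_tr.mean(), y_te.mean())
```

With stratify=y, both splits keep the 30% positive rate of the full dataset. Without it, the rate in a small test set can drift noticeably from one random split to another.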

See also

See the official documentation for the train_test_split function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.

 

Preparing quantitative data

Depending on the type of data, how the features must be prepared may differ. In this recipe, we’ll cover how to prepare quantitative data, including missing data imputation and rescaling.

Getting ready

In the Titanic dataset, as well as any other dataset, there may be missing data. There are several ways to deal with missing data. For example, you can drop a column or a row, or impute a value. There are many imputation techniques of varying sophistication. scikit-learn supplies several implementations of imputers, such as SimpleImputer and KNNImputer.

As we will see in this recipe, using SimpleImputer, we can impute the missing quantitative data with the mean value.

Once the missing data has been handled, we can prepare the quantitative data by rescaling it so that all the data is at the same scale.

Several rescaling strategies exist, such as min-max scaling, robust scaling, standard scaling, and others.

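In this recipe, we will use standard scaling. So, for each feature, we will subtract the mean value of this feature, and then divide it by the standard deviation of that feature:

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$

Here, $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.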

Fortunately, scikit-learn provides a fully working implementation via StandardScaler.
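To make the computation concrete, the following sketch applies standard scaling by hand to a made-up feature and compares the result with StandardScaler. The values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for a single feature
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# Manual standard scaling: subtract the mean, divide by the standard deviation
# (np.std uses the population formula, ddof=0, by default)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same computation through scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```

After scaling, the feature has a mean of 0 and a standard deviation of 1, whatever its original unit was.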

How to do it…

We will sequentially handle missing values and rescale the data in this recipe:

  1. Import the required classes – SimpleImputer for missing data imputation and StandardScaler for rescaling:
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
  2. Select the quantitative features we want to keep. Here, we will keep 'Pclass', 'Age', 'Fare', 'SibSp', and 'Parch' and store these features in new variables for both the train and test sets:
    quanti_columns = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
    # Get the quantitative columns
    X_train_quanti = X_train[quanti_columns]
    X_test_quanti = X_test[quanti_columns]
  3. Instantiate the simple imputer with a mean strategy. Here, the missing value of a feature will be replaced with the mean value of that feature:
    # Impute missing quantitative values with mean feature value
    quanti_imputer = SimpleImputer(strategy='mean')
  4. Fit the imputer on the train set and apply it to the test set so that it avoids leakage in the imputation:
    # Fit and impute the training set
    X_train_quanti = quanti_imputer.fit_transform(X_train_quanti)
    # Just impute the test set
    X_test_quanti = quanti_imputer.transform(X_test_quanti)
  5. Now that imputation has been performed, instantiate the scaler object:
    # Instantiate the standard scaler
    scaler = StandardScaler()
  6. Finally, fit and apply the standard scaler to the train set, and then apply it to the test set:
    # Fit and transform the training set
    X_train_quanti = scaler.fit_transform(X_train_quanti)
    # Just transform the test set
    X_test_quanti = scaler.transform(X_test_quanti)

We now have quantitative data with no missing values, fully rescaled, with no data leakage.
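As an alternative way of writing the same steps, the imputer and the scaler can be chained in a scikit-learn Pipeline. This is only a sketch on made-up arrays; the pipeline name and the data are assumptions, and the recipe itself keeps the two steps separate.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test features, each with one missing value
X_tr = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_te = np.array([[np.nan], [3.0]])

quanti_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Statistics (mean for imputation, mean and std for scaling) come from train only
X_tr_prepared = quanti_pipeline.fit_transform(X_tr)
# They are reused unchanged on the test set
X_te_prepared = quanti_pipeline.transform(X_te)

print(X_te_prepared)
```

The missing test value is replaced by the training mean (3.0), which then maps to 0 after scaling. Calling fit_transform on the train set and transform on the test set is what keeps test information out of the fitted statistics.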

There’s more…

In this recipe, we used the simple imputer, assuming there was missing data. In practice, it is highly recommended that you look at the data first to check whether there are missing values, as well as how many. It is possible to look at the number of missing values per column with the following code snippet:

# Display the number of missing values for each column
X_train[quanti_columns].isna().sum()

This will output the following:

Pclass      0
Age       146
Fare        0
SibSp       0
Parch       0

Thanks to this, we know that the Age feature has 146 missing values, while the other features have no missing data.
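Raw counts are easier to judge as a share of the rows. The sketch below shows one way to get that share, using a small made-up DataFrame as a stand-in for X_train[quanti_columns].

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for X_train[quanti_columns]
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of missing rows per column
missing_share = toy.isna().mean()
print(missing_share)
```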

See also

A few imputers are available in scikit-learn. The list is available here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute.

There are many ways to scale data, and you can find the methods that are available in scikit-learn here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing.

You might be interested in looking at this comparison of several scalers on some given data: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py.

 

Preparing qualitative data

In this recipe, we will prepare qualitative data, including missing value imputation and encoding.

Getting ready

Qualitative data requires different treatment from quantitative data. Imputing missing values with the mean value of a feature would make no sense (and would not work with non-numeric data): it makes more sense, for example, to use the most frequent value of a feature, that is, its mode. The SimpleImputer class allows us to do such things.

The same goes for rescaling: it would make no sense to rescale qualitative data. Instead, it is more common to encode it. One of the most typical techniques is called one-hot encoding.

The idea is to transform each category, out of N possible categories, into a vector holding a single 1 and N-1 zeros. In our example, the Embarked feature’s one-hot encoding would be as follows:

  • ‘C’ = [1, 0, 0]
  • ‘Q’ = [0, 1, 0]
  • ‘S’ = [0, 0, 1]

Note

Having N columns for N categories is not necessarily optimal. What happens if, in the preceding example, we remove the first column? If the value is not ‘Q’ = [1, 0] nor ‘S’ = [0, 1], then it must be ‘C’ = [0, 0]. There is no need to add one more column to have all the necessary information. This can be generalized to N categories only requiring N-1 columns to have all the information, which is why one-hot encoding functions usually allow you to drop a column.

The scikit-learn OneHotEncoder class allows us to do this. It also allows us to deal with unknown categories that may appear in the test set (or the production environment) with several strategies, such as raising an error, ignoring them, or mapping them to an infrequent class. Finally, it allows us to drop the first column after encoding.
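The behaviors described above can be checked on the three Embarked categories. This is a minimal sketch: the 'X' value is a made-up unknown category, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

ports = np.array([["C"], ["Q"], ["S"]])

# Full encoding: one column per category, in sorted order (C, Q, S)
full = OneHotEncoder().fit_transform(ports).toarray()

# Dropping the first column: 'C' becomes the all-zeros row
dropped = OneHotEncoder(drop="first").fit_transform(ports).toarray()

# Ignoring unknown categories: an unseen value is encoded as all zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(ports)
unknown = lenient.transform(np.array([["X"]])).toarray()

print(full)
print(dropped)
print(unknown)
```

One consequence worth keeping in mind: when a column is dropped and unknown values are ignored at the same time, an unknown category and the dropped category both end up as all zeros, so the model cannot tell them apart.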

How to do it…

Just like in the preceding recipe, we will handle any missing data and the features will be one-hot encoded:

  1. Import the necessary classes – SimpleImputer for missing data imputation (already imported in the previous recipe) and OneHotEncoder for encoding. We also need to import numpy so that we can concatenate the qualitative and quantitative data that’s been prepared at the end of this recipe:
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
  2. Select the qualitative features we want to keep: 'Sex' and 'Embarked'. Then, store these features in new variables for both the train and test sets:
    quali_columns = ['Sex', 'Embarked']
    # Get the qualitative columns
    X_train_quali = X_train[quali_columns]
    X_test_quali = X_test[quali_columns]
  3. Instantiate SimpleImputer with the most_frequent strategy. Any missing value will be replaced by the most frequent value of that feature:
    # Impute missing qualitative values with most frequent feature value
    quali_imputer = SimpleImputer(strategy='most_frequent')
  4. Fit and transform the imputer on the train set, and then transform the test set:
    # Fit and impute the training set
    X_train_quali = quali_imputer.fit_transform(X_train_quali)
    # Just impute the test set
    X_test_quali = quali_imputer.transform(X_test_quali)
  5. Instantiate the encoder. Here, we will specify the following parameters:
    • drop='first': This will drop the first column of each feature's encoding
    • handle_unknown='ignore': If a new value appears in the test set (or in production), it will be encoded as zeros:
      # Instantiate the encoder
      encoder = OneHotEncoder(drop='first', handle_unknown='ignore')
  6. Fit and transform the encoder on the training set, and then transform the test set using this encoder:
    # Fit and transform the training set
    X_train_quali = encoder.fit_transform(X_train_quali).toarray()
    # Just encode the test set
    X_test_quali = encoder.transform(X_test_quali).toarray()

Note

We need to call .toarray() on the output of the encoder because it returns a sparse matrix by default, which cannot be concatenated in that form with the other features.

  7. With that, all the data has been prepared – both quantitative and qualitative (considering this recipe and the previous one). It is now possible to concatenate this data before training a model:
    # Concatenate the data back together
    X_train = np.concatenate([X_train_quanti,
        X_train_quali], axis=1)
    X_test = np.concatenate([X_test_quanti, X_test_quali], axis=1)

There’s more…

It is possible to save the data as a pickle file, either to share it or save it and avoid having to prepare it again. The following code will allow us to do this:

import pickle
# Save the prepared splits; the with block closes the file afterward
with open('prepared_titanic.pkl', 'wb') as f:
    pickle.dump((X_train, X_test, y_train, y_test), f)

We now have fully prepared data that can be used to train ML models.
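Reading the data back is the mirror operation. The sketch below does a full round trip in a temporary folder with small stand-in objects, both of which are assumptions so that the snippet runs on its own. In the recipe, the tuple would hold X_train, X_test, y_train, and y_test.

```python
import os
import pickle
import tempfile

# Stand-in objects; in the recipe these would be X_train, X_test, y_train, y_test
prepared = ([[0.1, 1.0]], [[0.3, 0.0]], [1], [0])

# Temporary location used only for this example
path = os.path.join(tempfile.mkdtemp(), "prepared_titanic.pkl")

# Write the tuple to disk
with open(path, "wb") as f:
    pickle.dump(prepared, f)

# Read it back and unpack it in the same order
with open(path, "rb") as f:
    X_train_l, X_test_l, y_train_l, y_test_l = pickle.load(f)

print(y_train_l, y_test_l)
```

Pickle files can execute arbitrary code when loaded, so only load files that come from a trusted source.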

Note

Several steps have been omitted or simplified here for more clarity. Data may need more preparation, such as more thorough missing value imputation, outlier and duplicate detection (and perhaps removal), feature engineering, and so on. You are assumed to already have some sense of those aspects and are encouraged to read other materials about them if required.

See also

This more general documentation about missing data imputation is worth looking at: https://scikit-learn.org/stable/modules/impute.html.

Finally, this more general documentation about data preprocessing can be very useful: https://scikit-learn.org/stable/modules/preprocessing.html.

 

Training a model

Once data has been fully cleaned and prepared, it is fairly easy to train a model thanks to scikit-learn. In this recipe, before training a logistic regression model on the Titanic dataset, we will quickly recap the ML paradigm and the different types of ML we can use.

Getting ready

If you were asked how to differentiate a car from a truck, you may be tempted to provide a list of rules, such as the number of wheels, size, weight, and so on. By doing so, you would be able to provide a set of explicit rules that would allow anyone to identify a car and a truck as different types of vehicles.

Traditional programming is not so different. While developing algorithms, programmers often build explicit rules, which allow them to map from data input (for example, a vehicle) to answers (for example, a car). We can summarize this paradigm as data + rules = answers.

If we were to train an ML model to discriminate cars from trucks, we would use another strategy: we would feed an ML algorithm with many pieces of data and their associated answers, expecting the model to learn the correct rules by itself. This is a different approach that can be summarized as data + answers = rules. This paradigm difference is summarized in Figure 2.4. As small as it might seem to ML practitioners, it changes everything in terms of regularization:


Figure 2.4 – Comparing traditional programming with ML algorithms
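The two paradigms can be contrasted in a few lines of plain Python. This is a deliberately naive sketch: the weights, the 3,500 kg threshold, and the midpoint "learning" rule are all assumptions made for illustration, and the learning function only works when the two classes do not overlap.

```python
# Traditional programming: data + rules = answers
def classify_by_rule(weight_kg):
    """Hand-written rule; the 3500 kg threshold is an arbitrary assumption."""
    return "truck" if weight_kg > 3500 else "car"


# Machine learning: data + answers = rules
def learn_threshold(weights, labels):
    """Infer a threshold as the midpoint between the heaviest car
    and the lightest truck (assumes the classes are separable)."""
    heaviest_car = max(w for w, l in zip(weights, labels) if l == "car")
    lightest_truck = min(w for w, l in zip(weights, labels) if l == "truck")
    return (heaviest_car + lightest_truck) / 2


def classify_learned(weight_kg, threshold):
    """Apply the rule that was inferred from the data."""
    return "truck" if weight_kg > threshold else "car"


# Made-up data and the associated answers
weights = [1200, 1500, 1800, 7000, 9000, 12000]
labels = ["car", "car", "car", "truck", "truck", "truck"]

threshold = learn_threshold(weights, labels)
print(threshold)
print(classify_by_rule(5000), classify_learned(5000, threshold))
```

In the first function, a person chose the rule. In the second, the rule depends entirely on the data provided, which is why the quality and quantity of the data matter so much in ML.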

Regularizing traditional algorithms is conceptually straightforward. For example, what if the rules for defining a truck overlap with the bus definition? If so, we can add the fact that buses have lots of windows.

Regularization in ML is intrinsically implicit. What if the model in this case does not discriminate between buses and trucks?

  • Should we add more data?
  • Is the model complex enough to capture such a difference?
  • Is it underfitting or overfitting?

This fundamental property of ML makes regularization complex.

ML can be applied to many tasks. Anyone who uses ML knows there is not just one type of ML model.

Arguably, most ML models fall into three main categories:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

As is usually the case for categories, the landscape is more complex, with sub-categories and methods overlapping several categories. But this is beyond the scope of this book.

This book will focus on regularization for supervised learning. In supervised learning, the problem is usually quite easy to specify: we have input features, X (for example, apartment surface), and labels, y (for example, apartment price). The goal is to train a model so that it’s robust enough to predict y, given X.

The two major types of supervised learning are classification and regression:

  • Classification: The labels are made of qualitative data. For example, the task is predicting between two or more classes such as car, bus, and truck.
  • Regression: The labels are made of quantitative data. For example, the task is predicting an actual value, such as an apartment price.
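
As a minimal sketch of this distinction, the snippet below fits one regressor and one classifier on the same toy feature. The surface values, prices, and the "small"/"large" labels are made up for illustration and do not come from the book's datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy feature: apartment surface in square meters (made-up values)
X = np.array([[20.0], [35.0], [50.0], [65.0], [80.0], [95.0]])

# Regression: quantitative labels (made-up prices, in thousands)
y_price = np.array([100.0, 170.0, 250.0, 320.0, 400.0, 470.0])
reg = LinearRegression().fit(X, y_price)
price_pred = reg.predict(np.array([[60.0]]))  # a real-valued estimate

# Classification: qualitative labels (0 = "small", 1 = "large")
y_size = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression(max_iter=1000).fit(X, y_size)
class_pred = clf.predict(np.array([[60.0]]))  # one of the known classes

print(price_pred, class_pred)
```

The regressor returns a continuous value, while the classifier can only return one of the labels it saw during training.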

Again, the line can be blurry: some tasks can be solved with classification even though the labels are quantitative data, while other tasks can be treated as both classification and regression. See Figure 2.5:

Figure 2.5 – Regression versus classification

How to do it…

Assuming we want to train a logistic regression model (which will be explained properly in the next chapter), the scikit-learn library provides the LogisticRegression class, along with the fit() and predict() methods. Let’s learn how to use it:

  1. Import the LogisticRegression class:
    from sklearn.linear_model import LogisticRegression
  2. Instantiate a LogisticRegression object:
    # Instantiate the model
    lr = LogisticRegression()
  3. Fit the model on the train set:
    # Fit on the training data
    lr.fit(X_train, y_train)
  4. Optionally, compute predictions by using that model on the test set:
    # Compute and store predictions on the test data
    y_pred = lr.predict(X_test)
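
The steps above assume that X_train, y_train, and X_test already exist from the earlier data preparation recipes. For readers who want to run the flow in isolation, here is a self-contained sketch that swaps in synthetic data from make_classification as a stand-in for the prepared Titanic features; the sample count, feature count, and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and labels (assumption)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same three calls as in the recipe: instantiate, fit, predict
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# predict_proba() returns one column of probabilities per class
y_proba = lr.predict_proba(X_test)
print(y_pred[:5])
print(y_proba[:5])
```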

See also

Even though more details will be provided in the next chapter, you might be interested in looking at the documentation of the LogisticRegression class: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

 

Evaluating a model

Once the model has been trained, it is important to evaluate it. In this recipe, we will review a few typical metrics for both classification and regression before evaluating our model on the test set.

Getting ready

Many evaluation metrics exist. If we take a step back and consider a binary classification prediction, there are only four possible cases:

  • False positive (FP): Positive prediction, negative ground truth
  • True positive (TP): Positive prediction, positive ground truth
  • True negative (TN): Negative prediction, negative ground truth
  • False negative (FN): Negative prediction, positive ground truth

Figure 2.6 – Representation of false positive, true positive, true negative, and false negative

Based on this, we can define a wide range of evaluation metrics.
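
As an illustration, the four counts can be read from scikit-learn's confusion_matrix() function. The labels and predictions below are made up for the example.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```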

One of the most common metrics is accuracy, which is the ratio of correct predictions to all predictions. The definition of accuracy is as follows:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Note

Although very common, the accuracy may be misleading, especially for imbalanced labels. For example, let’s assume an extreme case where 99% of Titanic passengers survived, and we have a model that predicts that every passenger survived. Our model would have a 99% accuracy but would be wrong for 100% of passengers who did not survive.
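
The extreme case described in this note can be reproduced in a few lines. This is only a sketch with hypothetical labels (99 survivors and 1 non-survivor), not the actual Titanic data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical extreme case: 99 survivors (1) and 1 non-survivor (0)
y_true = np.array([1] * 99 + [0])
y_hat = np.ones_like(y_true)  # a "model" that always predicts survival

acc = accuracy_score(y_true, y_hat)
# Recall for the negative class: share of non-survivors correctly found
recall_neg = recall_score(y_true, y_hat, pos_label=0)
print('accuracy:', acc)
print('recall on non-survivors:', recall_neg)
```

The accuracy is high even though no non-survivor is ever identified, which is why a single metric should rarely be trusted on imbalanced labels.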

There are several other very common metrics, such as precision, recall, and the F1 score.

Precision is most suited when you’re trying to maximize the true positives and minimize the false positives – for example, making sure you detect only surviving passengers:

$$\text{precision} = \frac{TP}{TP + FP}$$

Recall is most suited when you’re trying to maximize the true positives and minimize the false negatives – for example, making sure you don’t miss any surviving passengers:

$$\text{recall} = \frac{TP}{TP + FN}$$

The F1 score combines the precision and recall metrics as a harmonic mean:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Another useful classification evaluation metric is the Receiver Operating Characteristic Area Under Curve (ROC AUC) score.

All these metrics behave similarly: their values lie between 0 and 1, and the higher the value, the better the model. Some are also more robust to imbalanced labels, especially the F1 score and ROC AUC.
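
The following sketch computes these classification metrics with scikit-learn on made-up labels. The y_score values are hypothetical predicted probabilities for the positive class; ROC AUC is computed from such scores rather than from hard predictions.

```python
from sklearn.metrics import (
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth, hard predictions, and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_score)        # uses scores, not hard labels

print(precision, recall, f1, auc)
```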

For regression tasks, the most used metrics are the mean squared error (MSE) and the R2 score.

The MSE is the average squared difference between the predictions and the ground truth:

$$\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

Here, m is the number of samples, ŷ denotes the predictions, and y denotes the ground truth:

Figure 2.7 – Visualization of the errors for a regression task


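The R2 score is a metric that can be negative. It is defined as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$$

Here, ȳ is the mean of the ground truth values.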

Note

While the R2 score is a typical evaluation metric (the closer to 1, the better), the MSE is more typical of a loss function (the closer to 0, the better).
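
To make both regression metrics concrete, the sketch below computes them by hand with NumPy and checks the results against scikit-learn. The values are made up, and the last line shows that predictions worse than simply predicting the mean give a negative R2 score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up ground truth and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: mean of the squared errors
mse_manual = np.mean((y_hat - y_true) ** 2)

# R2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))

# Predictions in reversed order are worse than predicting the mean
r2_bad = r2_score(y_true, np.array([9.0, 7.0, 5.0, 3.0]))
print(r2_bad)
```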

How to do it…

Assuming our chosen evaluation metric here is accuracy, a very simple way to evaluate our model is to use the accuracy_score() function:

from sklearn.metrics import accuracy_score
# Compute the accuracy of our model on the test set
# (scikit-learn expects the ground truth first, then the predictions)
print('accuracy on test set:', accuracy_score(y_test,
    y_pred))

This outputs the following:

accuracy on test set: 0.7877094972067039

Here, the accuracy_score() function provides an accuracy of 78.77%, meaning about 79% of our model’s predictions are right.

See also

Here is a list of the available metrics in scikit-learn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

 

Performing hyperparameter optimization

In this recipe, we will explain what hyperparameter optimization is and some related concepts: the definition of a hyperparameter, cross-validation, and various hyperparameter optimization methods. We will then perform a grid search to optimize the hyperparameters of the logistic regression task on the Titanic dataset.

Getting ready

Most of the time, in ML, we do not simply train a model on the training set and evaluate it against the test set.

This is because, like most other algorithms, ML algorithms can be fine-tuned. This fine-tuning process allows us to optimize hyperparameters to achieve the best possible results. Some hyperparameters also act as levers for regularizing a model.

Note

In ML, hyperparameters are set before training and can be tuned by humans, unlike parameters, which are learned through the model training process and thus cannot be set by hand.

To properly optimize hyperparameters, a third split has to be introduced: the validation set.

This means there are now three splits:

  • The training set: Where the model is trained
  • The validation set: Where the hyperparameters are optimized
  • The test set: Where the model is evaluated

You could create such a set by splitting X_train into X_train and X_valid with the train_test_split() function from scikit-learn.
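
A minimal sketch of such a three-way split is shown below. It uses synthetic data from make_classification as a stand-in for the prepared dataset, and the split proportions and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and labels (assumption)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# First, split off the test set (20% of the data)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Then, carve a validation set out of the remaining training data
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train
)

print(len(X_train), len(X_valid), len(X_test))
```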

But in practice, most people just use cross-validation and do not bother creating this validation set. The k-fold cross-validation method allows us to make k splits out of the training set and divide it, as presented in Figure 2.8:

Figure 2.8 – Typical split between training, validation, and test sets, without cross-validation (top) and with cross-validation (bottom)


In doing so, not just one model is trained, but k, for a given set of hyperparameters. The performances are averaged over those k models, based on a chosen metric (for example, accuracy, MSE, and so on).

Several sets of hyperparameters can then be tested, and the one that shows the best performance is selected. After selecting the best hyperparameter set, the model is trained one more time on the entire train set to maximize the data for training purposes.
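
The procedure described above can be sketched by hand with cross_val_score(), which trains and scores one model per fold. The synthetic data and the candidate values for C are assumptions made for the example; the grid search shown later in this recipe automates the same loop.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training set (assumption)
X_train, y_train = make_classification(
    n_samples=400, n_features=8, random_state=0
)

# For each candidate value, train k=5 models and average their accuracy
mean_scores = {}
for C in [0.01, 0.03, 0.1]:
    scores = cross_val_score(
        LogisticRegression(C=C), X_train, y_train,
        cv=5, scoring='accuracy'
    )
    mean_scores[C] = scores.mean()

# Keep the best value and retrain once on the entire training set
best_C = max(mean_scores, key=mean_scores.get)
final_model = LogisticRegression(C=best_C).fit(X_train, y_train)
print(mean_scores, best_C)
```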

Finally, you can implement several strategies to optimize the hyperparameters, as follows:

  • Grid search: Test all combinations of the provided values of hyperparameters
  • Random search: Randomly search combinations of hyperparameters
  • Bayesian search: Perform Bayesian optimization on the hyperparameters
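
As an illustration of the second strategy, scikit-learn provides RandomizedSearchCV, which samples hyperparameter values from distributions instead of testing a fixed grid. The sketch below uses synthetic data, and the log-uniform range for C and the number of iterations are arbitrary assumptions. Bayesian search typically relies on third-party libraries and is not shown here.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training set (assumption)
X_train, y_train = make_classification(
    n_samples=400, n_features=8, random_state=0
)

# Sample 10 values of C from a log-uniform distribution on [1e-3, 10]
search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions={'C': loguniform(1e-3, 1e1)},
    n_iter=10,
    scoring='accuracy',
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```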

How to do it…

While being rather complicated to explain conceptually, hyperparameter optimization with cross-validation is super easy to implement. In this recipe, we’ll assume that we want to optimize a logistic regression model to predict whether a passenger would have survived:

  1. First, we need to import the GridSearchCV class from sklearn.model_selection:
    from sklearn.model_selection import GridSearchCV
  2. We would like to test the following hyperparameter values for C: [0.01, 0.03, 0.1]. We must define a parameter grid with the hyperparameter as the key and the list of values to test as the value.

The C hyperparameter is the inverse of the penalization strength: the higher C is, the lower the regularization. See the next chapter for more details:

# Define the hyperparameters we want to test
param_grid = { 'C': [0.01, 0.03, 0.1] }
  3. Next, let’s assume we want to optimize our model on accuracy, with five cross-validation folds. To do this, we will instantiate the GridSearchCV object and provide the following arguments:
    • The model to optimize, which is a LogisticRegression instance
    • The parameter grid, param_grid, which we defined previously
    • The scoring on which to optimize – that is, accuracy
    • The number of cross-validation folds, which has been set to 5 here
  4. We must also set return_train_score to True to get some useful information we can use later:
    # Instantiate the grid search object
    grid = GridSearchCV(
        LogisticRegression(),
        param_grid,
        scoring='accuracy',
        cv=5,
        return_train_score=True
    )
  5. Then, all we have to do is train this object on the train set. This will automatically make all the computations and store the results:
    # Fit and wait
    grid.fit(X_train, y_train)

This outputs the following:

    GridSearchCV(cv=5, estimator=LogisticRegression(),
        param_grid={'C': [0.01, 0.03, 0.1]},
        return_train_score=True, scoring='accuracy')

Note

Depending on the input dataset and the number of tested hyperparameters, the fit may take some time.

Once the fit has been completed, you can get a lot of useful information, such as the following:

  • The best hyperparameter set via the .best_params_ attribute
  • The best accuracy score via the .best_score_ attribute
  • The cross-validation results via the .cv_results_ attribute
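
Here is a self-contained sketch of how these attributes can be inspected. It reruns the grid search on synthetic data (an assumption, used so the snippet runs on its own), so the printed values will differ from those obtained on the Titanic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training set (assumption)
X_train, y_train = make_classification(
    n_samples=400, n_features=8, random_state=0
)

grid = GridSearchCV(
    LogisticRegression(),
    {'C': [0.01, 0.03, 0.1]},
    scoring='accuracy',
    cv=5,
    return_train_score=True,
)
grid.fit(X_train, y_train)

print('best params:', grid.best_params_)
print('best cross-validation accuracy:', grid.best_score_)

# cv_results_ is a dict of arrays with one entry per tested combination
results = grid.cv_results_
for params, test_score, train_score in zip(
    results['params'],
    results['mean_test_score'],
    results['mean_train_score'],
):
    print(params, 'validation:', test_score, 'train:', train_score)
```

The mean_train_score entries are only available because return_train_score was set to True.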
  6. You can then make predictions with the model that was trained with the optimized hyperparameters by calling the .predict() method on the grid search object:
    y_pred = grid.predict(X_test)
  7. Optionally, you can evaluate the chosen model with the accuracy score:
    print('Hyperparameter optimized accuracy:',
        accuracy_score(y_test, y_pred))

This provides the following output:

Hyperparameter optimized accuracy: 0.781229050279329

Thanks to the tools provided by scikit-learn, it is fairly easy to have a well-optimized model and evaluate it against several metrics. In the next recipe, we’ll learn how to diagnose bias and variance based on such an evaluation.

See also

The documentation for GridSearchCV can be found at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.

About the Author
  • Vincent Vandenbussche

    After a Ph.D. in Physics, Vincent Vandenbussche has worked for a decade in the industry, deploying ML solutions at scale. He has worked in numerous companies, such as Renault, L’Oréal, General Electric, Jellysmack, Chanel, and CERN. He also has a passion for teaching: he co-founded a data science bootcamp, was an ML lecturer at Mines Paris engineering school and EDHEC business school and trained numerous professionals in companies like ArcelorMittal and Orange.
