Reader small image

You're reading from  Hands-On Mathematics for Deep Learning

Product typeBook
Published inJun 2020
Reading LevelIntermediate
PublisherPackt
ISBN-139781838647292
Edition1st Edition
Languages
Right arrow
Author (1)
Jay Dawani
Jay Dawani
author image
Jay Dawani

Jay Dawani is a former professional swimmer turned mathematician and computer scientist. He is also a Forbes 30 Under 30 Fellow. At present, he is the Director of Artificial Intelligence at Geometric Energy Corporation (NATO CAGE) and the CEO of Lemurian Labs - a startup he founded that is developing the next generation of autonomy, intelligent process automation, and driver intelligence. Previously he has also been the technology and R&D advisor to Spacebit Capital. He has spent the last three years researching at the frontiers of AI with a focus on reinforcement learning, open-ended learning, deep learning, quantum machine learning, human-machine interaction, multi-agent and complex systems, and artificial general intelligence.
Read more about Jay Dawani

Right arrow

Regularization

In the previous chapter, we learned about (deep) feedforward neural networks and how they are structured. We learned how these architectures can leverage their hidden layers and non-linear activations to learn to perform well on some very challenging tasks, which linear models aren't able to do. We also saw that neural networks tend to overfit to the training data by learning noise in the dataset, which leads to errors in the testing data. Naturally, since our goal is to create models that generalize well, we want to close the gap so that our models perform just as well on both datasets. This is the goal of regularization—to reduce test error, sometimes at the expense of greater training error.

In this chapter, we will cover a variety of methods used in regularization, how they work, and why certain techniques are preferred over others. This includes...

The need for regularization

In previous chapters, we learned how feedforward neural networks are basically a complex function that maps an input to a corresponding target/label by learning the underlying distribution using the training data. We can recall that during training, after an error has been calculated during the forward pass, backpropagation is used to update the parameters in order to reduce the loss and better approximate the data distribution. We also learned about the capacity of neural networks, the bias-variance trade-off, and how neural networks can underfit or overfit to the training data, which prevents it from being able to perform well on unseen data or test data (that is, a generalization error occurs).

Before we get into what exactly regularization is, let's revisit overfitting and underfitting. Neural networks, as we know, are universal function approximators...

Norm penalties

Adding a parameter norm penalty to the objective function is the most classic of the regularization methods. What this does is limit the capacity of the model. This method has been around for several decades and predates the advent of deep learning. We can write this as follows:

Here, . The α value, in the preceding equation, is a hyperparameter that determines how large a regularizing effect the regularizer will have on the regularized cost function. The greater the value of α is, the more regularization is applied, and the smaller it is, the less of an effect regularization has on the cost function.

In the case of neural networks, we only apply the parameter norm penalties to the weights since they control the interaction or relationship between two nodes in successive layers, and we leave the biases as they are since they need less data in comparison...

Early stopping

During training, we know that our neural networks (which have sufficient capacity to learn the training data) have a tendency to overfit to the training data over many iterations, and then they are unable to generalize what they have learned to perform well on the test set. One way of overcoming this problem is to plot the error on the training and test sets at each iteration and analytically look for the iteration where the error from the training and test sets is the closest. Then, we choose those parameters for our model.

Another advantage of this method is that this in no way alters the objective function in the way that parameter norms do, which makes it easy to use and means it doesn't interfere with the network's learning dynamics, which is shown in the following diagram:

However, this approach isn't perfect—it does have a downside...

Parameter tying and sharing

The preceding parameter norm penalties work by penalizing the model parameters when they deviate from 0 (a fixed value). But sometimes, we may want to express prior knowledge about which parameters would be better suited to the model. Although we may not know what those parameters are, thanks to domain knowledge and the architecture of the model, we know that there are likely to be some dependencies between the parameters of the model.

These dependencies could be some specific parameters that are closer to some than to others. Let's suppose we have two different models for a classification task and detect the same number of classes. Their input distributions, however, are not the same. Let's name the first model A with θ(A) parameters and the second model B with θ(B) parameters. Both of these models map their respective inputs...

Dataset augmentation

Deep feedforward networks, as we have learned, are very data-hungry and they use all this data to learn the underlying data distribution so that they can use their gained knowledge to make predictions on unseen data. This is because the more data they see, the more likely it is that what they encounter in the test set will be an interpolation of the distribution they have already learned. But getting a large enough dataset with good-quality labeled data is by no means a simple task (especially for certain problems where gathering data could end up being very costly). A method to circumvent this issue is using data augmentation; that is, generating synthetic data and using it to train our deep neural network.

The way synthetic data generation works is that we use a generative model (more on this in Chapter 12, Generative Models) to learn the underlying distribution...

Dropout

In the preceding section, we learned about applying penalties to the norm of the weights to regularize them, as well as other approaches, such as dataset augmentation and early stopping. However, there is another effective approach that is widely used in practice, known as dropout.

So far, when training neural networks, all the weights have been learned together. However, dropout alters this idea by having the network only learn a fraction of the weights during each iteration. The reason for this is to avoid co-adaptation. This occurs when we train the entire network over all the training data and some connections end up stronger than others, thereby contributing more toward the network's predictive capabilities because the stronger connections overpower the weaker connections, effectively ignoring them. As we train the network with more iterations, some of the...

Adversarial training

Nowadays, neural networks have started to reach human-level accuracy on a number of tasks, and in some, they can be seen to have even surpassed humans. But have they really surpassed humans or does it just seem this way? In production environments, we often have to deal with noisy data, which can cause our model to make incorrect predictions. So, we will now learn about another very important method of regularization—adversarial training.

Before we get into the what and the how of adversarial training, let's take a look at the following diagram:

What we have done, in the preceding diagram, is added in negligible Gaussian noise to the pixels of the original image. To us, the image looks exactly the same, but to a convolutional neural network, it looks entirely different. This is a problem, and it occurs even when our models are perfectly trained...

Summary

In this chapter, we covered a variety of methods that are used to regularize the parameters of a neural network. These methods are very important when it comes to training our models because they help ensure that they can generalize to unseen data by preventing overfitting, thereby performing well on the tasks we want to use them for. In the following chapters, we will learn about different types of neural networks and how each one is best suited for certain types of problems. Each neural network has a form of regularization that it can use to help improve performance.

In the next chapter, we will learn about convolutional neural networks, which are used for computer vision.

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Hands-On Mathematics for Deep Learning
Published in: Jun 2020Publisher: PacktISBN-13: 9781838647292
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at £13.99/month. Cancel anytime

Author (1)

author image
Jay Dawani

Jay Dawani is a former professional swimmer turned mathematician and computer scientist. He is also a Forbes 30 Under 30 Fellow. At present, he is the Director of Artificial Intelligence at Geometric Energy Corporation (NATO CAGE) and the CEO of Lemurian Labs - a startup he founded that is developing the next generation of autonomy, intelligent process automation, and driver intelligence. Previously he has also been the technology and R&D advisor to Spacebit Capital. He has spent the last three years researching at the frontiers of AI with a focus on reinforcement learning, open-ended learning, deep learning, quantum machine learning, human-machine interaction, multi-agent and complex systems, and artificial general intelligence.
Read more about Jay Dawani