Mastering PyTorch

By Ashish Ranjan Jha
About this book
Deep learning is driving the AI revolution, and PyTorch is making it easier than ever before for anyone to build deep learning applications. This PyTorch book will help you uncover expert techniques to get the most out of your data and build complex neural network models. The book starts with a quick overview of PyTorch and explores using convolutional neural network (CNN) architectures for image classification. You'll then work with recurrent neural network (RNN) architectures and transformers for sentiment analysis. As you advance, you'll apply deep learning across different domains, such as music, text, and image generation using generative models and explore the world of generative adversarial networks (GANs). You'll not only build and train your own deep reinforcement learning models in PyTorch but also deploy PyTorch models to production using expert tips and techniques. Finally, you'll get to grips with training large models efficiently in a distributed manner, searching neural architectures effectively with AutoML, and rapidly prototyping models using PyTorch and fast.ai. By the end of this PyTorch book, you'll be able to perform complex deep learning tasks using PyTorch to build smart artificial intelligence models.
Publication date: February 2021
Publisher: Packt
Pages: 450
ISBN: 9781789614381

 

Chapter 1: Overview of Deep Learning using PyTorch

Deep learning is a class of machine learning methods that has revolutionized the way computers/machines are used to perform cognitive tasks in real life. Based on the mathematical concept of deep neural networks, deep learning uses large amounts of data to learn non-trivial relationships between inputs and outputs in the form of complex nonlinear functions. Some of the inputs and outputs, as demonstrated in Figure 1.1, could be the following:

  • Input: An image of text; output: Text
  • Input: Text; output: A natural voice speaking the text
  • Input: A natural voice speaking the text; output: Transcribed text

And so on. Here is a figure to support the preceding explanation:

Figure 1.1 – Deep learning model examples

Deep neural networks involve a lot of mathematical computations, linear algebraic equations, complex nonlinear functions, and various optimization algorithms. In order to build and train a deep neural network from scratch using a programming language such as Python, we would need to write all the necessary equations, functions, and optimization schedules ourselves. Furthermore, the code would need to be written such that large amounts of data can be loaded efficiently, and training can be performed in a reasonable amount of time. This amounts to implementing several lower-level details each time we build a deep learning application.

Deep learning libraries such as Theano and TensorFlow, among various others, have been developed over the years to abstract these details out. PyTorch is one such Python-based deep learning library that can be used to build deep learning models.

TensorFlow was introduced as an open source deep learning Python (and C++) library by Google in late 2015, and it revolutionized the field of applied deep learning. Facebook had been backing Torch, an open source deep learning library scripted in Lua, and in 2016 it responded with PyTorch, the Python-based equivalent. Around the same time, Microsoft released its own library – CNTK. Amidst the hot competition, PyTorch has been growing fast to become one of the most used deep learning libraries.

This book is meant to be a hands-on resource on some of the most advanced deep learning problems, how they are solved using complex deep learning architectures, and how PyTorch can be effectively used to build, train, and evaluate these complex models. While the book keeps PyTorch at the center, it also includes comprehensive coverage of some of the most recent and advanced deep learning models. The book is intended for data scientists, machine learning engineers, or researchers who have a working knowledge of Python and who, preferably, have used PyTorch before.

Due to the hands-on nature of this book, it is highly recommended to try the examples in each chapter by yourself on your computer to become proficient in writing PyTorch code. We begin with this introductory chapter and subsequently explore various deep learning problems and model architectures that will expose the various functionalities PyTorch has to offer.

This chapter will review some of the concepts behind deep learning and will provide a brief overview of the PyTorch library. We conclude this chapter with a hands-on exercise where we train a deep learning model using PyTorch.

The following topics will be covered in this chapter:

  • A refresher on deep learning
  • Exploring the PyTorch library
  • Training a neural network using PyTorch
 

Technical requirements

We will be using Jupyter notebooks for all of our exercises. The following Python libraries should be installed for this chapter using pip; for example, run pip install torch==1.4.0 on the command line:

jupyter==1.0.0
torch==1.4.0
torchvision==0.5.0
matplotlib==3.1.2
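
Once the packages are installed, a quick sanity check such as the following (run in a notebook cell or a Python shell) confirms the installed versions and shows whether a CUDA-capable GPU is visible. This snippet is merely illustrative and is not part of the chapter's exercise:

import torch
import torchvision

print(torch.__version__)          # expected to print 1.4.0
print(torchvision.__version__)    # expected to print 0.5.0
print(torch.cuda.is_available())  # True only if a usable GPU and CUDA setup are present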

All code files relevant to this chapter are available at https://github.com/PacktPublishing/Mastering-PyTorch/tree/master/Chapter01.

 

A refresher on deep learning

Neural networks are a sub-type of machine learning methods that are inspired by the structure and function of the human brain. In neural networks, each computational unit, called a neuron by analogy with the biological neuron, is connected to other neurons in a layered fashion. When the number of such layers is more than two, the neural network thus formed is called a deep neural network. Such models are generally called deep learning models.

Deep learning models have proven superior to other classical machine learning models because of their ability to learn highly complex relationships between input data and the output (ground truth). In recent times, deep learning has gained a lot of attention and rightly so, primarily because of the following two reasons:

  • The availability of powerful computing machines, especially in the cloud
  • The availability of huge amounts of data

Owing to Moore's law (the observation that the number of transistors on a chip, and with it the available computing power, roughly doubles every 2 years), we are now living in a time when deep learning models with several hundreds of layers can be trained within a realistic and reasonably short amount of time. At the same time, with the exponential increase in the use of digital devices everywhere, our digital footprint has exploded, resulting in gigantic amounts of data being generated across the world every moment.

Hence, it has been possible to train deep learning models for some of the most difficult cognitive tasks that were either intractable earlier or had sub-optimal solutions through other machine learning techniques.

Deep learning, or neural networks in general, has another advantage over the classical machine learning models. Usually, in a classical machine learning-based approach, feature engineering plays a crucial role in the overall performance of a trained model. However, a deep learning model does away with the need to manually craft features. With large amounts of data, deep learning models can perform very well without requiring hand-engineered features and can outperform the traditional machine learning models. The following graph indicates how deep learning models can leverage large amounts of data better than the classical machine learning models:

Figure 1.2 – Model performance versus dataset size

As can be seen in the graph, deep learning performance isn't necessarily distinguishable from that of classical models up to a certain dataset size. However, as the amount of data increases further, deep neural networks begin to outperform the non-deep learning models.

A deep learning model can be built based on various types of neural network architectures that have been developed over the years. A prime distinguishing factor between the different architectures is the type and combination of layers that are used in the neural network. Some of the well-known layers are the following (a short PyTorch sketch instantiating each of them follows this list):

  • Fully-connected or linear: In a fully connected layer, as shown in the following diagram, all neurons preceding this layer are connected to all neurons succeeding this layer:
Figure 1.3 – Fully connected layer

This example shows two consecutive fully connected layers with N1 and N2 neurons, respectively. Fully connected layers are a fundamental unit of many – in fact, most – deep learning classifiers.

  • Convolutional: The following diagram shows a convolutional layer, where a convolutional kernel (or filter) is convolved over the input:
Figure 1.4 – Convolutional layer

Convolutional layers are a fundamental unit of convolutional neural networks (CNNs), which are the most effective models for solving computer vision problems.

  • Recurrent: The following diagram shows a recurrent layer. While it looks similar to a fully connected layer, the key difference is the recurrent connection (marked with bold curved arrows):
Figure 1.5 – Recurrent layer

Recurrent layers have an advantage over fully connected layers in that they exhibit memorizing capabilities, which come in handy when working with sequential data, where past inputs need to be remembered along with the present ones.

  • DeConv (the reverse of a convolutional layer): Quite the opposite of a convolutional layer, a deconvolutional layer works as shown in the following diagram:
Figure 1.6 – Deconvolutional layer

This layer expands the input data spatially and hence is crucial in models that aim to generate or reconstruct images, for example.

  • Pooling: The following diagram shows the max-pooling layer, which is perhaps the most widely used kind of pooling layer:
Figure 1.7 – Pooling layer

This max-pooling layer pools the highest value from each 2x2-sized subsection of the input. Other forms of pooling are min-pooling and mean-pooling.

  • Dropout: The following diagram shows how dropout layers work. Essentially, in a dropout layer, some neurons are temporarily switched off (marked with X in the diagram), that is, they are disconnected from the network:
Figure 1.8 – Dropout layer

Dropout helps in model regularization as it forces the model to function well in the sporadic absence of certain neurons, which encourages it to learn generalizable patterns instead of memorizing the entire training dataset.
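
As a quick, illustrative aside (not taken from the chapter's exercise), each of the preceding layer types is available as a ready-made module in PyTorch's torch.nn package; the sizes used below are arbitrary placeholders chosen only for demonstration:

import torch.nn as nn

fully_connected = nn.Linear(in_features=256, out_features=64)                      # fully connected (linear) layer
convolution = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)             # convolutional layer
recurrent = nn.RNN(input_size=32, hidden_size=64, num_layers=1)                    # recurrent layer
deconvolution = nn.ConvTranspose2d(in_channels=16, out_channels=3, kernel_size=3)  # deconvolutional layer
pooling = nn.MaxPool2d(kernel_size=2)                                              # 2x2 max-pooling layer
dropout = nn.Dropout(p=0.2)                                                        # randomly zeroes 20% of its inputs during training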

A number of well-known architectures based on the previously mentioned layers are shown in the following diagram:

Figure 1.9 – Different neural network architectures

A more exhaustive set of neural network architectures can be found here: https://www.asimovinstitute.org/neural-network-zoo/.

Besides the types of layers and how they are connected in a network, other factors such as activation functions and the optimization schedule also define the model.

Activation functions

Activation functions are crucial to neural networks as they add the non-linearity without which, no matter how many layers we add, the entire neural network would be reduced to a simple linear model. The different types of activation functions listed here are basically different nonlinear mathematical functions.

Some of the popular activation functions are as follows:

  • Sigmoid: A sigmoid (or logistic) function is expressed as follows:

    y = 1 / (1 + e^(-x))

The function is shown in graph form as follows:

Figure 1.10 – Sigmoid function

As can be seen from the graph, the sigmoid function takes in a numerical value x as input and outputs a value y in the range (0, 1).

  • TanH: TanH is expressed as follows:

    y = (e^x - e^(-x)) / (e^x + e^(-x))

The function is shown in graph form as follows:

Figure 1.11 – TanH function

In contrast to sigmoid, the output y varies from -1 to 1 in the case of the TanH activation function. Hence, this activation is useful in cases where we need both positive as well as negative outputs.

  • Rectified linear units (ReLUs): ReLUs are more recent than the previous two and are simply expressed as follows:

    y = max(0, x)

The function is shown in graph form as follows:

Figure 1.12 – ReLU function

A distinct feature of ReLU in comparison with the sigmoid and TanH activation functions is that the output keeps growing with the input whenever the input is greater than 0. This prevents the gradient of this function from diminishing to 0, as happens with the previous two activation functions. However, whenever the input is negative, both the output and the gradient will be 0.

  • Leaky ReLU: ReLUs entirely suppress any incoming negative input by outputting 0. We may, however, want to also process negative inputs for some cases. Leaky ReLUs offer the option of processing negative inputs by outputting a fraction k of the incoming negative input. This fraction k is a parameter of this activation function, which can be mathematically expressed as follows:

    y = x if x ≥ 0; y = k · x if x < 0

The following graph shows the input-output relationship for leaky ReLU:

Figure 1.13 – Leaky ReLU function

Activation functions are an actively evolving area of research within deep learning. It will not be possible to list all of the activation functions here but I encourage you to check out the recent developments in this domain. Many activation functions are simply nuanced modifications of the ones mentioned in this section.
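
All of the activation functions mentioned above ship with PyTorch. The short, illustrative snippet below (not part of the chapter's exercise) applies each of them to an arbitrarily chosen tensor of sample values so that their different output ranges can be inspected:

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.sigmoid(x))        # outputs squashed into the range (0, 1)
print(torch.tanh(x))           # outputs squashed into the range (-1, 1)
print(F.relu(x))               # negative inputs clipped to 0
print(F.leaky_relu(x, 0.01))   # negative inputs scaled by the fraction k=0.01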

Optimization schedule

So far, we have spoken of how a neural network structure is built. In order to train a neural network, we need to adopt an optimization schedule. Like any other parameter-based machine learning model, a deep learning model is trained by tuning its parameters. The parameters are tuned through the process of backpropagation, wherein the final or output layer of the neural network yields a loss. This loss is calculated with the help of a loss function that takes in the neural network's final layer's outputs and the corresponding ground truth target values. This loss is then backpropagated to the previous layers using gradient descent and the chain rule of differentiation.

The parameters or weights at each layer are accordingly modified in order to minimize the loss. The extent of modification is scaled by a coefficient known as the learning rate, which is typically a small value between 0 and 1. This whole procedure of updating the weights of a neural network, which we call the optimization schedule, has a significant impact on how well a model is trained. Therefore, a lot of research has been done in this area and is still ongoing. The following are a few popular optimization schedules:

  • Stochastic Gradient Descent (SGD): It updates the model parameters in the following fashion:

    β = β - α · ∇_β L(X, y, β)

Here, β represents the parameters of the model, X and y are the input training data and the corresponding labels respectively, L is the loss function, and α is the learning rate. SGD performs this update for every training example pair (X, y). A variant of this – mini-batch gradient descent – performs updates for every k examples, where k is the batch size, with the gradients calculated over the whole mini-batch at once. Another variant, batch gradient descent, performs parameter updates by calculating the gradient across the entire dataset.

  • Adagrad: In the previous optimization schedule, we used a single learning rate for all the parameters of the model. However, different parameters might need to be updated at different paces, especially in cases of sparse data, where some parameters are more actively involved in feature extraction than others. Adagrad introduces the idea of per-parameter updates, as shown here:

    β_i^(t+1) = β_i^t - (α / √(SSG_i^t + ε)) · ∇L(β_i^t)

Here, we use the subscript i to denote the ith parameter and the superscript t is used to denote the time step t of the gradient descent iterations. SSG_i^t is the sum of squared gradients for the ith parameter starting from time step 0 to time step t. ε is used to denote a small value added to SSG to avoid division by zero. Dividing the global learning rate α by the square root of SSG ensures that the learning rate for frequently changing parameters lowers faster than the learning rate for rarely updated parameters.

  • Adadelta: In Adagrad, the denominator of the learning rate is a term that keeps on rising in value due to added squared terms in every time step. This causes the learning rates to decay to vanishingly small values. To tackle this problem, Adadelta introduces the idea of computing the sum of squared gradients only over a window of previous time steps. In fact, we can express it as a running decaying average of the past gradients:

    SSG_i^t = γ · SSG_i^(t-1) + (1 - γ) · (∇L(β_i^t))^2

γ here is the decaying factor we wish to choose for the previous sum of squared gradients. With this formulation, we ensure that the sum of squared gradients does not accumulate to a large value, thanks to the decaying average. Once SSG_i^t is defined, we can use the Adagrad equation to define the update step for Adadelta.

However, if we look closely at the Adagrad equation, the root mean squared gradient is not a dimensionless quantity and hence should ideally not be used as a coefficient for the learning rate. To resolve this, we define another running average, this time for the squared parameter updates. Let's first define the parameter update:

    Δβ_i^t = - (α / √(SSG_i^t + ε)) · ∇L(β_i^t)

And then, similar to the running decaying average of the past gradients equation (the first equation under Adadelta), we can define the square sum of parameter updates as follows:

    SSPU_i^t = γ · SSPU_i^(t-1) + (1 - γ) · (Δβ_i^t)^2

Here, SSPU is the sum of squared parameter updates. Once we have this, we can adjust for the dimensionality problem in the Adagrad equation with the final Adadelta equation:

    Δβ_i^t = - (√(SSPU_i^(t-1) + ε) / √(SSG_i^t + ε)) · ∇L(β_i^t)

    β_i^(t+1) = β_i^t + Δβ_i^t

Notably, the final Adadelta equation doesn't require any learning rate. One can, however, still provide a learning rate as a multiplier. Hence, the only mandatory hyperparameter for this optimization schedule is the decaying factor.

  • RMSprop: We have implicitly discussed the internal workings of RMSprop while discussing Adadelta, as both are pretty similar. The only difference is that RMSprop does not adjust for the dimensionality problem, and hence the update equation stays the same as the equation presented in the Adagrad section, wherein SSG_i^t is obtained from the first equation in the Adadelta section. This essentially means that we do need to specify both a base learning rate and a decaying factor in the case of RMSprop.
  • Adaptive Moment Estimation (Adam): This is another optimization schedule that calculates customized learning rates for each parameter. Just like Adadelta and RMSprop, Adam also uses the decaying average of the previous squared gradients as demonstrated in the first equation in the Adadelta section. However, it also uses the decaying average of previous gradient values:

    SG_i^t = γ' · SG_i^(t-1) + (1 - γ') · ∇L(β_i^t)

SG and SSG are mathematically equivalent to estimating the first and second moments of the gradient respectively, hence the name of this method – adaptive moment estimation. Usually, γ and γ' are close to 1 and in that case, the initial values for both SG and SSG might be pushed towards zero. To counteract that, these two quantities are reformulated with the help of bias correction:

    SG_i^t ← SG_i^t / (1 - (γ')^t)

and

    SSG_i^t ← SSG_i^t / (1 - γ^t)

Once they are defined, the parameter update is expressed as follows:

    β_i^(t+1) = β_i^t - (α / √(SSG_i^t + ε)) · SG_i^t

Basically, the gradient on the extreme right-hand side of the equation is replaced by the decaying average of the gradient. Notably, Adam optimization involves three hyperparameters – the base learning rate, and the two decaying rates for the gradients and squared gradients. Adam is one of the most successful, if not the most successful, optimization schedules in recent times for training complex deep learning models.

So, which optimizer shall we use? It depends. If we are dealing with sparse data, then the adaptive optimizers (Adagrad, Adadelta, RMSprop, and Adam) will be advantageous because of the per-parameter learning rate updates. As mentioned earlier, with sparse data, different parameters might be updated at different paces, and hence a customized per-parameter learning rate mechanism can greatly help the model in reaching optimal solutions. SGD might also find a decent solution but will take much longer in terms of training time. Among the adaptive ones, Adagrad has the disadvantage of vanishing learning rates due to a monotonically increasing learning rate denominator.

RMSProp, Adadelta, and Adam are quite close in terms of their performance on various deep learning tasks. RMSprop is largely similar to Adadelta, except for the use of the base learning rate in RMSprop versus the use of the decaying average of previous parameter updates in Adadelta. Adam is slightly different in that it also includes the first-moment calculation of gradients and accounts for bias correction. Overall, Adam could be the optimizer to go with, all else being equal. We will use some of these optimization schedules in the exercises in this book. Feel free to switch them with another one to observe changes in the following:

  • Model training time and trajectory (convergence)
  • Final model performance
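
To make such switching concrete, here is a minimal, hedged sketch of how these schedules are selected and used in PyTorch via the torch.optim package. It assumes that a model (any nn.Module) and a single batch of inputs X with labels y are already defined; those names are placeholders rather than code from the chapter's exercise:

import torch.nn.functional as F
import torch.optim as optim

# Pick exactly one of the optimization schedules discussed above:
optimizer = optim.SGD(model.parameters(), lr=0.01)
# optimizer = optim.Adagrad(model.parameters(), lr=0.01)
# optimizer = optim.Adadelta(model.parameters())         # learning rate is optional here (defaults to 1.0)
# optimizer = optim.RMSprop(model.parameters(), lr=0.01)
# optimizer = optim.Adam(model.parameters(), lr=0.001)

# A single training step: forward pass, loss, backpropagation, parameter update
optimizer.zero_grad()
loss = F.cross_entropy(model(X), y)
loss.backward()
optimizer.step()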

In the coming chapters, we will use many of these architectures, layers, activation functions, and optimization schedules in solving different kinds of machine learning problems with the help of PyTorch. In the example included in this chapter, we will create a convolutional neural network that contains convolutional, linear, max-pooling, and dropout layers. Log-Softmax is used for the final layer, and ReLU is used as the activation function for all the other layers. The model is trained using an Adadelta optimizer with a fixed learning rate of 0.5.
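
To tie these pieces together, the following is a rough, hedged outline of what such a model could look like, assuming a single-channel 28x28 input (for example, handwritten digit images); the layer sizes are placeholders, and the exact architecture used in the chapter's exercise may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3)  # convolutional layer
        self.dropout = nn.Dropout(0.25)               # dropout layer
        self.fc1 = nn.Linear(16 * 13 * 13, 64)        # linear (fully connected) layer
        self.fc2 = nn.Linear(64, 10)                  # output layer for, say, 10 classes

    def forward(self, x):
        x = F.relu(self.conv1(x))                     # ReLU activation
        x = F.max_pool2d(x, 2)                        # max-pooling
        x = self.dropout(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)                # Log-Softmax on the final layer

model = ConvNet()
optimizer = optim.Adadelta(model.parameters(), lr=0.5)  # fixed learning rate of 0.5, as described above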

     
About the Author
  • Ashish Ranjan Jha

    Ashish Ranjan Jha received his bachelor's degree in electrical engineering from IIT Roorkee (India), a master's degree in computer science from EPFL (Switzerland), and an MBA from Quantic School of Business (Washington). He received a distinction in all three of his degrees. He has worked for large technology companies, including Oracle and Sony, as well as more recent tech unicorns such as Revolut, mostly focused on artificial intelligence. He currently works as a machine learning engineer. Ashish has worked on a range of products and projects, from developing an app that uses sensor data to predict the mode of transport to detecting fraud in car damage insurance claims. Besides being an author, machine learning engineer, and data scientist, he also blogs frequently on his personal blog site about the latest research and engineering topics around machine learning.
