Chapter 10: Scaling and Managing Training

So far, we have been on an exciting journey through the realm of Deep Learning (DL). We have learned how to recognize images, how to create new images or generate new text, and how to train machines without fully labeled datasets. It's an open secret that achieving good results with a DL model requires a massive amount of compute power, often with the help of a Graphics Processing Unit (GPU). We have come a long way since the early days of DL, when data scientists had to manually distribute training across GPU nodes. PyTorch Lightning abstracts away most of the complexity of managing the underlying hardware or pushing training down to the GPU.

In the earlier chapters, we pushed training down to the GPU via brute force. However, doing so is not practical when you have to deal with a massive training effort over large-scale data. In this chapter, we will take a nuanced view of the challenges of training a model at scale and managing...

Technical Requirements

In this chapter, we will be using version 1.5.2 of PyTorch Lightning. Please install this version using the following command:

!pip install pytorch-lightning==1.5.2

Managing training

In this section, we will go through some of the common challenges that you may encounter while managing the training of DL models. These include troubleshooting issues with saving model hyperparameters and debugging model logic efficiently.

Saving model hyperparameters

There is often a need to save a model's hyperparameters. A few reasons are reproducibility, consistency, and the fact that some models' network architectures are extremely sensitive to hyperparameter values.

On more than one occasion, you may find yourself unable to load a model from its checkpoint: the load_from_checkpoint method of the LightningModule class fails with an error.

Solution

A checkpoint is nothing more than a saved state of the model. Checkpoints contain the precise values of all parameters used by the model. However, the hyperparameter arguments passed to the model's __init__ method are not saved in the checkpoint by default. Calling self.save_hyperparameters inside __init__ of the...
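As a minimal sketch of this pattern (the model class and its constructor arguments here are hypothetical):

    import torch
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self, hidden_dim: int = 128, learning_rate: float = 1e-3):
            super().__init__()
            # Stores hidden_dim and learning_rate in the checkpoint (under
            # the "hyper_parameters" key) and exposes them via self.hparams
            self.save_hyperparameters()
            self.layer = torch.nn.Linear(self.hparams.hidden_dim, 10)

    # Because the hyperparameters were saved, load_from_checkpoint can
    # rebuild the model without the constructor arguments being passed again:
    # model = LitModel.load_from_checkpoint("path/to/checkpoint.ckpt")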

Scaling up training

Scaling up training requires us to speed up the training process for large amounts of data and to make better use of GPUs and TPUs. In this section, we will cover some tips on how to use the provisions in PyTorch Lightning efficiently to accomplish this.

Speeding up model training using a number of workers

How can the PyTorch Lightning framework help speed up model training? One useful parameter to know is num_workers, which comes from PyTorch; PyTorch Lightning builds on top of it by advising you on the number of workers to use.

Solution

The PyTorch Lightning framework offers a number of provisions for speeding up model training, such as the following:

  • You can set a non-zero value for the num_workers argument to speed up model training. The following code snippet provides an example (here, dataset is assumed to be an existing torch Dataset instance):
    import torch.utils.data as data

    # num_workers > 0 loads batches in parallel worker subprocesses
    dataloader = data.DataLoader(dataset, num_workers=4)

The optimal num_workers value depends on the batch size and configuration...
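As a rough starting point (a common heuristic, not a rule from the library), you can begin with one worker per available CPU core and tune from there; dataset is again assumed to be an existing torch Dataset:

    import os
    import torch.utils.data as data

    # Heuristic starting point: one worker per CPU core; profile and
    # adjust based on batch size and I/O characteristics
    num_workers = os.cpu_count() or 1
    dataloader = data.DataLoader(dataset, num_workers=num_workers)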

Controlling training

There is often a need for an audit, balance, and control mechanism during the training process. Imagine you are training a model for 1,000 epochs and a network failure causes an interruption after 500 epochs. How do you resume training from a given point without losing all your progress, or save a model checkpoint from a cloud environment? Let's see how to deal with these practical challenges, which are often part and parcel of an engineer's life.
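For instance, here is a minimal sketch of resuming an interrupted run in PyTorch Lightning 1.5 (the model class and the checkpoint path are illustrative; the file would have been written by an earlier run):

    import pytorch_lightning as pl

    # Point the Trainer at the last saved checkpoint; the training state
    # (epoch, optimizer state, and so on) is restored from the file
    trainer = pl.Trainer(
        max_epochs=1000,
        resume_from_checkpoint="lightning_logs/version_0/checkpoints/epoch=499.ckpt",
    )
    trainer.fit(LitModel())  # resumes from epoch 500 instead of epoch 0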

Saving model checkpoints when using the cloud

Notebooks hosted in cloud environments such as Google Colab have resource limits and idle timeout periods. If these limits are exceeded during the development of a model, the notebook is deactivated. Owing to the inherently elastic nature of cloud environments (elasticity being one of the cloud's value propositions), the underlying compute and storage resources are decommissioned when a notebook is deactivated. If you refresh...
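One common workaround on Google Colab is to write checkpoints to mounted Google Drive so that they survive notebook deactivation. A sketch of this, with illustrative paths:

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    # On Colab, mount Google Drive first so the directory persists:
    # from google.colab import drive
    # drive.mount("/content/gdrive")

    checkpoint_callback = ModelCheckpoint(
        dirpath="/content/gdrive/MyDrive/checkpoints",  # persistent storage
        monitor="val_loss",
        save_top_k=1,
        save_last=True,
    )
    trainer = pl.Trainer(callbacks=[checkpoint_callback])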

Further reading

We have mentioned some key tips and tricks that we have found useful for common troubleshooting. You can always refer to the Speed up model training documentation for more details on this and related topics. Here is a link to the documentation: https://pytorch-lightning.readthedocs.io/en/latest/guides/speed.html.

We have described how PyTorch Lightning supports the TensorBoard logging framework by default. Here is a link to the TensorBoard website: https://www.tensorflow.org/tensorboard.

Additionally, PyTorch Lightning supports CometLogger, CSVLogger, MLflowLogger, and other logging frameworks. You can refer to the Logging documentation for details on how those other logger types can be enabled. Here is a link to the documentation: https://pytorch-lightning.readthedocs.io/en/stable/extensions/logging.html.
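As a small illustration, swapping the default TensorBoard logger for a CSVLogger looks like this (the directory and experiment name are illustrative):

    import pytorch_lightning as pl
    from pytorch_lightning.loggers import CSVLogger

    # Any supported logger can be passed via the Trainer's logger argument
    logger = CSVLogger(save_dir="logs", name="my_experiment")
    trainer = pl.Trainer(logger=logger)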

Summary

We began this book with nothing more than curiosity about DL and PyTorch Lightning. Anyone new to Deep Learning, or a curious beginner to PyTorch Lightning, can get their feet wet by trying simple image recognition models and then raise their game by learning skills such as Transfer Learning (TL) or how to make use of other pre-trained architectures. We then leveraged the PyTorch Lightning framework to build not just image recognition models but also Natural Language Processing (NLP) models, time series models, and solutions to other traditional Machine Learning (ML) challenges. Along the way, we learned about RNNs, LSTMs, and Transformers.

In the next section of the book, we explored exotic DL models such as Generative Adversarial Networks (GANs), semi-supervised learning, and self-supervised learning, which expand the boundaries of what is possible in the domain of ML. These are not just advanced models but also super cool ways to create art, and a lot of fun to work with. We wrapped...
