Distributed Training at a Glance

When we face a complex problem in real life, we usually try to solve it by dividing the big problem into smaller parts that are easier to handle. Then, by combining the partial solutions obtained from those smaller pieces, we reach the solution to the original problem. This strategy, called divide and conquer, is frequently used to solve computational tasks, and we can say that it is the basis of the parallel and distributed computing areas.

It turns out that this idea of dividing a big problem into small pieces comes in handy for accelerating the training process of complex models. When a single resource is not enough to train the model in a reasonable time, the only way out is to break the training process down and spread it across multiple resources. In other words, we need to distribute the training process.

Here is what you will learn as part of this chapter:

  • The basic concepts of distributed training
  • ...

Technical requirements

You can find the complete code examples mentioned in this chapter in the book’s GitHub repository at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main.

You can use your favorite environment, such as Google Colab or Kaggle, to execute the code provided.
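Before diving in, you may want to run a quick sanity check. The following snippet is an illustrative sketch (it is not part of the book's repository) that confirms PyTorch is installed and reports whether CUDA and the distributed package are available in your environment:

```python
# Quick environment check (illustrative sketch, not from the book's repository)
import torch
import torch.distributed as dist

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"torch.distributed available: {dist.is_available()}")
```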

A first look at distributed training

We’ll start this chapter by discussing the reasons for distributing the training process among multiple resources. Then, we’ll learn what resources are commonly used to execute this process.

When do we need to distribute the training process?

The most common reason to distribute the training process is to accelerate model building. If the training process is taking a long time to complete and we have multiple resources at hand, we should consider distributing the training process among these resources to reduce the training time.

The second motivation for going distributed is a lack of memory to load a large model on a single resource. In this situation, we rely on distributed training to allocate different parts of the large model to distinct devices or resources so that the model can be loaded into the system.

However, distributed training is not a silver bullet that solves...

Learning the fundamentals of parallelism strategies

In the previous section, we learned that the distributed training approach divides the whole training process into small parts. As a result, the entire training process can be solved in parallel because each of these small parts is executed simultaneously in distinct computing resources.

The parallelism strategy defines how to divide the training process into small parts. There are two main parallelism strategies: model and data parallelism. The following sections explain both.

Model parallelism

Model parallelism divides the set of operations that are executed during the training process into smaller subsets of computing tasks. By doing this, the distributed process can run these smaller subsets of operations in distinct computing resources, thus accelerating the entire training process.

It turns out that operations executed in the forward and backward phases are not independent of each other. In other words, the execution...
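To make the idea more concrete, here is a minimal, hypothetical sketch of manual model parallelism in PyTorch. It is not taken from the book's repository and assumes two GPUs are visible as cuda:0 and cuda:1; the model's layers are split between them, and the intermediate activations are moved from one device to the other during the forward pass:

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """A simple network split across two GPUs (hypothetical example)."""
    def __init__(self):
        super().__init__()
        # The first part of the model lives on GPU 0, the second part on GPU 1
        self.part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move the intermediate activations to the second device
        return self.part2(x.to("cuda:1"))

model = TwoDeviceModel()
output = model(torch.randn(32, 1024))  # the output resides on cuda:1
```

Because part2 depends on the output of part1, the two devices do not compute at the same time in this naive split, which is precisely the dependency issue discussed above.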

Distributed training on PyTorch

This section introduces the basic workflow to implement distributed training on PyTorch, along with the components used in this process.

Basic workflow

Generally speaking, the basic workflow to implement distributed training on PyTorch comprises the steps illustrated in Figure 8.14:

Figure 8.14 – Basic workflow to implement distributed training in PyTorch

Let’s look at each step in more detail.

Note

The complete code shown in this section is available at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main/code/chapter08/pytorch_ddp.py.

Initialize and destroy the communication group

The communication group is the logical entity that PyTorch uses to define and control the distributed environment. So, the first step in coding distributed training is to initialize a communication group. This step is performed by instantiating an object...
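For reference, the following is a condensed sketch of this workflow, not the book's pytorch_ddp.py script; the "gloo" backend is an assumption, and the environment variables it relies on (RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT) are normally set by the program launcher:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Initialize the communication group; with the default "env://" method,
    # PyTorch reads RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the environment
    dist.init_process_group(backend="gloo")

    # Wrap the model so that gradients are synchronized across processes
    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)

    # ... training loop goes here ...

    # Destroy the communication group before the program exits
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```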

Quiz time!

Let’s review what we have learned in this chapter by answering a few questions. Initially, try to answer these questions without consulting the material.

Note

The answers to all these questions are available at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main/quiz/chapter08-answers.md.

Before starting the quiz, remember that this is not a test! This section aims to complement your learning process by revising and consolidating the content covered in this chapter.

Choose the correct option for the following questions.

  1. What are the two main reasons for distributing the training process?
    1. Reliability and performance improvement.
    2. Lack of memory and power consumption.
    3. Power consumption and performance improvement.
    4. Lack of memory and performance improvement.
  2. Which are the two main parallel strategies to distribute the training process?
    1. Model and data parallelism.
    2. Model and hardware parallelism.
    3. Hardware and data parallelism...

Summary

In this chapter, you learned that distributed training is indicated for accelerating the training process and for training models that do not fit into a single device's memory. Although going distributed can be a way out in both cases, we should consider applying performance improvement techniques before going distributed.

We can perform distributed training by adopting the model parallelism or data parallelism strategy. The former employs different paradigms to divide the model computation among multiple computing resources, while the latter creates model replicas to be trained over chunks of the training dataset.
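As a brief illustration of the data parallelism side of this comparison, the following hypothetical sketch uses DistributedSampler so that each model replica receives a distinct chunk of the training dataset; it assumes the communication group has already been initialized, as in the earlier sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# A toy dataset standing in for the real training data
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# Each process (replica) is assigned a distinct shard of the dataset
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards at every epoch
    for inputs, targets in loader:
        pass  # each replica runs forward/backward on its own chunk here
```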

We also learned that PyTorch relies on third-party components such as communication backends and program launchers to execute the distributed training process.

In the next chapter, we will learn how to spread out the distributed training process so that it can run on multiple CPUs located in a single machine.
