Accelerate Model Training with PyTorch 2.X
by Maicon Melo Alves

Book | Published Apr 2024 | Packt | ISBN-13 9781805120100 | 230 pages | 1st Edition

Table of Contents

Preface
Part 1: Paving the Way
  Chapter 1: Deconstructing the Training Process
  Chapter 2: Training Models Faster
Part 2: Going Faster
  Chapter 3: Compiling the Model
  Chapter 4: Using Specialized Libraries
  Chapter 5: Building an Efficient Data Pipeline
  Chapter 6: Simplifying the Model
  Chapter 7: Adopting Mixed Precision
Part 3: Going Distributed
  Chapter 8: Distributed Training at a Glance
  Chapter 9: Training with Multiple CPUs
  Chapter 10: Training with Multiple GPUs
  Chapter 11: Training with Multiple Machines
Index
Other Books You May Enjoy

Training with Multiple CPUs

When we think about accelerating the model-building process, machines endowed with GPU devices immediately come to mind. What if I told you that running distributed training on machines equipped only with multicore processors is not only possible but also advantageous?

Although the performance improvement obtained from GPUs is incomparable, we should not dismiss the computing power provided by modern CPUs. Processor vendors have continuously increased the number of computing cores per CPU, in addition to creating sophisticated mechanisms to handle contention for access to shared resources.

Using CPUs to run distributed training is especially interesting when we do not have easy access to GPU devices. Learning about this topic is therefore vital to enriching our knowledge of distributed training.

In this chapter, we show how to execute the distributed training process on multiple CPUs in a single machine, first by adopting a general approach and then by using the Intel oneCCL backend.

Here is what...

Technical requirements

You can find the complete code for the examples mentioned in this chapter in the book’s GitHub repository at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main.

You can execute this notebook in your favorite environment, such as Google Colab or Kaggle.

Why distribute the training on multiple CPUs?

At first sight, distributing the training process among multiple CPUs in a single machine sounds slightly confusing. After all, we could simply increase the number of threads used in the training process to occupy all of the available CPUs (computing cores).

However, as the famous Brazilian poet Carlos Drummond de Andrade wrote, “In the middle of the road there was a stone. There was a stone in the middle of the road.” Let’s see what happens to the training process when we just increase the number of threads on a machine with multiple cores.

Why not increase the number of threads?

In Chapter 4, Using Specialized Libraries, we learned that PyTorch relies on OpenMP to accelerate the training process through multithreading. OpenMP assigns threads to physical cores with the aim of improving the performance of the training process.
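For reference, here is a minimal sketch (not taken from the book’s examples) showing how to inspect and set the number of threads PyTorch uses for intra-op parallelism; setting the OMP_NUM_THREADS environment variable before launching the process has a similar effect:

    import torch

    # Number of threads PyTorch currently uses for intra-op parallelism
    # (parallelizing individual operations such as matrix multiplication)
    print(torch.get_num_threads())

    # Explicitly set the thread count; PyTorch/OpenMP will spread these
    # threads across the available physical cores
    torch.set_num_threads(8)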

So, if we have a certain number of available computing cores...

Implementing distributed training on multiple CPUs

This section shows how to implement and run distributed training on multiple CPUs using Gloo, a simple yet powerful communication backend.

The Gloo communication backend

In Chapter 8, Distributed Training at a Glance, we learned that PyTorch relies on backends to control the communication among the devices and machines involved in distributed training.

The most basic communication backend supported by PyTorch is called Gloo. This backend comes with PyTorch by default and does not require any particular configuration. The Gloo backend is a collective communication library created by Facebook, and it is now an open-source project governed by the BSD license.

Note

You can find the source code of Gloo at http://github.com/facebookincubator/gloo.

As Gloo is very simple to use and available in PyTorch by default, it appears to be the first option for running distributed training in an environment comprising only CPUs and machines...
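As a reference, the following minimal sketch (a simplified stand-in for the book’s example, using a toy model and random data) initializes the Gloo backend and wraps the model with DistributedDataParallel so that gradients are synchronized across processes:

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # torchrun sets the environment variables (RANK, WORLD_SIZE,
    # MASTER_ADDR, and so on) that init_process_group reads by default
    dist.init_process_group(backend="gloo")

    model = nn.Linear(10, 1)      # toy model used only for illustration
    ddp_model = DDP(model)        # gradients are all-reduced across processes

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    inputs = torch.randn(32, 10)  # random batch local to this process
    targets = torch.randn(32, 1)

    optimizer.zero_grad()
    loss = criterion(ddp_model(inputs), targets)
    loss.backward()               # triggers the gradient synchronization
    optimizer.step()

    dist.destroy_process_group()

Saving this as, say, train_gloo.py (a hypothetical filename), we could launch four processes on one machine with torchrun --nproc_per_node=4 train_gloo.py.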

Getting faster with Intel oneCCL

The results shown in Table 9.2 attest that Gloo fulfills the role of the communication backend for the distributed training process in PyTorch very well.

Even so, there is another communication backend option that can go even faster on Intel platforms: the Intel oneCCL collective communication library. In this section, we will learn what this library is and how to use it as a communication backend for PyTorch.

What is Intel oneCCL?

Intel oneCCL (oneAPI Collective Communications Library) is a collective communication library created and maintained by Intel. Like Gloo, oneCCL provides collective communication primitives such as the so-called All-reduce.
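To make the primitive concrete, here is an illustrative sketch (not from the book) of what All-reduce does: every process contributes a tensor, and all of them receive the element-wise reduction. It uses Gloo only because that backend requires no extra installation:

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")

    # Each rank contributes a different tensor...
    tensor = torch.ones(3) * (dist.get_rank() + 1)

    # ...and after All-reduce, every rank holds the element-wise sum.
    # With two processes: [1, 1, 1] + [2, 2, 2] -> [3, 3, 3] on both ranks
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {tensor}")

    dist.destroy_process_group()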

Naturally, Intel oneCCL is optimized to run on Intel platforms, though this does not necessarily mean it will not work on other platforms. We can use this library to provide collective communication among the processes executing on the same machine (intra-node communication) or the...
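As a hedged sketch of the setup (package and import names vary across versions, so check Intel’s documentation for the one matching your PyTorch build), importing the oneCCL bindings registers a "ccl" backend that init_process_group can then use:

    import torch.distributed as dist

    # The import registers the "ccl" backend with PyTorch; the package is
    # typically installed as oneccl_bind_pt and imported under this name
    import oneccl_bindings_for_pytorch  # noqa: F401

    dist.init_process_group(backend="ccl")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} using ccl")
    dist.destroy_process_group()

From this point on, the training script is the same as with Gloo; only the backend string changes.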

Quiz time!

Let’s review what we have learned in this chapter by answering a few questions. First, try to answer these questions without consulting the material.

Note

The answers to all these questions are available at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main/quiz/chapter09-answers.md.

Before starting the quiz, remember that it is not a test at all! This section aims to complement your learning process by revising and consolidating the content covered in this chapter.

Choose the correct option for the following questions.

  1. In multicore systems, we can improve the performance of the training process by increasing the number of threads used by PyTorch. Concerning this topic, which of the following can we affirm?
     a) After crossing a certain number of threads, the performance improvement can deteriorate or stay the same.
     b) The performance improvement always keeps rising, no matter the number of threads.
     c) There is no performance...

Summary

In this chapter, we learned that distributing the training process across multiple computing cores can be more advantageous than increasing the number of threads used in traditional training. This happens because PyTorch can hit a limit on the level of parallelism employed in the regular training process.

To distribute the training among multiple computing cores located on a single machine, we can use Gloo, a simple communication backend that comes with PyTorch by default. The results showed that distributed training with Gloo achieved a performance improvement of 25% while retaining the same model accuracy.

We also learned that oneCCL, an Intel collective communication library, can accelerate the training process even more when executed on Intel platforms. With Intel oneCCL as the communication backend, we reduced the training time by more than 40%. If we are willing to give up a little model accuracy, it is possible to train the model twice as fast.

In the...
