
You're reading from  Accelerate Model Training with PyTorch 2.X

Product type: Book
Published in: Apr 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781805120100
Edition: 1st
Author: Maicon Melo Alves

Dr. Maicon Melo Alves is a senior systems analyst and academic professor specializing in High Performance Computing (HPC) systems. Over the last five years, he has become interested in understanding how HPC systems are used to support Artificial Intelligence applications. To deepen this understanding, he completed an MBA in Data Science at the Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio) in 2021. He has over 25 years of experience in IT infrastructure and has worked with HPC systems at Petrobras, the Brazilian state energy company, since 2006. He obtained his D.Sc. in Computer Science from the Fluminense Federal University (UFF) in 2018 and has published three books as well as papers in international HPC journals.

Training with Multiple CPUs

When accelerating the model-building process, we immediately think of machines endowed with GPU devices. What if I told you that running distributed training on machines equipped only with multicore processors is possible and advantageous?

Although the performance improvement obtained from GPUs is incomparable, we should not disdain the computing power provided by modern CPUs. Processor vendors have continuously increased the number of computing cores on CPUs, and have created sophisticated mechanisms to manage contention for access to shared resources.

Using CPUs to run distributed training is especially interesting for cases where we do not have easy access to GPU devices. Thus, learning this topic is vital to enrich our knowledge about distributed training.

In this chapter, we show how to execute the distributed training process on multiple CPUs in a single machine by adopting a general approach and using the Intel oneCCL backend.

Here is what...

Technical requirements

You can find the complete code for the examples in this chapter in the book’s GitHub repository at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main.

You can execute this notebook in your favorite environment, such as Google Colab or Kaggle.

Why distribute the training on multiple CPUs?

At first sight, distributing the training process among multiple CPUs in a single machine sounds slightly confusing. After all, we could simply increase the number of threads used in the training process to occupy all the available computing cores.

However, as said by Carlos Drummond de Andrade, a famous Brazilian poet, “In the middle of the road there was a stone. There was a stone in the middle of the road.” Let’s see what happens to the training process when we just increase the number of threads in a machine with multiple cores.

Why not increase the number of threads?

In Chapter 4, Using Specialized Libraries, we learned that PyTorch relies on OpenMP to accelerate the training process through multithreading. OpenMP assigns threads to physical cores with the goal of improving training performance.

So, if we have a certain number of available computing cores...
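To see this effect in practice, here is a minimal sketch (the matrix size, repetition count, and thread counts are arbitrary illustrative choices, not taken from the book) that times the same matrix multiplication under different intra-op thread counts; beyond a certain point, adding threads typically stops helping:

```python
import time
import torch

# Arbitrary workload for illustration: a square matrix multiplication.
x = torch.randn(1024, 1024)

for n in (1, 2, 4, 8):
    torch.set_num_threads(n)  # caps PyTorch's intra-op (OpenMP) thread pool
    start = time.perf_counter()
    for _ in range(10):
        torch.mm(x, x)
    print(f"{n} thread(s): {time.perf_counter() - start:.3f}s")
```

On a machine with few physical cores, the timings for the higher thread counts tend to plateau or even worsen, which is the motivation for the alternative discussed in this chapter.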

Implementing distributed training on multiple CPUs

This section shows how to implement and run the distributed training on multiple CPUs using Gloo, a simple yet powerful communication backend.

The Gloo communication backend

In Chapter 8, Distributed Training at a Glance, we learned that PyTorch relies on backends to control the communication among devices and machines involved in distributed training.

The most basic communication backend supported by PyTorch is called Gloo. This backend comes with PyTorch by default and does not require any particular configuration. Gloo is a collective communication library created by Facebook, and it is now an open-source project released under the BSD license.

Note

You can find the source code of Gloo at http://github.com/facebookincubator/gloo.

As Gloo is very simple to use and available in PyTorch by default, it is a natural first option for running distributed training in an environment comprising only CPUs and machines...
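As a minimal sketch of how Gloo is wired up (the process count, rendezvous address, port, and tensor values below are illustrative choices, not from the book), each process initializes a process group with the "gloo" backend and can then exchange CPU tensors via collective operations such as all-reduce:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    # Rendezvous settings; the address and port are arbitrary local choices.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank contributes (rank + 1); the SUM all-reduce combines them,
    # so with 2 ranks every process ends up holding 1 + 2 = 3.
    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # number of CPU processes to spawn
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

In a real training script, each rank would wrap its model in DistributedDataParallel after this initialization, and the gradient averaging would happen through exactly this kind of all-reduce.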

Getting faster with Intel oneCCL

The results shown in Table 9.2 attest that Gloo fulfills the role of the communication backend for the distributed training process in PyTorch very well.

Even so, there is another option for the communication backend to go even faster on Intel platforms: the Intel oneCCL collective communication library. In this section, we will learn what this library is and how to use it as a communication backend for PyTorch.

What is Intel oneCCL?

Intel oneCCL is a collective communication library created and maintained by Intel. Like Gloo, oneCCL provides collective communication primitives such as the well-known all-reduce operation.

Naturally, Intel oneCCL is optimized to run on Intel platform environments, though this does not necessarily mean it will not work on other platforms. We can use this library to provide collective communication among the processes executing in the same machine (intra-node communication) or the...

Quiz time!

Let’s review what we have learned in this chapter by answering a few questions. Try to answer them without consulting the material first.

Note

The answers to all these questions are available at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main/quiz/chapter09-answers.md.

Before starting the quiz, remember that it is not a test at all! This section aims to complement your learning process by revising and consolidating the content covered in this chapter.

Choose the correct option for the following questions.

  1. In multicore systems, we can improve the performance of the training process by increasing the number of threads used by PyTorch. Concerning this topic, we can affirm which of the following?
    1. Beyond a certain number of threads, the performance improvement can stall or even deteriorate.
    2. The performance improvement always keeps rising, no matter the number of threads.
    3. There is no performance...

Summary

In this chapter, we learned that distributing the training process across multiple computing cores can be more advantageous than simply increasing the number of threads used in traditional training. This happens because PyTorch can hit a limit on the level of parallelism it can exploit in the regular training process.

To distribute the training among multiple computing cores located in a single machine, we can use Gloo, a simple communication backend that comes with PyTorch by default. The results showed that distributed training with Gloo achieved a performance improvement of 25% while retaining the same model accuracy.

We also learned that oneCCL, an Intel collective communication library, can accelerate the training process even more when executed on Intel platforms. With Intel oneCCL as the communication backend, we reduced the training time by more than 40%. If we are willing to reduce the model accuracy a little bit, it is possible to train the model two times faster.

In the...
