
You're reading from Accelerate Model Training with PyTorch 2.X

Product type: Book
Published in: Apr 2024
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781805120100
Edition: 1st
Author: Maicon Melo Alves

Dr. Maicon Melo Alves is a senior system analyst and academic professor specializing in High Performance Computing (HPC) systems. Over the last five years, he has become interested in understanding how HPC systems are used to power Artificial Intelligence applications. To deepen this understanding, in 2021 he completed an MBA in Data Science at the Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio). He has over 25 years of experience in IT infrastructure and, since 2006, has worked with HPC systems at Petrobras, the Brazilian state energy company. He obtained his D.Sc. degree in Computer Science from the Fluminense Federal University (UFF) in 2018 and has published three books as well as papers in international HPC journals.


Building an Efficient Data Pipeline

Machine learning is grounded on data. Simply put, the training process feeds the neural network with a bunch of data, such as images, videos, sound, and text. Thus, apart from the training algorithm itself, data loading is an essential part of the entire model-building process.

It turns out that deep learning models deal with huge amounts of data, such as thousands of images and terabytes of text sequences. As a consequence, tasks related to data loading, preparation, and augmentation can severely delay the training process as a whole. So, to overcome a potential bottleneck in the model-building process, we must guarantee an uninterrupted flow of dataset samples to the training process.

In this chapter, we’ll explain how to build an efficient data pipeline to keep the training process running smoothly. The main idea is to prevent the training process from being stalled by data-related tasks.

Here is what you will learn as part of...

Technical requirements

You can find the complete code examples mentioned in this chapter in this book’s GitHub repository at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main.

You can execute this chapter's notebook in your favorite environment, such as Google Colab or Kaggle.

Why do we need an efficient data pipeline?

We'll start this chapter by showing why an efficient data pipeline matters. In the next few subsections, you will learn what a data pipeline is and how it can impact the performance of the training process.

What is a data pipeline?

As you learned in Chapter 1, Deconstructing the Training Process, the training process is composed of four phases: forward, loss calculation, backward, and optimization. The training algorithm iterates over the dataset samples until it completes an epoch. Nevertheless, there is an additional phase we excluded from that explanation: data loading.

On each iteration, the forward phase invokes the data loading process to fetch the samples required for the current training step, as shown in Figure 5.1:

Figure 5.1 – Data loading process

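The interplay between data loading and the forward phase can be sketched with the torch.utils.data API. The toy dataset and model below are illustrative stand-ins, not the book's own examples:

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical toy dataset: 64 random samples with 8 features each.
class ToyDataset(Dataset):
    def __init__(self, n_samples=64, n_features=8):
        self.x = torch.randn(n_samples, n_features)
        self.y = torch.randint(0, 2, (n_samples,))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        # Called by the DataLoader on every step to fetch one sample.
        return self.x[idx], self.y[idx]

loader = DataLoader(ToyDataset(), batch_size=16, shuffle=True)
model = torch.nn.Linear(8, 2)

for inputs, labels in loader:  # data loading feeds each iteration
    outputs = model(inputs)    # forward phase
    loss = torch.nn.functional.cross_entropy(outputs, labels)
    loss.backward()            # backward phase
```

Every pass through the `for` loop triggers one round of data loading, which is why a slow `__getitem__` (or slow disk I/O behind it) stalls the whole training step.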

Accelerating data loading

Accelerating data loading is crucial to get an efficient data pipeline. In general, the following two changes are enough to get the work done:

  • Optimizing data transfer between the CPU and the GPU
  • Increasing the number of workers in the data pipeline

Put that way, these changes may sound harder to implement than they are. In fact, they are quite simple – we just need to pass a couple of extra parameters when creating the DataLoader instance for the data pipeline. We will cover them in the following subsections.
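Both knobs are plain constructor arguments on DataLoader. A minimal sketch – the TensorDataset here is a placeholder, and the specific values of 64, 2, and so on are illustrative choices, not recommendations from the book:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1,024 random samples with 8 features each.
dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,  # stage batches in page-locked (pinned) host memory
    num_workers=2,    # spawn 2 worker processes to load samples in parallel
)
```

Note that the worker processes are only spawned when you start iterating over the loader, so constructing it is cheap; the right number of workers depends on your CPU core count and the cost of your per-sample transformations.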

Optimizing data transfer to the GPU

To transfer data from main memory to the GPU, and vice versa, the device driver must ask the operating system to pin, or lock, a portion of memory. After getting access to that pinned memory, the device driver copies the data from its original location to the GPU, using the pinned region as a staging area:

Figure 5.6 – Data transfer between main memory and GPU

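A short sketch of how pinned memory shows up in user code: a tensor allocated in page-locked memory can be copied to the GPU asynchronously via `non_blocking=True`. The shapes are illustrative, and the CUDA check lets the snippet degrade gracefully on CPU-only machines:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# pin_memory() requires a CUDA-enabled build, so guard it.
batch = torch.randn(64, 8)
if torch.cuda.is_available():
    batch = batch.pin_memory()  # page-locked staging area in host memory

# With a pinned source tensor, non_blocking=True lets the host-to-device
# copy overlap with computation; on CPU it is simply ignored.
batch_on_device = batch.to(device, non_blocking=True)
```

This is the same mechanism the DataLoader's `pin_memory=True` flag uses under the hood: batches come out of the loader already pinned, so the transfer to the GPU can skip the extra staging copy.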

Quiz time!

Let’s review what we have learned in this chapter by answering a few questions. Initially, try to answer these questions without consulting the material.

Note

The answers to all these questions are available at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main/quiz/chapter05-answers.md.

Before starting this quiz, remember that this is not a test! This section aims to complement your learning process by revising and consolidating the content covered in this chapter.

Choose the correct options for the following questions:

  1. What three main tasks are executed during the data loading process?
    1. Loading, scaling, and resizing.
    2. Scaling, resizing, and loading.
    3. Resizing, loading, and filtering.
    4. Loading, preparation, and augmentation.
  2. Data loading feeds which phase of the training process?
    1. Forward.
    2. Backward.
    3. Optimization.
    4. Loss calculation.
  3. Which components provided by the torch.utils.data API can be used to implement a...

Summary

In this chapter, you learned that the data pipeline is an important piece of the model-building process. An efficient data pipeline is essential to keep the training process running without interruptions. Besides optimizing data transfer to the GPU through memory pinning, you also learned how to enable and configure a multi-worker data pipeline.

In the next chapter, you will learn how to reduce model complexity to speed up the training process without penalizing model quality.
