You're reading from Transformers for Natural Language Processing - Second Edition

Product type: Book
Published in: Mar 2022
Publisher: Packt
ISBN-13: 9781803247335
Edition: 2nd Edition

Author: Denis Rothman

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, designing one of the very first patented word2matrix embeddings and patented AI conversational agents. He began his career authoring one of the first AI cognitive Natural Language Processing (NLP) chatbots, applied as an automated language teacher for Moet et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an Advanced Planning and Scheduling (APS) solution used worldwide.
Appendix II — Hardware Constraints for Transformer Models

Transformer models could not exist without optimized hardware. Memory and disk management design remain critical components. However, computing power remains a prerequisite. It would be nearly impossible to train the original Transformer described in Chapter 2, Getting Started with the Architecture of the Transformer Model, without GPUs. GPUs are at the center of the battle for efficient transformer models.

This appendix to Chapter 3, Fine-Tuning BERT Models, will take you through the importance of GPUs in three steps:

  • The architecture and scale of transformers
  • CPUs versus GPUs
  • Implementing GPUs in PyTorch, as an example of how an optimized framework puts them to use

The Architecture and Scale of Transformers

A hint about hardware-driven design appears in The architecture of multi-head attention section of Chapter 2, Getting Started with the Architecture of the Transformer Model:

“However, we would only get one point of view at a time by analyzing the sequence with one d_model block. Furthermore, it would take quite some calculation time to find other perspectives.

A better way is to divide the d_model = 512 dimensions of each word x_n of x (all the words of a sequence) into 8 blocks of d_k = 64 dimensions.

We then can run the 8 “heads” in parallel to speed up the training and obtain 8 different representation subspaces of how each word relates to another:


Figure II.1: Multi-head representations

You can see that there are now 8 heads running in parallel.
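The head split described above can be sketched in a few lines. This is a minimal illustration using NumPy (the shapes and names are mine, not the book's notation): the d_model = 512 dimensions of each token are reshaped into 8 heads of d_k = 64 dimensions each, so all 8 subspaces can be processed in parallel.

```python
import numpy as np

# Divide d_model = 512 into 8 heads of d_k = 64 dimensions each
d_model, heads = 512, 8
d_k = d_model // heads           # 64 dimensions per head

seq_len = 4                      # a toy sequence of 4 tokens
x = np.random.rand(seq_len, d_model)

# Reshape (seq_len, 512) -> (8, seq_len, 64): one representation
# subspace per head, ready for parallel processing
x_heads = x.reshape(seq_len, heads, d_k).transpose(1, 0, 2)
print(x_heads.shape)             # (8, 4, 64)
```

After the heads are computed, the 8 outputs are concatenated back into a single (seq_len, 512) matrix, which is what the multi-head attention sublayer passes on.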

We can easily see the motivation for forcing the attention heads to learn 8 different perspectives. However, digging deeper into the motivations of the...

Why GPUs are so special

A clue to GPU-driven design emerges in The architecture of multi-head attention section of Chapter 2, Getting Started with the Architecture of the Transformer Model.

Attention is defined as “Scaled Dot-Product Attention,” which is represented in the following equation into which we plug Q, K, and V:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

We can now conclude the following:

  • Attention heads are designed for parallel computing
  • Attention heads are based on matmul, matrix multiplication
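Both conclusions can be seen directly in code. The following is an illustrative NumPy sketch of scaled dot-product attention (the function name and toy shapes are mine): the two heavy operations are exactly the matrix multiplications that GPUs accelerate.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k))V -- two matmuls around a softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # first matmul, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # second matmul

seq_len, d_k = 4, 64
Q = np.random.rand(seq_len, d_k)
K = np.random.rand(seq_len, d_k)
V = np.random.rand(seq_len, d_k)
out = scaled_dot_product_attention(Q, K, V)           # shape (4, 64)
```

Each attention head runs this computation independently of the others, which is why the heads can be dispatched to parallel hardware.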

GPUs are designed for parallel computing

A CPU (central processing unit) is optimized for serial processing. If we ran the attention heads through serial processing, it would take far longer to train an efficient transformer model. Very small educational transformers can run on CPUs, but they do not qualify as state-of-the-art models.

A GPU (graphics processing unit) is designed for parallel processing. Transformer models were designed for parallel processing (GPUs), not serial processing (CPUs).

GPUs are also designed for matrix multiplication

NVIDIA GPUs, for example, contain tensor cores that accelerate matrix operations. A significant proportion of artificial intelligence algorithms, including transformer models, rely on matrix operations, so NVIDIA GPUs offer a goldmine of hardware optimization for them.

Google’s Tensor Processing Unit (TPU) is the equivalent of NVIDIA’s GPUs. TensorFlow will optimize the use of tensors when using TPUs.

BERT-Base (110M parameters) was initially trained with 16 TPU chips. BERT-Large (340M parameters) was trained with 64 TPU chips. For more on training...

Implementing GPUs in code

PyTorch, among other frameworks, manages GPUs. PyTorch contains tensors just as TensorFlow does. A tensor may look like a NumPy np.array(). However, NumPy is not fit for parallel processing; tensors can use the parallel processing features of GPUs.
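The difference is easy to demonstrate. In this small sketch (the array values are arbitrary), a NumPy array and a PyTorch tensor hold the same data, but only the tensor can be transferred to a GPU when one is available:

```python
import numpy as np
import torch

# A NumPy array and a PyTorch tensor look alike...
a = np.array([[1.0, 2.0], [3.0, 4.0]])
t = torch.from_numpy(a)          # shares memory with the NumPy array

# ...but only the tensor can be moved to a GPU for parallel processing
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t = t.to(device)                 # no-op on CPU; copies to GPU memory if one exists
print(t.device)
```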

Tensors open the doors to distributed data over GPUs in PyTorch, among other frameworks: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

In the Chapter03 notebook, BERT_Fine_Tuning_Sentence_Classification_GPU.ipynb, we used CUDA (Compute Unified Device Architecture) to communicate with NVIDIA GPUs. CUDA is an NVIDIA platform for general computing on GPUs. Specific instructions can be added to our source code. For more, see https://developer.nvidia.com/cuda-zone.

In the Chapter03 notebook, we used CUDA instructions to transfer our model and data to NVIDIA GPUs. PyTorch has an instruction to specify the device we wish to use: torch.device.
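The pattern looks roughly like the following sketch. The notebook's actual model is BERT; here a tiny nn.Linear layer stands in for it, purely to show the device-selection and transfer steps:

```python
import torch
import torch.nn as nn

# Select the device: CUDA (NVIDIA GPU) if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2).to(device)    # transfer the model's weights to the device
batch = torch.rand(4, 8).to(device)   # transfer a batch of data to the same device
logits = model(batch)                 # the forward pass runs on the GPU when one is available
```

Both the model and the data must live on the same device, which is why the notebook calls .to(device) on each.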

For more, see https://pytorch.org/docs...

Testing GPUs with Google Colab

In this section, I describe informal tests I ran to illustrate the potential of GPUs. We’ll use the same Chapter03 notebook: BERT_Fine_Tuning_Sentence_Classification_GPU.ipynb.

I ran the notebook on three scenarios:

  • Google Colab Free with a CPU
  • Google Colab Free with a GPU
  • Google Colab Pro

Google Colab Free with a CPU

It is nearly impossible to fine-tune or train a transformer model with millions or billions of parameters on a CPU. CPUs are mostly sequential. Transformer models are designed for parallel processing.

In the Runtime menu and Change Runtime Type submenu, you can select a hardware accelerator: None (CPU), GPU, or TPU.
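After changing the runtime type, a quick check confirms what Colab actually granted. This snippet is a small assumption-free probe using PyTorch's standard CUDA API; the device name printed varies per session:

```python
import torch

# Confirm whether the selected Colab runtime provides a GPU
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU -- running on the CPU")
```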

This test was run with None (CPU), as shown in Figure II.2:


Figure II.2: Selecting a hardware accelerator

When the notebook reaches the training loop, it slows down right from the start:

Figure II.3: Training loop

After 15 minutes, nothing has really happened.

CPUs are not designed for parallel processing. Transformer models are designed for parallel processing, so, apart from toy models, they require GPUs.

Google Colab Free with a GPU

Let’s go back to the notebook settings to select a GPU.


Figure II.4: Selecting a GPU

At the time of writing, I tested Google Colab, and an NVIDIA...

Google Colab Pro with a GPU

The VM activated with Google Colab Pro provided an NVIDIA P100 GPU, as shown in Figure II.7. That is interesting because the original Transformer was trained with 8 NVIDIA P100 GPUs, as stated in Vaswani et al. (2017), Attention Is All You Need. It took 12 hours to train the base model, with its 65×10^6 parameters, on 8 GPUs:


Figure II.7: The Google Colab Pro VM was provided with a P100 GPU

The training loop time was considerably reduced and lasted less than 10 minutes, as shown in Figure II.8:


Figure II.8: Training loop with a P100 GPU

Join our book’s Discord space

Join the book’s Discord workspace for a monthly Ask me Anything session with the authors:

https://www.packt.link/Transformers
