You're reading from Transformers for Natural Language Processing - Second Edition

Product type: Book
Published in: Mar 2022
Publisher: Packt
ISBN-13: 9781803247335
Edition: 2nd Edition

Author: Denis Rothman

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, designing one of the very first patented word2matrix embeddings and patented AI conversational agents. He began his career authoring one of the first AI cognitive Natural Language Processing (NLP) chatbots, applied as an automated language teacher for Moet et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an Advanced Planning and Scheduling (APS) solution used worldwide.
Appendix II — Hardware Constraints for Transformer Models

Transformer models could not exist without optimized hardware. Memory and disk management design remain critical components. However, computing power remains a prerequisite. It would be nearly impossible to train the original Transformer described in Chapter 2, Getting Started with the Architecture of the Transformer Model, without GPUs. GPUs are at the center of the battle for efficient transformer models.

This appendix to Chapter 3, Fine-Tuning BERT Models, will take you through the importance of GPUs in three steps:

  • The architecture and scale of transformers
  • CPUs versus GPUs
  • Implementing GPUs in PyTorch, as an example of how an optimized framework puts them to use

The Architecture and Scale of Transformers

A hint about hardware-driven design appears in The architecture of multi-head attention section of Chapter 2, Getting Started with the Architecture of the Transformer Model:

“However, we would only get one point of view at a time by analyzing the sequence with one d_model block. Furthermore, it would take quite some calculation time to find other perspectives.

A better way is to divide the d_model = 512 dimensions of each word x_n of x (all the words of a sequence) into 8 blocks of d_k = 64 dimensions.

We then can run the 8 “heads” in parallel to speed up the training and obtain 8 different representation subspaces of how each word relates to another:


Figure II.1: Multi-head representations

You can see that there are now 8 heads running in parallel.
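The head split described above can be sketched in a few lines. This is a minimal illustration using NumPy (the shapes and names are mine, not the book's notation): the d_model = 512 dimensions of each token are reshaped into 8 heads of d_k = 64 dimensions each, so all 8 subspaces can be processed in parallel.

```python
import numpy as np

# Divide d_model = 512 into 8 heads of d_k = 64 dimensions each
d_model, heads = 512, 8
d_k = d_model // heads           # 64 dimensions per head

seq_len = 4                      # a toy sequence of 4 tokens
x = np.random.rand(seq_len, d_model)

# Reshape (seq_len, 512) -> (8, seq_len, 64): one representation
# subspace per head, ready for parallel processing
x_heads = x.reshape(seq_len, heads, d_k).transpose(1, 0, 2)
print(x_heads.shape)             # (8, 4, 64)
```

After the heads are computed, the 8 outputs are concatenated back into a single (seq_len, 512) matrix, which is what the multi-head attention sublayer passes on.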

We can easily see the motivation for forcing the attention heads to learn 8 different perspectives. However, digging deeper into the motivations of the...

Why GPUs are so special

A clue to GPU-driven design emerges in The architecture of multi-head attention section of Chapter 2, Getting Started with the Architecture of the Transformer Model.

Attention is defined as “Scaled Dot-Product Attention,” which is represented in the following equation into which we plug Q, K, and V:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

We can now conclude the following:

  • Attention heads are designed for parallel computing
  • Attention heads are based on matmul, matrix multiplication
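Both conclusions can be seen directly in code. The following is an illustrative NumPy sketch of scaled dot-product attention (the function name and toy shapes are mine): the two heavy operations are exactly the matrix multiplications that GPUs accelerate.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k))V -- two matmuls around a softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # first matmul, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # second matmul

seq_len, d_k = 4, 64
Q = np.random.rand(seq_len, d_k)
K = np.random.rand(seq_len, d_k)
V = np.random.rand(seq_len, d_k)
out = scaled_dot_product_attention(Q, K, V)           # shape (4, 64)
```

Each attention head runs this computation independently of the others, which is why the heads can be dispatched to parallel hardware.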

GPUs are designed for parallel computing

A CPU (central processing unit) is optimized for serial processing. If we ran the attention heads through serial processing, it would take far longer to train an efficient transformer model. Very small educational transformers can run on CPUs, but they do not qualify as state-of-the-art models.

A GPU (graphics processing unit) is designed for parallel processing. Transformer models were designed for parallel processing (GPUs), not serial processing (CPUs).

GPUs are also designed for matrix multiplication

NVIDIA GPUs, for example, contain tensor cores that accelerate matrix operations. A significant proportion of artificial intelligence algorithms, including transformer models, rely on matrix operations, so NVIDIA GPUs offer a goldmine of hardware optimization for them.

Google’s Tensor Processing Unit (TPU) is the equivalent of NVIDIA’s GPUs. TensorFlow will optimize the use of tensors when using TPUs.

BERT-Base (110M parameters) was initially trained with 16 TPU chips. BERT-Large (340M parameters) was trained with 64 TPU chips. For more on training...

Implementing GPUs in code

PyTorch, among other frameworks, manages GPUs. PyTorch contains tensors just as TensorFlow does. A tensor may look like a NumPy np.array(). However, NumPy is not fit for parallel processing; tensors can use the parallel processing features of GPUs.
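The difference is easy to demonstrate. In this small sketch (the array values are arbitrary), a NumPy array and a PyTorch tensor hold the same data, but only the tensor can be transferred to a GPU when one is available:

```python
import numpy as np
import torch

# A NumPy array and a PyTorch tensor look alike...
a = np.array([[1.0, 2.0], [3.0, 4.0]])
t = torch.from_numpy(a)          # shares memory with the NumPy array

# ...but only the tensor can be moved to a GPU for parallel processing
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t = t.to(device)                 # no-op on CPU; copies to GPU memory if one exists
print(t.device)
```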

Tensors open the doors to distributed data over GPUs in PyTorch, among other frameworks: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

In the Chapter03 notebook, BERT_Fine_Tuning_Sentence_Classification_GPU.ipynb, we used CUDA (Compute Unified Device Architecture) to communicate with NVIDIA GPUs. CUDA is an NVIDIA platform for general computing on GPUs. Specific instructions can be added to our source code. For more, see https://developer.nvidia.com/cuda-zone.

In the Chapter03 notebook, we used CUDA instructions to transfer our model and data to NVIDIA GPUs. PyTorch has an instruction to specify the device we wish to use: torch.device.
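The pattern looks roughly like the following sketch. The notebook's actual model is BERT; here a tiny nn.Linear layer stands in for it, purely to show the device-selection and transfer steps:

```python
import torch
import torch.nn as nn

# Select the device: CUDA (NVIDIA GPU) if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2).to(device)    # transfer the model's weights to the device
batch = torch.rand(4, 8).to(device)   # transfer a batch of data to the same device
logits = model(batch)                 # the forward pass runs on the GPU when one is available
```

Both the model and the data must live on the same device, which is why the notebook calls .to(device) on each.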

For more, see https://pytorch.org/docs...

Testing GPUs with Google Colab

In this section, I describe informal tests I ran to illustrate the potential of GPUs. We’ll use the same Chapter03 notebook: BERT_Fine_Tuning_Sentence_Classification_GPU.ipynb.

I ran the notebook on three scenarios:

  • Google Colab Free with a CPU
  • Google Colab Free with a GPU
  • Google Colab Pro

Google Colab Free with a CPU

It is nearly impossible to fine-tune or train a transformer model with millions or billions of parameters on a CPU. CPUs are mostly sequential. Transformer models are designed for parallel processing.

In the Runtime menu and Change Runtime Type submenu, you can select a hardware accelerator: None (CPU), GPU, or TPU.
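After changing the runtime type, a quick check confirms what Colab actually granted. This snippet is a small assumption-free probe using PyTorch's standard CUDA API; the device name printed varies per session:

```python
import torch

# Confirm whether the selected Colab runtime provides a GPU
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU -- running on the CPU")
```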

This test was run with None (CPU), as shown in Figure II.2:


Figure II.2: Selecting a hardware accelerator

When the notebook reaches the training loop, it slows down right from the start:

Figure II.3: Training loop

After 15 minutes, nothing has really happened.

CPUs are not designed for parallel processing. Transformer models are designed for parallel processing, so, apart from toy models, they require GPUs.

Google Colab Free with a GPU

Let’s go back to the notebook settings to select a GPU.


Figure II.4: Selecting a GPU

At the time of writing, I tested Google Colab, and an NVIDIA...

Google Colab Pro with a GPU

The VM activated with Google Colab Pro provided an NVIDIA P100 GPU, as shown in Figure II.7. That is interesting because the original Transformer was trained with 8 NVIDIA P100 GPUs, as stated in Vaswani et al. (2017), Attention Is All You Need. It took 12 hours to train the base model, with its 65×10^6 parameters, on 8 GPUs:


Figure II.7: The Google Colab Pro VM was provided with a P100 GPU

The training loop time was considerably reduced and lasted less than 10 minutes, as shown in Figure II.8:


Figure II.8: Training loop with a P100 GPU

Join our book’s Discord space

Join the book’s Discord workspace for a monthly Ask me Anything session with the authors:

https://www.packt.link/Transformers
