Transformers for Natural Language Processing - Second Edition

Product type Book
Published in Mar 2022
Publisher Packt
ISBN-13 9781803247335
Pages 602 pages
Edition 2nd Edition
Author: Denis Rothman

Table of Contents (25 Chapters)

Preface
  1. What are Transformers?
  2. Getting Started with the Architecture of the Transformer Model
  3. Fine-Tuning BERT Models
  4. Pretraining a RoBERTa Model from Scratch
  5. Downstream NLP Tasks with Transformers
  6. Machine Translation with the Transformer
  7. The Rise of Suprahuman Transformers with GPT-3 Engines
  8. Applying Transformers to Legal and Financial Documents for AI Text Summarization
  9. Matching Tokenizers and Datasets
  10. Semantic Role Labeling with BERT-Based Transformers
  11. Let Your Data Do the Talking: Story, Questions, and Answers
  12. Detecting Customer Emotions to Make Predictions
  13. Analyzing Fake News with Transformers
  14. Interpreting Black Box Transformer Models
  15. From NLP to Task-Agnostic Transformer Models
  16. The Emergence of Transformer-Driven Copilots
  17. The Consolidation of Suprahuman Transformers with OpenAI’s ChatGPT and GPT-4
  18. Other Books You May Enjoy
  19. Index
Appendix I — Terminology of Transformer Models
Appendix II — Hardware Constraints for Transformer Models
Appendix III — Generic Text Completion with GPT-2
Appendix IV — Custom Text Completion with GPT-2
Appendix V — Answers to the Questions

Fine-Tuning BERT Models

In Chapter 2, Getting Started with the Architecture of the Transformer Model, we defined the building blocks of the architecture of the original Transformer. Think of the original Transformer as a model built with LEGO® bricks. The construction set contains bricks such as encoders, decoders, embedding layers, positional encoding methods, multi-head attention layers, masked multi-head attention layers, post-layer normalization, feed-forward sub-layers, and linear output layers.

The bricks come in various sizes and forms. You can spend hours building all sorts of models with the same building kit! Some constructions require only a subset of the bricks. Others add a new piece, just as we obtain additional bricks for a model built with LEGO® components.

BERT added a new piece to the Transformer building kit: a bidirectional multi-head attention sub-layer. When we humans have problems understanding a sentence, we do...

The architecture of BERT

BERT introduces bidirectional attention to transformer models. Bidirectional attention requires many other changes to the original Transformer model.

We will not go through the building blocks of transformers described in Chapter 2, Getting Started with the Architecture of the Transformer Model. You can consult Chapter 2 at any time to review an aspect of the building blocks of transformers. In this section, we will focus on the specific aspects of BERT models.

We will focus on the changes introduced by Devlin et al. (2018), which concern the encoder stack. We will first go through the encoder stack, then the preparation of the pretraining input environment. Finally, we will describe the two-step framework of BERT: pretraining and fine-tuning.

Let’s first explore the encoder stack.

The encoder stack

The first building block we will take from the original Transformer model is an encoder layer. The encoder layer, as described in Chapter...

Fine-tuning BERT

In this section, we will fine-tune a BERT model on the downstream task of Acceptability Judgments and measure its predictions with the Matthews Correlation Coefficient (MCC), which is explained in the Evaluating using Matthews Correlation Coefficient section of this chapter.
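To make the evaluation metric concrete before reaching that section, here is a minimal sketch of how MCC can be computed for binary predictions. It is not the notebook's code: the function name mcc, the toy labels, and the convention that 1 means "acceptable" and 0 means "unacceptable" are assumptions made for illustration.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1).

    Hypothetical helper for illustration; returns 0.0 when the
    denominator is zero (for example, when only one class is predicted).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy data (assumed): 1 = acceptable sentence, 0 = unacceptable sentence.
labels = [1, 1, 0, 0]
predictions = [1, 0, 0, 0]
print(round(mcc(labels, predictions), 3))  # prints 0.577
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right). In practice, a library implementation such as scikit-learn's matthews_corrcoef is likely a better choice than a hand-written helper.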

Open BERT_Fine_Tuning_Sentence_Classification_GPU.ipynb in Google Colab (make sure you have an email account). The notebook is in Chapter03 in the GitHub repository of this book.

The title of each cell in the notebook is the same as, or very close to, the title of the corresponding subsection of this chapter.

We will first examine why transformer models must take hardware constraints into account.

Hardware constraints

Transformer models require multiprocessing hardware. Go to the Runtime menu in Google Colab, select Change runtime type, and select GPU in the Hardware Accelerator drop-down list.
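Once the runtime type is set, it can be worth confirming that a GPU is actually visible before training. The following is a minimal sketch, assuming the notebook runs on PyTorch (an assumption; this excerpt does not name the framework). The helper name pick_device is hypothetical, and it falls back to the CPU when PyTorch is not installed or no GPU is found.

```python
def pick_device():
    """Return "cuda" if PyTorch is installed and sees a GPU, else "cpu".

    Hypothetical helper; assumes a PyTorch-based notebook.
    """
    try:
        import torch  # assumption: the notebook uses PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If this reports cpu in Google Colab, the Hardware Accelerator setting probably was not applied to the current session.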

Transformer models are hardware-driven. I recommend reading Appendix II, Hardware...

Summary

BERT brings bidirectional attention to transformers. Predicting sequences from left to right and masking the future tokens to train a model has serious limitations. If the masked sequence contains the meaning we are looking for, the model will produce errors. BERT attends to all of the tokens of a sequence at the same time.
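The difference between the two approaches can be pictured as a visibility matrix, where row i lists the positions that token i may attend to. The sketch below is only an illustration of that idea, not BERT's implementation: the function names and the three-token example are assumptions, and it does not model the token masking used in BERT's pretraining.

```python
def causal_mask(n):
    """Left-to-right visibility: token i sees positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional visibility: every token sees every position."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "cat", "sat"]  # hypothetical example sequence

print("left-to-right (future tokens hidden):")
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"  {token:>4} {row}")

print("bidirectional (all tokens visible):")
for token, row in zip(tokens, bidirectional_mask(len(tokens))):
    print(f"  {token:>4} {row}")
```

In the left-to-right matrix, the first token cannot use anything that follows it, which is the limitation described above. In the bidirectional matrix, every row is fully populated.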

We explored the architecture of BERT, which uses only the encoder stack of transformers. BERT was designed as a two-step framework: the first step is to pretrain a model, and the second is to fine-tune it. We fine-tuned a BERT model for an Acceptability Judgment downstream task, going through every phase of the process. First, we loaded the dataset and the necessary pretrained modules of the model. Then we trained the model and measured its performance.

Fine-tuning a pretrained model takes fewer machine resources than training downstream tasks from scratch. Fine-tuned models can...

Questions

  1. BERT stands for Bidirectional Encoder Representations from Transformers. (True/False)
  2. BERT is a two-step framework. Step 1 is pretraining. Step 2 is fine-tuning. (True/False)
  3. Fine-tuning a BERT model implies training parameters from scratch. (True/False)
  4. BERT only pretrains using all downstream tasks. (True/False)
  5. BERT pretrains with Masked Language Modeling (MLM). (True/False)
  6. BERT pretrains with Next Sentence Predictions (NSP). (True/False)
  7. BERT pretrains mathematical functions. (True/False)
  8. A question-answer task is a downstream task. (True/False)
  9. A BERT pretraining model does not require tokenization. (True/False)
  10. Fine-tuning a BERT model takes less time than pretraining. (True/False)

References

Join our book’s Discord...
