Transformers for Natural Language Processing

1 (1 reviews total)
By Denis Rothman
    What do you get with a Packt Subscription?

  • Instant access to this title and 7,500+ eBooks & Videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Free Chapter
    Fine-Tuning BERT Models
About this book

The transformer architecture has proved to be revolutionary in outperforming the classical RNN and CNN models in use today. With an apply-as-you-learn approach, Transformers for Natural Language Processing investigates in vast detail the deep learning for machine translations, speech-to-text, text-to-speech, language modeling, question answering, and many more NLP domains with transformers.

The book takes you through NLP with Python and examines various eminent models and datasets within the transformer architecture created by pioneers such as Google, Facebook, Microsoft, OpenAI, and Hugging Face.

The book trains you in three stages. The first stage introduces you to transformer architectures, starting with the original transformer, before moving on to RoBERTa, BERT, and DistilBERT models. You will discover training methods for smaller transformers that can outperform GPT-3 in some cases. In the second stage, you will apply transformers for Natural Language Understanding (NLU) and Natural Language Generation (NLG). Finally, the third stage will help you grasp advanced language understanding techniques such as optimizing social network datasets and fake news identification.

By the end of this NLP book, you will understand transformers from a cognitive science perspective and be proficient in applying pretrained transformer models by tech giants to various datasets.

Publication date:
January 2021


Fine-Tuning BERT Models

In Chapter 1, Getting Started with the Model Architecture of the Transformer, we defined the building blocks of the architecture of the original Transformer. Think of the original Transformer as a model built with LEGO® bricks. The construction set contains bricks such as encoders, decoders, embedding layers, positional encoding methods, multi-head attention layers, masked multi-head attention layers, post-layer normalization, feed-forward sub-layers, and linear output layers. The bricks come in various sizes and forms. You can spend hours building all sorts of models using the same building kit! Some constructions will only require some of the bricks. Other constructions will add a new piece, just like when we obtain additional bricks for a model built using LEGO® components.

BERT added a new piece to the Transformer building kit: a bidirectional multi-head attention sub-layer. When we humans are having problems understanding a sentence...


The architecture of BERT

BERT introduces bidirectional attention to transformer models. Bidirectional attention requires many other changes to the original Transformer model.

We will not go through the building blocks of transformers described in Chapter 1, Getting Started with the Model Architecture of the Transformer. You can consult Chapter 1 at any time to review an aspect of the building blocks of transformers. In this section, we will focus on the specific aspects of BERT models.

We will focus on the evolutions designed by Devlin et al. (2018), which describe the encoder stack.

We will first go through the encoder stack, then the preparation of the pretraining input environment. Then we will describe the two-step framework of BERT: pretraining and fine-tuning.

Let's first explore the encoder stack.

The encoder stack

The first building block we will take from the original Transformer model is an encoder layer. The encoder layer as described in Chapter...


Fine-tuning BERT

In this section, we will fine-tune a BERT model to predict the downstream task of Acceptability Judgements and measure the predictions with the Matthews Correlation Coefficient (MCC), which will be explained in the Evaluating using Matthews Correlation Coefficient section of this chapter.

Open BERT_Fine_Tuning_Sentence_Classification_DR.ipynb in Google Colab (make sure you have an email account). The notebook is in Chapter02 of the GitHub repository of this book.

The title of each cell in the notebook is also the same, or very close to the title of each subsection of this chapter.

Let's start making sure the GPU is activated.

Activating the GPU

Pretraining a multi-head attention transformer model requires the parallel processing GPUs can provide.

The program first starts by checking if the GPU is activated:

#@title Activating the GPU
# Main menu->Runtime->Change Runtime Type
import tensorflow as tf
device_name = tf.test.gpu_device_name...


BERT brings bidirectional attention to transformers. Predicting sequences from left to right and masking the future tokens to train a model has serious limitations. If the masked sequence contains the meaning we are looking for, the model will produce errors. BERT attends to all of the tokens of a sequence at the same time.

We explored the architecture of BERT, which only uses the encoder stack of transformers. BERT was designed as a two-step framework. The first step of the framework is to pretrain a model. The second step is to fine-tune the model. We built a fine-tuning BERT model for an Acceptability Judgement downstream task. The fine-tuning process went through all phases of the process. First, we loaded the dataset and loaded the necessary pretrained modules of the model. Then the model was trained, and its performance measured.

Fine-tuning a pretrained model takes fewer machine resources than training downstream tasks from scratch. Fine-tuned models can perform...



  1. BERT stands for Bidirectional Encoder Representations from Transformers. (True/False)
  2. BERT is a two-step framework. Step 1 is pretraining. Step 2 is fine-tuning. (True/False)
  3. Fine-tuning a BERT model implies training parameters from scratch. (True/False)
  4. BERT only pretrains using all downstream tasks. (True/False)
  5. BERT pretrains with Masked Language Modeling (MLM). (True/False)
  6. BERT pretrains with Next Sentence Predictions (NSP). (True/False)
  7. BERT pretrains mathematical functions. (True/False)
  8. A question-answer task is a downstream task. (True/False)
  9. A BERT pretraining model does not require tokenization. (True/False)
  10. Fine-tuning a BERT model takes less time than pretraining. (True/False)


About the Author
  • Denis Rothman

    Denis Rothman graduated from Sorbonne University and Paris-Diderot University, designing one of the very first word2matrix patented embedding and patented AI conversational agents. He began his career authoring one of the first AI cognitive natural language processing (NLP) chatbots applied as an automated language teacher for Moët et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an advanced planning and scheduling (APS) solution used worldwide.

    Browse publications by this author
Latest Reviews (1 reviews total)
No the best for me, I have paid the book Twice !!!
Recommended For You
Transformers for Natural Language Processing
Unlock this book and the full library FREE for 7 days
Start now