Fine-Tuning BERT Models
In Chapter 1, Getting Started with the Model Architecture of the Transformer, we defined the building blocks of the architecture of the original Transformer. Think of the original Transformer as a model built with LEGO® bricks. The construction set contains bricks such as encoders, decoders, embedding layers, positional encoding methods, multi-head attention layers, masked multi-head attention layers, post-layer normalization, feed-forward sub-layers, and linear output layers. The bricks come in various sizes and forms. You can spend hours building all sorts of models using the same building kit! Some constructions will only require some of the bricks. Other constructions will add a new piece, just like when we obtain additional bricks for a model built using LEGO® components.
BERT added a new piece to the Transformer building kit: a bidirectional multi-head attention sub-layer. When we humans are having problems understanding a sentence...
The architecture of BERT
BERT introduces bidirectional attention to transformer models. Bidirectional attention requires many other changes to the original Transformer model.
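The difference between the masked attention of the original Transformer's decoder and bidirectional attention can be pictured as a visibility matrix. The following is a minimal sketch under stated assumptions, not BERT's actual implementation: the function names `causal_mask` and `bidirectional_mask` and the sample sentence are hypothetical and chosen only for illustration. A 1 in row i, column j means that position i may attend to position j.

```python
def causal_mask(n):
    """Left-to-right visibility: position i sees positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional visibility: every position sees every position."""
    return [[1] * n for _ in range(n)]


# Hypothetical sample sentence, used only to label the rows.
tokens = ["the", "cat", "sat", "down"]

print("Masked (left-to-right) attention:")
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"{token:>5} {row}")

print("Bidirectional attention:")
for token, row in zip(tokens, bidirectional_mask(len(tokens))):
    print(f"{token:>5} {row}")
```

In the first matrix, "the" cannot see any of the tokens that follow it, whereas in the second matrix every token can use context from both its left and its right.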
We will not go through the building blocks of transformers described in Chapter 1, Getting Started with the Model Architecture of the Transformer. You can consult Chapter 1 at any time to review an aspect of the building blocks of transformers. In this section, we will focus on the specific aspects of BERT models.
We will focus on the changes introduced by Devlin et al. (2018), who describe the encoder stack.
We will first go through the encoder stack, then the preparation of the pretraining input environment. Then we will describe the two-step framework of BERT: pretraining and fine-tuning.
Let's first explore the encoder stack.
The encoder stack
The first building block we will take from the original Transformer model is an encoder layer. The encoder layer as described in Chapter...
Fine-tuning BERT
In this section, we will fine-tune a BERT model on the downstream task of Acceptability Judgements and measure its predictions with the Matthews Correlation Coefficient (MCC), which is explained in the Evaluating using Matthews Correlation Coefficient section of this chapter.
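As a quick preview of the metric, MCC for binary labels is computed from the four cells of the confusion matrix: (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The following is a minimal sketch in plain Python, not the code used in the notebook. The helper name `mcc_binary` and the sample labels are hypothetical, and returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc_binary(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical labels: 1 = acceptable sentence, 0 = unacceptable sentence.
labels = [1, 1, 1, 0, 0, 1]
predictions = [1, 1, 0, 0, 1, 1]
print(mcc_binary(labels, predictions))  # 0.25
```

The value ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right).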
Open BERT_Fine_Tuning_Sentence_Classification_DR.ipynb in Google Colab (make sure you have an email account). The notebook is in Chapter02 of the GitHub repository of this book.
The title of each cell in the notebook is the same as, or very close to, the title of the corresponding subsection of this chapter.
Let's start by making sure the GPU is activated.
Activating the GPU
Pretraining a multi-head attention transformer model requires the parallel processing GPUs can provide.
The program starts by checking whether the GPU is activated:
#@title Activating the GPU
# Main menu->Runtime->Change Runtime Type
import tensorflow as tf
device_name = tf.test.gpu_device_name...
Summary
BERT brings bidirectional attention to transformers. Predicting sequences from left to right and masking the future tokens to train a model has serious limitations. If the masked sequence contains the meaning we are looking for, the model will produce errors. BERT attends to all of the tokens of a sequence at the same time.
We explored the architecture of BERT, which only uses the encoder stack of transformers. BERT was designed as a two-step framework. The first step of the framework is to pretrain a model. The second step is to fine-tune the model. We fine-tuned a BERT model for an Acceptability Judgement downstream task and went through every phase of the fine-tuning process. First, we loaded the dataset and the necessary pretrained modules of the model. Then the model was trained, and its performance was measured.
Fine-tuning a pretrained model takes fewer machine resources than training downstream tasks from scratch. Fine-tuned models can perform...
Questions
- BERT stands for Bidirectional Encoder Representations from Transformers. (True/False)
- BERT is a two-step framework. Step 1 is pretraining. Step 2 is fine-tuning. (True/False)
- Fine-tuning a BERT model implies training parameters from scratch. (True/False)
- BERT only pretrains using all downstream tasks. (True/False)
- BERT pretrains with Masked Language Modeling (MLM). (True/False)
- BERT pretrains with Next Sentence Predictions (NSP). (True/False)
- BERT pretrains mathematical functions. (True/False)
- A question-answer task is a downstream task. (True/False)
- A BERT pretraining model does not require tokenization. (True/False)
- Fine-tuning a BERT model takes less time than pretraining. (True/False)
References
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, Attention Is All You Need: https://arxiv.org/abs/1706.03762
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2018, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: https://arxiv.org/abs/1810.04805
- Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman, 2018, Neural Network Acceptability Judgments: https://arxiv.org/abs/1805.12471
- The Corpus of Linguistic Acceptability (CoLA): https://nyu-mll.github.io/CoLA/
- Documentation on Hugging Face models: https://huggingface.co/transformers/pretrained_models.html, https://huggingface.co/transformers/model_doc/bert.html, https://huggingface.co/transformers/model_doc/roberta.html, https://huggingface.co/transformers/model_doc/distilbert.html