Mastering Transformers

You're reading from Mastering Transformers

Product type: Book
Published in: Sep 2021
Publisher: Packt
ISBN-13: 9781801077651
Pages: 374
Edition: 1st
Authors (2): Savaş Yıldırım, Meysam Asgari-Chenaghlu

Table of Contents (16 chapters)

Preface
1. Section 1: Introduction – Recent Developments in the Field, Installations, and Hello World Applications
2. Chapter 1: From Bag-of-Words to the Transformer
3. Chapter 2: A Hands-On Introduction to the Subject
4. Section 2: Transformer Models – From Autoencoding to Autoregressive Models
5. Chapter 3: Autoencoding Language Models
6. Chapter 4: Autoregressive and Other Language Models
7. Chapter 5: Fine-Tuning Language Models for Text Classification
8. Chapter 6: Fine-Tuning Language Models for Token Classification
9. Chapter 7: Text Representation
10. Section 3: Advanced Topics
11. Chapter 8: Working with Efficient Transformers
12. Chapter 9: Cross-Lingual and Multilingual Language Modeling
13. Chapter 10: Serving Transformer Models
14. Chapter 11: Attention Visualization and Experiment Tracking
15. Other Books You May Enjoy

Chapter 3: Autoencoding Language Models

In the previous chapter, we looked at how a typical Transformer model can be used with HuggingFace's Transformers library. So far, the focus has been on using pre-defined or pre-built models, and little has been said about specific models and how they are trained.

In this chapter, we will learn how to train autoencoding language models from scratch on any given language. This training will include both pre-training and task-specific training of the models. First, we will cover the basics of the BERT model and how it works. Then we will train a language model on a simple, small corpus. Afterward, we will look at how the model can be used inside any Keras model.

We will cover the following topics in this chapter:

  • BERT – one of the autoencoding language models
  • Autoencoding language model training for any language
  • Sharing...

Technical requirements

The technical requirements for this chapter are as follows:

  • Anaconda
  • Transformers >= 4.0.0
  • PyTorch >= 1.0.2
  • TensorFlow >= 2.4.0
  • Datasets >= 1.4.1
  • Tokenizers

Please also check the corresponding GitHub code for Chapter 3:

https://github.com/PacktPublishing/Advanced-Natural-Language-Processing-with-Transformers/tree/main/CH03.

Check out the following link to see the Code in Action video: https://bit.ly/3i1ycdY

BERT – one of the autoencoding language models

Bidirectional Encoder Representations from Transformers, also known as BERT, was one of the first autoencoding language models to utilize the encoder Transformer stack with slight modifications for language modeling.

The BERT architecture is a multilayer Transformer encoder based on the original Transformer implementation. The Transformer model itself was originally designed for machine translation tasks, but BERT's main improvement is to use this part of the architecture for better language modeling. After pretraining, this language model is able to provide a global understanding of the language it is trained on.
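
As an illustration (this sketch is our own, not the book's official code), the following loads the publicly available bert-base-uncased checkpoint and inspects its configuration, showing that the model is nothing more than an embedding layer plus a stack of Transformer encoder layers:

    # Sketch: inspect the encoder stack of a pretrained BERT checkpoint.
    # "bert-base-uncased" is a common public checkpoint used here only as an example.
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")

    print(model.config.num_hidden_layers)    # 12 encoder layers in the base model
    print(model.config.num_attention_heads)  # 12 attention heads per layer
    print(model.config.hidden_size)          # 768-dimensional hidden states

    # The model is just embeddings followed by a stack of encoder layers:
    print(model.embeddings)
    print(len(model.encoder.layer))          # 12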

BERT language model pretraining tasks

To have a clear understanding of the masked language modeling used by BERT, let's define it in more detail. Masked language modeling is the task of training a model on an input (a sentence with some masked tokens) and obtaining the output as the whole...
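
As a quick illustration of what a masked language model does at inference time (the fill-mask pipeline and the bert-base-uncased checkpoint used below are our own choice of example, not necessarily the book's), consider the following:

    # Sketch: BERT predicts the most likely fillers for the masked token.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    for prediction in fill_mask("The capital of France is [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))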

Autoencoding language model training for any language

We have discussed how BERT works and that it is possible to use the pretrained versions of it provided on the HuggingFace Hub. In this section, you will learn how to use the HuggingFace library to train your own BERT.

Before we start, it is essential to have good training data to be used for the language modeling. This data is called the corpus, and it is normally a huge pile of data (sometimes preprocessed and cleaned). This unlabeled corpus must be appropriate for the use case you want to train your language model for; for example, a specialized BERT for the English language. Although there are tons of huge, good datasets, such as Common Crawl (https://commoncrawl.org/), we prefer a small one here for faster training.

The IMDB dataset of 50K movie reviews (available at https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) is a large dataset...
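
To make this concrete, the following is a hedged end-to-end sketch of pretraining a small BERT on the IMDB reviews with the masked language modeling objective. The file names ("IMDB Dataset.csv", "imdb_corpus.txt") and all hyperparameters are illustrative assumptions, and for brevity it reuses an existing WordPiece tokenizer instead of training one (tokenizer training is discussed later in the chapter):

    # Sketch only: corpus preparation and MLM pretraining of a small, fresh BERT.
    import pandas as pd
    from datasets import load_dataset
    from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    # 1. Dump the review texts into a plain-text corpus, one review per line.
    reviews = pd.read_csv("IMDB Dataset.csv")["review"]          # assumed local path
    with open("imdb_corpus.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(reviews.tolist()))

    # 2. Reuse a pretrained WordPiece tokenizer for brevity.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    dataset = load_dataset("text", data_files={"train": "imdb_corpus.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    # 3. A deliberately small BERT configuration so it trains quickly.
    config = BertConfig(vocab_size=tokenizer.vocab_size,
                        hidden_size=256, num_hidden_layers=4, num_attention_heads=4)
    model = BertForMaskedLM(config)

    # 4. The collator randomly masks 15% of the tokens for the MLM objective.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="tiny-bert-imdb", num_train_epochs=1,
                               per_device_train_batch_size=32),
        data_collator=collator,
        train_dataset=dataset)
    trainer.train()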

Sharing models with the community

HuggingFace provides a very easy-to-use model-sharing mechanism:

  1. You can simply use the following CLI tool to log in:
    transformers-cli login
  2. After you've logged in using your own credentials, you can create a repository:
    transformers-cli repo create a-fancy-model-name
  3. You can use any model name in place of the a-fancy-model-name parameter. It is then essential to make sure you have git-lfs installed:
    git lfs install

    Git LFS is a Git extension used for handling large files. HuggingFace pretrained models are usually large files that require extensions such as Git LFS to be handled by Git.

  4. Then you can clone the repository you have just created:
    git clone https://huggingface.co/username/a-fancy-model-name
  5. Afterward, you can add files to and remove them from the repository as you like and then, just as with normal Git usage, run the following commands:
    git add . && git commit -m "Update from $USER"
    git push
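
If your transformers version is newer than the 4.0.0 minimum listed in the technical requirements, the same result can also be achieved programmatically with the push_to_hub() helper. The following is only a sketch; the local path and the repository name are placeholders:

    # Sketch: push a saved model and tokenizer to the Hub without the CLI/git steps.
    from transformers import BertForMaskedLM, BertTokenizerFast

    model = BertForMaskedLM.from_pretrained("path/to/your/saved/model")       # placeholder path
    tokenizer = BertTokenizerFast.from_pretrained("path/to/your/saved/model")

    model.push_to_hub("a-fancy-model-name")      # uploads weights and config
    tokenizer.push_to_hub("a-fancy-model-name")  # uploads tokenizer files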

Autoencoding models rely...

Understanding other autoencoding models

In this part, we will review autoencoding model alternatives that slightly modify the original BERT. These alternative re-implementations have achieved better downstream task performance by exploiting many avenues: optimizing the pre-training process and the number of layers or heads, improving data quality, designing better objective functions, and so forth. The improvements roughly fall into two areas: better architectural design choices and pre-training control.

Many effective alternatives have been released lately, so it is impossible to understand and explain them all here. Instead, we will take a look at some of the most cited models in the literature and the most widely used ones on NLP benchmarks. Let's start with ALBERT, a re-implementation of BERT that focuses especially on architectural design choices.

Introducing ALBERT

The performance of language models is generally considered to improve as their size gets bigger. However, training such models...
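
As a rough, hedged illustration of the size difference (the comparison below is our own example using public checkpoints), you can compare parameter counts directly; ALBERT's factorized embeddings and cross-layer parameter sharing make its base model an order of magnitude smaller than BERT's:

    # Sketch: compare parameter counts of public ALBERT and BERT base checkpoints.
    from transformers import AlbertModel, BertModel

    albert = AlbertModel.from_pretrained("albert-base-v2")
    bert = BertModel.from_pretrained("bert-base-uncased")

    def count_params(model):
        return sum(p.numel() for p in model.parameters())

    print(f"ALBERT base: {count_params(albert) / 1e6:.1f}M parameters")  # roughly 12M
    print(f"BERT base:   {count_params(bert) / 1e6:.1f}M parameters")    # roughly 110M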

Working with tokenization algorithms

In the opening part of the chapter, we trained the BERT model using a specific tokenizer, namely BertWordPieceTokenizer. Now it is worth discussing the tokenization process in detail. Tokenization is a way of splitting textual input into tokens and assigning an identifier to each token before feeding it to the neural network architecture. The most intuitive way is to split the sequence into smaller chunks on whitespace. However, such approaches do not meet the requirements of some languages, such as Japanese, and may also lead to huge-vocabulary problems. Almost all Transformer models leverage subword tokenization, not only to reduce dimensionality but also to encode rare (or unknown) words not seen in training. Subword tokenization relies on the idea that every word, including rare or unknown words, can be decomposed into meaningful smaller chunks that are frequently seen symbols in the training corpus.
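
As a brief sketch of subword tokenization in practice (the corpus file name below is an assumption), you can train a WordPiece tokenizer with the tokenizers library listed in the technical requirements and observe how words are decomposed into frequent subword pieces:

    # Sketch: train a WordPiece tokenizer on a plain-text corpus and encode a sentence.
    from tokenizers import BertWordPieceTokenizer

    tokenizer = BertWordPieceTokenizer(lowercase=True)
    tokenizer.train(files=["imdb_corpus.txt"], vocab_size=30_000, min_frequency=2)

    encoding = tokenizer.encode("Tokenization handles unseen words gracefully.")
    print(encoding.tokens)  # rare words appear as subword pieces prefixed with ##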

Some traditional tokenizers developed...

Summary

In this chapter, we have explored autoencoding models both theoretically and practically. Starting with basic knowledge about BERT, we trained it, as well as a corresponding tokenizer, from scratch. We also discussed how to use the model inside other frameworks, such as Keras. Besides BERT, we reviewed other autoencoding models. To avoid excessive code repetition, we did not provide the full implementation for training the other models. During the BERT training, we trained the WordPiece tokenization algorithm. In the last part, we examined other tokenization algorithms, since they are all worth discussing and understanding.

Autoencoding models use the left encoder side of the original Transformer and are mostly fine-tuned for classification problems. In the next chapter, we will discuss and learn about the right decoder part of the Transformer in order to implement language generation models.
