You're reading from Mastering Transformers

Product type: Book
Published in: Sep 2021
Publisher: Packt
ISBN-13: 9781801077651
Edition: 1st Edition

Authors (2):

Savaş Yıldırım
Savaş Yıldırım graduated from the Istanbul Technical University Department of Computer Engineering and holds a Ph.D. degree in Natural Language Processing (NLP). Currently, he is an associate professor at the Istanbul Bilgi University, Turkey, and is a visiting researcher at the Ryerson University, Canada. He is a proactive lecturer and researcher with more than 20 years of experience teaching courses on machine learning, deep learning, and NLP. He has significantly contributed to the Turkish NLP community by developing a lot of open source software and resources. He also provides comprehensive consultancy to AI companies on their R&D projects. In his spare time, he writes and directs short films, and enjoys practicing yoga.

Meysam Asgari-Chenaghlu

Meysam Asgari-Chenaghlu is an AI manager at Carbon Consulting and is also a Ph.D. candidate at the University of Tabriz. He has been a consultant for Turkey's leading telecommunication and banking companies. He has also worked on various projects, including natural language understanding and semantic search.

Chapter 3: Autoencoding Language Models

In the previous chapter, we looked at how a typical Transformer model can be used with HuggingFace's Transformers library. So far, all the topics have covered how to use pre-defined or pre-built models, with less attention given to specific models and their training.

In this chapter, we will learn how to train autoencoding language models on any given language from scratch. This training will include both pre-training and task-specific training of the models. First, we will start with basic knowledge of the BERT model and how it works. Then we will train a language model on a simple, small corpus. Afterward, we will look at how the model can be used inside any Keras model.

To give an overview of what will be learned in this chapter, we will discuss the following topics:

  • BERT – one of the autoencoding language models
  • Autoencoding language model training for any language
  • Sharing...

Technical requirements

The technical requirements for this chapter are as follows:

  • Anaconda
  • Transformers >= 4.0.0
  • PyTorch >= 1.0.2
  • TensorFlow >= 2.4.0
  • Datasets >= 1.4.1
  • Tokenizers

Please also check the corresponding GitHub code for Chapter 3:

https://github.com/PacktPublishing/Advanced-Natural-Language-Processing-with-Transformers/tree/main/CH03.

Check out the following link to see the Code in Action video: https://bit.ly/3i1ycdY

BERT – one of the autoencoding language models

Bidirectional Encoder Representations from Transformers, also known as BERT, was one of the first autoencoding language models to utilize the encoder Transformer stack, with slight modifications, for language modeling.

The BERT architecture is a multilayer Transformer encoder based on the original Transformer implementation. The Transformer model itself was originally designed for machine translation tasks, but the main improvement made by BERT is to utilize the encoder part of the architecture for better language modeling. After pretraining, this language model is able to provide a global understanding of the language it is trained on.

BERT language model pretraining tasks

To gain a clear understanding of the masked language modeling used by BERT, let's define it in more detail. Masked language modeling is the task of training a model on input (a sentence with some masked tokens) and obtaining the output as the whole...
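To make the masking step concrete, here is a minimal sketch of BERT's masking procedure in plain Python: roughly 15% of positions are selected as prediction targets, and of those, 80% are replaced with [MASK], 10% with a random vocabulary token, and 10% are left unchanged. This is an illustrative simplification, not the HuggingFace implementation; the sentence and vocabulary are made up for the example.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=42):
    """BERT-style masking: pick ~15% of positions as targets; replace
    80% of those with [MASK], 10% with a random vocab token, and leave
    10% unchanged. Returns (masked_tokens, labels), where labels hold
    the original token at target positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # the model must predict this
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"           # 80%: mask it
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # 10%: random replacement
            # else 10%: keep the original token
    return masked, labels

# Toy example with a made-up sentence and vocabulary
vocab = ["the", "movie", "was", "great", "terrible", "plot"]
tokens = ["the", "movie", "was", "great", "and", "the", "plot", "was", "good"]
masked, labels = mask_tokens(tokens, vocab)
```

During pretraining, the model receives `masked` as input and is trained to recover the original token at every position where `labels` is not None.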

Autoencoding language model training for any language

We have discussed how BERT works and that it is possible to use the pretrained version of it provided by the HuggingFace repository. In this section, you will learn how to use the HuggingFace library to train your own BERT.

Before we start, it is essential to have good training data to use for language modeling. This data is called the corpus, and it is normally a huge pile of data (sometimes preprocessed and cleaned). This unlabeled corpus must be appropriate for the use case you wish to train your language model for; for example, a specialized BERT for the English language. Although there are tons of huge, good datasets, such as Common Crawl (https://commoncrawl.org/), we would prefer a small one for faster training.

The IMDB dataset of 50K movie reviews (available at https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) is a large dataset...
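Before training, a dataset like this usually needs to be converted into a plain-text corpus file with one document per line, which is the format tokenizer and language-model training scripts typically expect. Here is a minimal stdlib-only sketch; the column name "review" is an assumption based on the Kaggle IMDB CSV layout, so adjust it to your file.

```python
import csv

def csv_to_corpus(csv_path, corpus_path, text_column="review"):
    """Extract the raw text column from a review CSV and write one
    review per line, producing a plain-text corpus file suitable for
    tokenizer/LM training. 'review' is an assumed column name."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(corpus_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            text = row[text_column].replace("\n", " ").strip()
            if text:                       # skip empty rows
                dst.write(text + "\n")
```

For example, `csv_to_corpus("IMDB Dataset.csv", "corpus.txt")` would produce a `corpus.txt` that can be passed directly to a tokenizer trainer's `files` argument.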

Sharing models with the community

HuggingFace provides a very easy-to-use model-sharing mechanism:

  1. You can simply use the following cli tool to log in:
    transformers-cli login
  2. After you've logged in using your own credentials, you can create a repository:
    transformers-cli repo create a-fancy-model-name
  3. You can put any model name in place of the a-fancy-model-name parameter. Then it is essential to make sure you have git-lfs installed:
    git lfs install

    Git LFS is a Git extension used for handling large files. HuggingFace pretrained models are usually large files that require extra libraries such as LFS to be handled by Git.

  4. Then you can clone the repository you have just created:
    git clone https://huggingface.co/username/a-fancy-model-name
  5. Afterward, you can add and remove from the repository as you like, and then, just like Git usage, you have to run the following command:
    git add . && git commit -m "Update from $USER"
    git push

Autoencoding models rely...

Understanding other autoencoding models

In this part, we will review autoencoding model alternatives that slightly modify the original BERT. These alternative re-implementations have achieved better downstream performance by exploiting many sources: optimizing the pre-training process and the number of layers or heads, improving data quality, designing better objective functions, and so forth. The sources of improvement roughly fall into two parts: better architectural design choices and pre-training control.

Many effective alternatives have been shared lately, so it is impossible to understand and explain them all here. We can take a look at some of the most cited models in the literature and the most used ones on NLP benchmarks. Let's start with ALBERT, a re-implementation of BERT that focuses especially on architectural design choices.

Introducing ALBERT

The performance of language models is considered to improve as their size gets bigger. However, training such models...
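One of ALBERT's main tricks for keeping model size in check is factorized embedding parameterization: instead of a single V x H embedding table (vocabulary size times hidden size), it learns a smaller V x E table plus an E x H projection, with E much smaller than H. A quick back-of-the-envelope calculation shows the saving; the numbers below (30K vocabulary, 768 hidden size, 128 embedding size) are illustrative defaults, not measurements from this book's experiments.

```python
V, H, E = 30000, 768, 128     # vocab size, hidden size, reduced embedding size

bert_style = V * H            # one big V x H embedding table
albert_style = V * E + E * H  # factorized: V x E table plus E x H projection

print(bert_style)    # 23,040,000 embedding parameters
print(albert_style)  # 3,938,304 embedding parameters, ~5.8x fewer
```

Combined with cross-layer parameter sharing, this is how ALBERT fits BERT-like capacity into far fewer parameters.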

Working with tokenization algorithms

In the opening part of the chapter, we trained the BERT model using a specific tokenizer, namely BertWordPieceTokenizer. Now it is worth discussing the tokenization process in detail. Tokenization is a way of splitting textual input into tokens and assigning an identifier to each token before feeding it to the neural network architecture. The most intuitive way is to split the sequence into smaller chunks on whitespace. However, such approaches do not meet the requirements of some languages, such as Japanese, and may also lead to huge-vocabulary problems. Almost all Transformer models leverage subword tokenization, not only to reduce dimensionality but also to encode rare (or unknown) words not seen in training. The tokenization relies on the idea that every word, including rare or unknown words, can be decomposed into meaningful smaller chunks that are widely seen symbols in the training corpus.

Some traditional tokenizers developed...

Summary

In this chapter, we have explored autoencoding models both theoretically and practically. Starting with basic knowledge about BERT, we trained it, as well as a corresponding tokenizer, from scratch. We also discussed how to use the trained model inside other frameworks, such as Keras. Besides BERT, we reviewed other autoencoding models. To avoid excessive code repetition, we did not provide the full implementation for training the other models. During the BERT training, we trained the WordPiece tokenization algorithm. In the last part, we examined other tokenization algorithms, since they are all worth discussing and understanding.

Autoencoding models use the left encoder side of the original Transformer and are mostly fine-tuned for classification problems. In the next chapter, we will discuss and learn about the right decoder part of the Transformer to implement language generation models.
