From NLP to Task-Agnostic Transformer Models

Up to now, we have examined variations of the original Transformer model with encoder and decoder layers, and we explored other models with encoder-only or decoder-only stacks of layers. Also, the size of the layers and the number of parameters have increased. However, the fundamental architecture of the transformer retains its original structure, with identical layers and parallel computation of the attention heads.

In this chapter, we will explore innovative transformer models that respect the basic structure of the original Transformer but make some significant changes. Scores of transformer models will appear, like the many possibilities a box of LEGO© pieces gives. You can assemble those pieces in hundreds of ways! Transformer model sublayers and layers are the LEGO© pieces of advanced AI.

We will begin by asking which transformer model to choose among the many on offer, and which ecosystem to implement it in.

...

Choosing a model and an ecosystem

You might think that testing transformer models by downloading them requires considerable machine and human resources. You might also think that if a platform does not offer an online sandbox by now, going further is a risk because of the work needed just to test a few examples.

However, sites such as Hugging Face download pretrained models automatically in real time, as we will see in The Reformer and DeBERTa sections. Thanks to that, we can run Hugging Face models in Google Colab without installing anything on our own machines. We can also test Hugging Face models online.

The idea is to analyze transformer models without having to “install” anything. “Nothing to install” in 2022 can mean:

  • Running a transformer task online
  • Running a transformer on a preinstalled Google Colaboratory VM that seamlessly downloads a pretrained model for a task, which we can run in a few lines of code (see the sketch after this list)
  • Running...
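
As a quick illustration of the second option, here is a minimal sketch of running a Hugging Face pipeline on a Colab VM. The task and the default checkpoint the pipeline downloads are illustrative choices, not models recommended in this chapter:

```python
# Minimal sketch: on a Google Colab VM, the Hugging Face transformers library
# downloads a default pretrained checkpoint the first time the pipeline runs.
# The task chosen here is illustrative only.
# !pip install transformers   # uncomment on a VM where it is not preinstalled
from transformers import pipeline

classifier = pipeline("sentiment-analysis")   # downloads a default model
print(classifier("Transformers make prototyping NLP tasks effortless."))
```

The first run downloads the checkpoint to the VM's cache; subsequent calls reuse it, so nothing has to be installed or managed on our own machine.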

The Reformer

Kitaev et al. (2020) designed the Reformer to solve attention and memory issues, adding functionality to the original Transformer model.

The Reformer first solves the attention issue with Locality Sensitivity Hashing (LSH) buckets and chunking.

LSH searches for nearest neighbors in a dataset. The hash function is designed so that if datapoint q is close to p, then hash(q) == hash(p) with high probability. In this case, the data points are the keys of the transformer model’s attention heads.

The LSH function converts the keys into LSH buckets (B1 to B4 in Figure 15.3) in a process called LSH bucketing, just as we would sort similar objects into the same buckets.

The sorted buckets are split into chunks (C1 to C4 in Figure 15.3) to parallelize the computation. Finally, attention is only applied within the same bucket, in its own chunk and the previous chunk:

Figure 15.3: LSH attention heads

LSH bucketing and chunking considerably reduce the complexity...
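
To make the bucketing idea concrete, here is a toy numpy sketch of angular LSH of the kind the Reformer uses: keys are projected with a random rotation, and keys pointing in similar directions land in the same bucket. The dimensions and bucket count are arbitrary values chosen for the example, not the Reformer’s settings:

```python
# Toy angular LSH bucketing, in the spirit of the Reformer (illustrative only).
import numpy as np

def lsh_bucket(keys, projections):
    # Project the keys with a random rotation and pick the closest direction;
    # keys with similar directions tend to receive the same bucket id.
    rotated = keys @ projections                      # (n_keys, n_buckets // 2)
    rotated = np.concatenate([rotated, -rotated], axis=-1)
    return np.argmax(rotated, axis=-1)                # one bucket id per key

rng = np.random.default_rng(0)
d_head, n_buckets = 64, 8                             # arbitrary toy sizes
keys = rng.normal(size=(16, d_head))                  # stand-ins for attention keys
projections = rng.normal(size=(d_head, n_buckets // 2))

buckets = lsh_bucket(keys, projections)
print(buckets)   # nearby keys share buckets; attention is restricted per bucket
```

Restricting attention to keys that share a bucket is what replaces the full quadratic comparison of every query with every key.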

DeBERTa

Another new approach to transformers can be found through disentanglement. Disentanglement in AI allows you to separate the representation features to make the training process more flexible. Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen designed DeBERTa, a disentangled version of a transformer, and described the model in an interesting article: DeBERTa: Decoding-enhanced BERT with Disentangled Attention: https://arxiv.org/abs/2006.03654

The two main ideas implemented in DeBERTa are:

  • Disentangle the content and position in the transformer model to train the two vectors separately (sketched in code after this list)
  • Use an absolute position in the decoder to predict masked tokens in the pretraining process
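
To illustrate the first idea, the following toy numpy sketch computes disentangled attention scores from separate content and relative-position vectors, in the spirit of the paper. It simplifies the relative-distance handling and is not the authors’ implementation:

```python
# Toy sketch of DeBERTa-style disentangled attention scores (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8                                    # arbitrary toy sizes

content = rng.normal(size=(seq_len, d))              # content vectors per token
rel_pos = rng.normal(size=(2 * seq_len, d))          # relative-position embeddings

Wq_c, Wk_c, Wq_r, Wk_r = rng.normal(size=(4, d, d))  # separate projections
Qc, Kc = content @ Wq_c, content @ Wk_c              # content queries/keys
Qr, Kr = rel_pos @ Wq_r, rel_pos @ Wk_r              # position queries/keys

scores = np.zeros((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        d_ij = np.clip(i - j + seq_len, 0, 2 * seq_len - 1)   # relative distance
        d_ji = np.clip(j - i + seq_len, 0, 2 * seq_len - 1)
        scores[i, j] = (Qc[i] @ Kc[j]         # content-to-content
                        + Qc[i] @ Kr[d_ij]    # content-to-position
                        + Kc[j] @ Qr[d_ji])   # position-to-content
scores /= np.sqrt(3 * d)                      # scaling used in the paper
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights.round(2))
```

The point of the disentanglement is visible in the three terms: content and position contribute to the attention score through separate vectors and separate projection matrices.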

The authors provide the code on GitHub: https://github.com/microsoft/DeBERTa

DeBERTa exceeds the human baseline on the SuperGLUE leaderboard:

Figure 15.5: DeBERTa on the SuperGLUE leaderboard

Let’s run an example on Hugging...
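
As a minimal sketch of such an example, the snippet below loads a public DeBERTa checkpoint from the Hugging Face Hub and returns its hidden states. The microsoft/deberta-base model name is an assumption here; the chapter’s own example may use a different checkpoint or task:

```python
# Minimal sketch: load a DeBERTa checkpoint from the Hugging Face Hub.
# The checkpoint name is an assumption; the chapter's example may differ.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-base")

inputs = tokenizer("DeBERTa disentangles content and position.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (batch, sequence length, hidden size)
```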

From Task-Agnostic Models to Vision Transformers

Foundation models, as we saw in Chapter 1, What Are Transformers?, have two distinct and unique properties:

  • Emergence – Transformer models that qualify as foundation models can perform tasks they were not trained for. They are large models trained on supercomputers. They are not trained to learn specific tasks like many other models. Foundation models learn how to understand sequences.
  • Homogenization – The same model can be used across many domains with the same fundamental architecture. Foundation models can learn new skills through data faster and better than any other model.

GPT-3 and Google BERT (only the BERT models trained by Google) are task-agnostic foundation models. These task-agnostic models lead directly to ViT, CLIP, and DALL-E models. Transformers have uncanny sequence analysis abilities.

The level of abstraction of transformer models leads to multi-modal neurons:

    ...

An expanding universe of models

New transformer models, like new smartphones, emerge nearly every week. Some of these models are both mind-blowing and challenging for a project manager:

  • ERNIE is a continual pretraining framework that produces impressive results for language understanding.

    Paper: https://arxiv.org/abs/1907.12412

    Challenges: Hugging Face provides a model. Is it a full-blown model? Is it the one Baidu trained to exceed human baselines on the SuperGLUE Leaderboard (December 2021): https://super.gluebenchmark.com/leaderboard? Do we have access to the best one or just a toy model? What is the purpose of running AutoML on such small versions of models? Will we gain access to it on the Baidu platform or a similar one? How much will it cost?

  • SWITCH: A trillion-parameter model optimized with sparse modeling.

    Paper: https://arxiv.org/abs/2101.03961

    Challenges: The paper is fantastic. Where is the model? Will we ever have access...

Summary

New transformer models keep appearing on the market. Therefore, it is good practice to keep up with cutting-edge research by reading publications and books and testing some systems.

This leads us to assess which transformer models to choose and how to implement them. We cannot spend months exploring every model that appears on the market. We cannot change models every month if a project is in production. Industry 4.0 is moving to seamless API ecosystems.

Learning all the models is impossible. However, understanding a new model quickly can be achieved by deepening our knowledge of transformer models.

The basic structure of transformer models remains unchanged. The layers of the encoder and/or decoder stacks remain identical. The attention heads can be parallelized to optimize computation speed.

The Reformer model applies LSH buckets and chunking. It also recomputes each layer’s input instead of storing the information, thus reducing memory usage. However...
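
The memory point can be illustrated with a toy sketch of a reversible residual block, the mechanism the Reformer relies on: a block’s inputs can be reconstructed from its outputs, so activations do not have to be stored during training. F and G below are stand-ins for the attention and feed-forward sublayers, not the Reformer’s actual code:

```python
# Toy sketch of a reversible residual block (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
W_f, W_g = rng.normal(size=(2, 8, 8))
F = lambda x: np.tanh(x @ W_f)        # stand-in for the attention sublayer
G = lambda x: np.tanh(x @ W_g)        # stand-in for the feed-forward sublayer

x1, x2 = rng.normal(size=(2, 4, 8))   # the layer input, split into two halves

# Forward pass
y1 = x1 + F(x2)
y2 = x2 + G(y1)

# Backward pass: recover the inputs from the outputs instead of storing them
x2_rec = y2 - G(y1)
x1_rec = y1 - F(x2_rec)
print(np.allclose(x1, x1_rec), np.allclose(x2, x2_rec))   # True True
```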

Questions

  1. Reformer transformer models don’t contain encoders. (True/False)
  2. Reformer transformer models don’t contain decoders. (True/False)
  3. The inputs are stored layer by layer in Reformer models. (True/False)
  4. DeBERTa transformer models disentangle content and positions. (True/False)
  5. It is necessary to test the hundreds of pretrained transformer models before choosing one for a project. (True/False)
  6. The latest transformer model is always the best. (True/False)
  7. It is better to have one transformer model per NLP task than one multi-task transformer model. (True/False)
  8. A transformer model always needs to be fine-tuned. (True/False)
  9. OpenAI GPT-3 engines can perform a wide range of NLP tasks without fine-tuning. (True/False)
  10. It is always better to implement an AI algorithm on a local server. (True/False)

References
