
Transformers

Transformer-based architectures have become almost universal in Natural Language Processing (NLP), and beyond, for solving a wide variety of tasks, such as:

  • Neural machine translation
  • Text summarization
  • Text generation
  • Named entity recognition
  • Question answering
  • Text classification
  • Text similarity
  • Offensive message/profanity detection
  • Query understanding
  • Language modeling
  • Next-sentence prediction
  • Reading comprehension
  • Sentiment analysis
  • Paraphrasing

and a lot more.

In the less than four years since the Attention Is All You Need paper was published by Google Research in 2017, transformers have taken the NLP community by storm, breaking records that had stood for the previous thirty years.

Transformer-based models use so-called attention mechanisms, which identify complex relationships between the words in each input sequence, such as a sentence. Attention...
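
To make the idea concrete, here is a minimal sketch of scaled dot-product attention, the form of attention used in the original paper, written in plain TensorFlow. The shapes and names below are our own illustration, not the book's or the paper's reference code:

```python
# A minimal sketch of scaled dot-product attention in TensorFlow.
import tensorflow as tf

def scaled_dot_product_attention(query, key, value):
    # query, key, value: (batch, seq_len, d_model)
    d_k = tf.cast(tf.shape(key)[-1], tf.float32)
    # Similarity of every query with every key, scaled to stabilize gradients.
    scores = tf.matmul(query, key, transpose_b=True) / tf.sqrt(d_k)
    # Normalize the scores into attention weights.
    weights = tf.nn.softmax(scores, axis=-1)
    # Each output position is a weighted sum of the values.
    return tf.matmul(weights, value), weights

# Toy usage: one sentence of 4 tokens with 8-dimensional embeddings.
x = tf.random.normal((1, 4, 8))
out, w = scaled_dot_product_attention(x, x, x)  # self-attention
print(out.shape, w.shape)  # (1, 4, 8) (1, 4, 4)
```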

Architecture

Even though a typical transformer architecture is different from that of recurrent networks, it is based on several key ideas that originated in RNNs. At the time of writing this book, the transformer represents the next evolutionary step of deep learning architectures related to texts and any data that can be represented as sequences, and as such, it should be an essential part of your toolbox.

The original transformer architecture is a variant of the encoder-decoder architecture, where the recurrent layers are replaced with (self-)attention layers. The transformer was initially proposed by Google in the seminal paper titled Attention Is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, 2017, https://arxiv.org/abs/1706.03762, for which a reference implementation was provided; we will refer to it throughout this discussion.

The architecture is an instance of the...
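
Although the full walkthrough is beyond this excerpt, the essential encoder block can be sketched with standard Keras layers. The snippet below is an illustrative sketch with arbitrary hyperparameters, not the paper's reference implementation:

```python
# One transformer encoder block built from standard Keras layers.
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(d_model=128, num_heads=4, dff=512, dropout=0.1):
    inputs = layers.Input(shape=(None, d_model))
    # Multi-head self-attention replaces the recurrent layer.
    attn = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads)(inputs, inputs)
    attn = layers.Dropout(dropout)(attn)
    x = layers.LayerNormalization(epsilon=1e-6)(inputs + attn)   # residual + norm
    # Position-wise feed-forward network.
    ffn = layers.Dense(dff, activation="relu")(x)
    ffn = layers.Dense(d_model)(ffn)
    ffn = layers.Dropout(dropout)(ffn)
    outputs = layers.LayerNormalization(epsilon=1e-6)(x + ffn)   # residual + norm
    return tf.keras.Model(inputs, outputs)

block = encoder_block()
print(block(tf.random.normal((2, 10, 128))).shape)  # (2, 10, 128)
```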

Transformers’ architectures

In this section, we provide a high-level overview of both the most important architectures used by transformers and the different ways to compute attention.

Categories of transformers

In this section, we classify transformers into different categories. The following paragraphs introduce the most common ones.

Decoder or autoregressive

A typical example is a GPT (Generative Pre-trained Transformer) model, which you can learn more about in the GPT-2 and GPT-3 sections later in this chapter, or at https://openai.com/blog/language-unsupervised. Autoregressive models use only the decoder of the original transformer model, with attention heads that can only see what comes before a given position in the text, and not after, thanks to a masking mechanism applied to the full sentence. Autoregressive models are pretrained to guess the next token after observing all the previous ones. Typically, autoregressive models are used for Natural...
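
The masking mentioned above is usually implemented as a causal (look-ahead) mask. Here is a small sketch of what it looks like; the use_causal_mask flag assumes a recent TensorFlow release (2.10 or later):

```python
# Causal mask: position i may only attend to positions <= i.
import tensorflow as tf

def causal_mask(seq_len):
    # Lower-triangular matrix of ones: 1 means "allowed to attend".
    return tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

print(causal_mask(4).numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]

# With Keras, the same effect can be requested directly (TF >= 2.10).
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
x = tf.random.normal((1, 4, 32))
out = mha(x, x, use_causal_mask=True)
```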

Pretraining

As you have learned earlier, the original transformer had an encoder-decoder architecture. However, the research community understood that there are situations where it is beneficial to have only the encoder, or only the decoder, or both.

Encoder pretraining

As discussed, these models are also called auto-encoding models, and they use only the encoder during pretraining. Pretraining is carried out by masking words in the input sequence and training the model to reconstruct the original sequence. Typically, the encoder can access all the input words. Encoder-only models are generally used for classification.
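
As an illustration of this masking idea, the sketch below replaces a random subset of token IDs with a hypothetical [MASK] ID and keeps the originals as reconstruction targets; the IDs and the 15% rate are only assumptions for the example:

```python
# Rough sketch of the masking step used in encoder (BERT-style) pretraining.
import tensorflow as tf

MASK_ID = 103          # hypothetical id of the [MASK] token
MASK_PROB = 0.15       # commonly used masking rate

def mask_tokens(token_ids):
    mask = tf.random.uniform(tf.shape(token_ids)) < MASK_PROB
    # Replace selected positions with the mask id.
    masked_inputs = tf.where(mask, tf.fill(tf.shape(token_ids), MASK_ID), token_ids)
    # Labels are the original ids; unmasked positions are ignored in the loss (-100).
    labels = tf.where(mask, token_ids, tf.fill(tf.shape(token_ids), -100))
    return masked_inputs, labels

ids = tf.constant([[7, 42, 13, 99, 5]])
print(mask_tokens(ids))
```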

Decoder pretraining

Decoder models are referred to as autoregressive. During pretraining, the decoder is optimized to predict the next word. In particular, the decoder can only access the words positioned before a given word in the sequence. Decoder-only models are generally used for text generation.
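
A tiny sketch makes this input/target relationship explicit; the token IDs are made up:

```python
# Decoder (GPT-style) pretraining: the target is the input shifted one step left,
# so at every position the model predicts the next token.
import tensorflow as tf

tokens = tf.constant([[11, 27, 3, 8, 95, 2]])  # one toy sequence of token ids
inputs = tokens[:, :-1]    # what the decoder sees
targets = tokens[:, 1:]    # what it must predict at each position
print(inputs.numpy())   # [[11 27  3  8 95]]
print(targets.numpy())  # [[27  3  8 95  2]]
```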

Encoder-decoder pretraining

In this case, the model...

Implementation

In this section, we will go through a few tasks using transformers.

Transformer reference implementation: An example of translation

In this section, we will briefly review the transformer reference implementation available at https://www.tensorflow.org/text/tutorials/transformer and, specifically, we will take the opportunity to run the code in Google Colab.

Not everyone realizes how many GPUs it takes to train a transformer. Luckily, you can experiment with the resources available for free at https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/transformer.ipynb.
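
As a quick orientation, the tutorial trains on a Portuguese-to-English translation dataset from TensorFlow Datasets. A minimal sketch of loading it is shown below; the dataset name reflects the tutorial at the time of writing and may change between tutorial versions:

```python
# Load the Portuguese-to-English TED Talks translation dataset.
import tensorflow_datasets as tfds

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True, as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

# Inspect one (source, target) sentence pair.
for pt, en in train_examples.take(1):
    print(pt.numpy().decode('utf-8'))
    print(en.numpy().decode('utf-8'))
```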

Note that implementing transformers from scratch is probably not the best choice unless you need some very specific customization or you are interested in core research. If you are not interested in learning the internals, you can skip to the next section. The tutorial is licensed under the Creative Commons Attribution 4.0 License, and code samples are...

Evaluation

Evaluating transformers involves considering multiple classes of metrics and understanding the cost tradeoffs among these classes. Let’s see the main ones.

Quality

The quality of transformers can be measured against a number of generally available datasets. Let’s see the most commonly used ones.

GLUE

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE is available at https://gluebenchmark.com/.

GLUE consists of:

  • A benchmark of nine sentence or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty
  • A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language
  • A public leaderboard for tracking...
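
If you want to experiment with GLUE locally, the individual tasks are available through common dataset libraries. The sketch below uses TensorFlow Datasets and picks the MRPC task purely as an example (the Hugging Face datasets library exposes the same benchmark):

```python
# Load one GLUE task (MRPC: Microsoft Research Paraphrase Corpus) locally.
import tensorflow_datasets as tfds

mrpc, info = tfds.load('glue/mrpc', with_info=True)
print(info.splits)  # train / validation / test split sizes

# Each example is a sentence pair with a paraphrase label.
for example in mrpc['train'].take(1):
    print(example['sentence1'].numpy())
    print(example['sentence2'].numpy())
    print(example['label'].numpy())
```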

Optimization

Optimizing a transformer involves building lightweight, responsive, and energy-efficient models. Let’s see the most common ideas adopted to optimize a model.

Quantization

The key idea behind quantization is to approximate the weights of a network with a smaller precision. The idea is very simple, but it works quite well in practice. If you are interested in knowing more, we recommend the paper A Survey of Quantization Methods for Efficient Neural Network Inference, by Amir Gholami et al., https://arxiv.org/pdf/2103.13630.pdf.
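
In TensorFlow, the simplest way to try this is post-training quantization via the TensorFlow Lite converter. The sketch below quantizes a stand-in Keras model; the tiny model is only a placeholder for your trained transformer:

```python
# Post-training quantization with the TensorFlow Lite converter.
import tensorflow as tf

# Placeholder for an already trained Keras model.
model = tf.keras.Sequential([tf.keras.Input(shape=(32,)),
                             tf.keras.layers.Dense(10)])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Ask the converter to quantize weights to lower precision where possible.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('model_quantized.tflite', 'wb') as f:
    f.write(tflite_model)
```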

Weight pruning

The key idea behind weight pruning is to remove some connections in the network. Magnitude-based weight pruning gradually zeroes out the model weights with the smallest magnitudes during the training process, thereby increasing model sparsity. This simple technique has benefits both in terms of model size and cost of serving. Sparse...
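
A sketch of magnitude-based pruning with the TensorFlow Model Optimization toolkit (installed separately as tensorflow-model-optimization) is shown below; the sparsity schedule values are illustrative, not a recommendation:

```python
# Magnitude-based weight pruning with the TF Model Optimization toolkit.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10),
])

# Gradually zero out weights until 80% of them are zero.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# This callback must be passed to fit() so the pruning step counter advances.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
```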

Common pitfalls: dos and don’ts

In this section, we will give five dos and a few don’ts that are typically recommended when dealing with transformers.

Dos

Let’s start with recommended best practices:

  • Do use pretrained large models. Today, it is almost always better to start from an already available pretrained model, such as T5, than to train your transformer from scratch. By using a pretrained model, you stand on the shoulders of giants; think about it!
  • Do start with few-shot learning. When you start working with transformers, it’s always a good idea to begin with a pretrained model and then perform a lightweight few-shot learning step. Generally, this improves the quality of results without high computational costs.
  • Do use fine-tuning on your domain data and on your customer data (see the sketch after this list). After playing with pretrained models and few-shot learning, you might consider doing a proper fine-tuning on your...
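
As a sketch of the fine-tuning recommendation above, the snippet below fine-tunes a pretrained Hugging Face model with Keras on a couple of toy examples. The model name, hyperparameters, and data are placeholders for your own domain data, and depending on your transformers version you can also compile without an explicit loss and let the model compute it internally:

```python
# Fine-tuning a pretrained transformer for text classification with Keras.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = 'distilbert-base-uncased'   # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["great product", "terrible support"]   # stand-ins for your domain data
labels = [1, 0]

encodings = tokenizer(texts, truncation=True, padding=True, return_tensors='tf')
dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels)).batch(2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dataset, epochs=1)
```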

The future of transformers

Transformers found their initial applications in NLP tasks, while CNNs are typically used for image processing systems. Recently, transformers have started to be successfully used for vision processing tasks as well. Vision transformers compute relationships among pixels in various small sections of an image (for example, 16 x 16 pixels), an approach proposed in the seminal paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy et al., https://arxiv.org/abs/2010.11929, to make the attention computation feasible.
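
The patch-extraction step that makes this feasible can be sketched in a few lines of TensorFlow; the image size and patch size below simply follow the 16 x 16 example:

```python
# Turn an image into a sequence of 16x16 patches, as a vision transformer does
# before applying attention.
import tensorflow as tf

images = tf.random.normal((1, 224, 224, 3))        # a dummy batch of one image
patches = tf.image.extract_patches(
    images=images,
    sizes=[1, 16, 16, 1],
    strides=[1, 16, 16, 1],
    rates=[1, 1, 1, 1],
    padding='VALID')
# 14 x 14 = 196 patches, each flattened to 16 * 16 * 3 = 768 values.
patches = tf.reshape(patches, (1, -1, 16 * 16 * 3))
print(patches.shape)  # (1, 196, 768)
```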

Vision transformers (ViTs) are today used for complex applications such as autonomous driving. Tesla’s engineers showed that their Tesla Autopilot uses a transformer on the multi-camera system in cars. Of course, ViTs are also used for more traditional computer vision tasks, including but not limited to image classification, object detection, video deepfake detection, image segmentation...

Summary

In this chapter, we discussed transformers, a deep learning architecture that has revolutionized the traditional natural language processing field. We started by reviewing the key intuitions behind the architecture and the various categories of transformers, together with a deep dive into the most popular models. Then, we focused on implementations based both on the vanilla architecture and on popular libraries such as Hugging Face and TFHub. After that, we briefly discussed evaluation, optimization, and some of the best practices commonly adopted when using transformers. The last section was devoted to reviewing how transformers can be used to perform computer vision tasks, a totally different domain from NLP; this requires a careful definition of the attention mechanism. In the end, attention is all you need! And at the core of attention is nothing more than the cosine similarity between vectors.

The next chapter is devoted to unsupervised learning.
