
Transformers – Improving Natural Language Processing with Attention Mechanisms

In the previous chapter, we learned about recurrent neural networks (RNNs) and their applications in natural language processing (NLP) through a sentiment analysis project. However, a new architecture has recently emerged that has been shown to outperform RNN-based sequence-to-sequence (seq2seq) models on several NLP tasks: the so-called transformer architecture.

Transformers have revolutionized natural language processing and have been at the forefront of many impressive applications ranging from automated language translation (https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html) and modeling fundamental properties of protein sequences (https://www.pnas.org/content/118/15/e2016239118.short) to creating an AI that helps people write code (https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer).

In this chapter, you will learn about:

Adding an attention mechanism to RNNs
Introducing the self-attention mechanism
The original transformer architecture
Building large-scale language models, such as GPT, BERT, and BART, by leveraging unlabeled data
Fine-tuning a BERT model for sentiment classification in PyTorch

Adding an attention mechanism to RNNs

In this section, we discuss the motivation behind developing an attention mechanism, which helps predictive models focus on certain parts of the input sequence more than others, and how it was originally used in the context of RNNs. Note that this section provides a historical perspective explaining why the attention mechanism was developed. If individual mathematical details appear complicated, feel free to skip over them; they are not needed for the next section, which explains the self-attention mechanism for transformers, the focus of this chapter.

Attention helps RNNs with accessing information

To understand the development of an attention mechanism, consider the traditional RNN model for a seq2seq task like language translation, which parses the entire input sequence (for instance, one or more sentences) before producing the translation, as shown in Figure 16.1:

Figure 16.1: A traditional RNN encoder-decoder...
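With attention, the decoder no longer has to squeeze everything it needs into the encoder's final hidden state; at each decoding step, it computes a weighted combination of all encoder hidden states. Below is a minimal PyTorch sketch of this idea using random tensors. The shapes and the simple dot-product scoring are illustrative assumptions (the original formulation scores encoder states with a small learned network), and a real model would learn everything end to end:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: one sentence of 5 source tokens, encoded by an RNN
# into 5 hidden states of size 16, plus the decoder's current hidden state.
encoder_states = torch.randn(1, 5, 16)   # (batch, src_len, hidden_dim)
decoder_state = torch.randn(1, 16)       # (batch, hidden_dim)

# Score each encoder state against the decoder state (dot-product scoring
# here purely for illustration; additive/MLP scoring is also common).
scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2)).squeeze(2)  # (1, 5)

# Softmax turns the scores into attention weights that sum to 1.
attn_weights = F.softmax(scores, dim=1)

# The context vector is the attention-weighted sum of all encoder states;
# the decoder consumes it alongside its own hidden state at this time step.
context = torch.bmm(attn_weights.unsqueeze(1), encoder_states).squeeze(1)   # (1, 16)

print(attn_weights)   # how strongly each input position is attended to
print(context.shape)  # torch.Size([1, 16])
```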

Introducing the self-attention mechanism

In the previous section, we saw that attention mechanisms can help RNNs remember context when working with long sequences. As we will see in the next section, we can have an architecture entirely based on attention, without the recurrent parts of an RNN. This attention-based architecture is known as the transformer, and we will discuss it in more detail later.

Transformers can appear a bit complicated at first glance. So, before we discuss them in detail in the next section, let us dive into the self-attention mechanism they use. As we will see, this self-attention mechanism is just a different flavor of the attention mechanism discussed in the previous section. We can think of that attention mechanism as an operation that connects two different modules, namely the encoder and decoder of the RNN. Self-attention, in contrast, focuses only on the input and captures only dependencies...
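Before adding any learned parameters, the core idea can be sketched in a few lines of PyTorch: every element of a sequence is compared with every other element of the same sequence, and the resulting similarity weights are used to form new, context-aware representations. The toy sequence length and embedding size below are arbitrary illustrative choices:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy input: a sequence of 4 token embeddings of size 8.
x = torch.randn(4, 8)  # (seq_len, embed_dim)

# Compare every position with every other position via dot products;
# the similarity matrix therefore has shape (seq_len, seq_len).
scores = x @ x.T

# Softmax over each row yields attention weights that sum to 1.
attn_weights = F.softmax(scores, dim=-1)

# Each output is a weighted sum over the *same* input sequence: the sequence
# attends to itself, with no separate encoder and decoder modules involved.
outputs = attn_weights @ x
print(outputs.shape)  # torch.Size([4, 8])
```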

Attention is all we need: introducing the original transformer architecture

Interestingly, the original transformer architecture is based on an attention mechanism that was first used with RNNs. Originally, the intention behind using an attention mechanism was to improve the text generation capabilities of RNNs when working with long sentences. However, only a few years after experimenting with attention mechanisms for RNNs, researchers found that an attention-based language model was even more powerful when the recurrent layers were removed entirely. This led to the development of the transformer architecture, which is the main topic of this chapter and its remaining sections.

The transformer architecture was first proposed in the NeurIPS 2017 paper Attention Is All You Need by A. Vaswani and colleagues (https://arxiv.org/abs/1706.03762). Thanks to the self-attention mechanism, a transformer model can capture long-range dependencies among the elements in an input sequence—in an...
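A minimal sketch of the scaled dot-product self-attention at the core of that paper is shown below. The dimensions and the single attention head are illustrative simplifications (the paper combines several such heads into multi-head attention, which PyTorch also provides as torch.nn.MultiheadAttention), but the learned query, key, and value projections and the 1/sqrt(d_k) scaling follow the paper's formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = d_k = 8              # illustrative sizes
x = torch.randn(4, embed_dim)    # toy sequence of 4 token embeddings

# Learned projections derive queries, keys, and values from the same input.
W_q = nn.Linear(embed_dim, d_k, bias=False)
W_k = nn.Linear(embed_dim, d_k, bias=False)
W_v = nn.Linear(embed_dim, d_k, bias=False)
queries, keys, values = W_q(x), W_k(x), W_v(x)

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V. The scaling keeps
# the dot products from growing with d_k and saturating the softmax.
scores = queries @ keys.T / d_k ** 0.5
attn_weights = F.softmax(scores, dim=-1)
outputs = attn_weights @ values

print(outputs.shape)  # torch.Size([4, 8]); every output attends to all positions
```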

Building large-scale language models by leveraging unlabeled data

In this section, we will discuss popular large-scale transformer models that emerged from the original transformer. One common theme among these transformers is that they are pre-trained on very large, unlabeled datasets and then fine-tuned for their respective target tasks. First, we will introduce the common training procedure of transformer-based models and explain how it is different from the original transformer. Then, we will focus on popular large-scale language models including Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), and Bidirectional and Auto-Regressive Transformers (BART).
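To make the two stages concrete, here is a minimal sketch using the Hugging Face transformers package. The model name distilbert-base-uncased and the two-class head are illustrative choices, and masked-language modeling is shown as the example pre-training objective (the one used by BERT-style models; GPT-style models predict the next token instead):

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForSequenceClassification)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Stage 1 (pre-training objective): the backbone with a language-modeling head,
# trained on large unlabeled corpora by predicting masked tokens.
mlm_model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
masked = tokenizer("The movie was really [MASK].", return_tensors="pt")
print(mlm_model(**masked).logits.shape)   # (1, seq_len, vocab_size)

# Stage 2 (fine-tuning): the same pre-trained backbone is reloaded, but with a
# fresh, randomly initialized classification head for the labeled target task.
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
inputs = tokenizer("An absolutely delightful movie!", return_tensors="pt")
print(clf_model(**inputs).logits.shape)   # (1, 2) -- untrained until fine-tuned
```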

Pre-training and fine-tuning transformer models

In an earlier section, Attention is all we need: introducing the original transformer architecture, we discussed how the original transformer architecture can be used for language translation. Language translation is...

Fine-tuning a BERT model in PyTorch

Now that we have introduced and discussed all the necessary concepts and the theory behind the original transformer and popular transformer-based models, it’s time to take a look at the more practical part! In this section, you will learn how to fine-tune a BERT model for sentiment classification in PyTorch.

Note that although there are many other transformer-based models to choose from, BERT offers a nice balance between model popularity and a manageable model size, so it can be fine-tuned on a single GPU. Note also that pre-training a BERT model from scratch is painful and quite unnecessary given the availability of the transformers Python package provided by Hugging Face, which includes many pre-trained models that are ready for fine-tuning.
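As a preview of what this looks like in code, here is a minimal sketch of a single fine-tuning step on a tiny hand-made batch standing in for the movie review data described next. The DistilBertTokenizerFast and DistilBertForSequenceClassification classes from the transformers package are one possible interface to a distilled BERT model; a full workflow would tokenize the whole dataset, wrap it in a PyTorch DataLoader, and train for several epochs:

```python
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# Hypothetical mini-batch standing in for the IMDb movie review dataset.
reviews = ["A wonderful, heartfelt film.", "Dull plot and wooden acting."]
labels = torch.tensor([1, 0])  # 1 = positive sentiment, 0 = negative sentiment

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Pad/truncate the reviews to a common length and return PyTorch tensors.
batch = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step: passing labels makes the model return a cross-entropy loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))
```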

In the following sections, you’ll see how to prepare and tokenize the IMDb movie review dataset and fine-tune the distilled BERT model to perform sentiment classification...

Summary

In this chapter, we introduced a whole new model architecture for natural language processing: the transformer architecture. The transformer is built on a concept called self-attention, which we introduced step by step. First, we looked at an RNN outfitted with attention to improve its translation capabilities for long sentences. Then, we gently introduced the concept of self-attention and explained how it is used in the multi-head attention module of the transformer.

Many different derivatives of the transformer architecture have emerged and evolved since the original transformer was published in 2017. In this chapter, we focused on a selection of some of the most popular ones: the GPT model family, BERT, and BART. GPT is a unidirectional model that is particularly good at generating new text. BERT takes a bidirectional approach, which is better suited for other types of tasks, for example, classification. Lastly, BART combines...
