Machine Learning with PyTorch and Scikit-Learn

Product type: Book
Published: Feb 2022
Publisher: Packt
ISBN-13: 9781801819312
Pages: 774
Edition: 1st
Authors: Sebastian Raschka, Yuxi (Hayden) Liu, Vahid Mirjalili

Table of Contents (22 chapters)

Preface
1. Giving Computers the Ability to Learn from Data
2. Training Simple Machine Learning Algorithms for Classification
3. A Tour of Machine Learning Classifiers Using Scikit-Learn
4. Building Good Training Datasets – Data Preprocessing
5. Compressing Data via Dimensionality Reduction
6. Learning Best Practices for Model Evaluation and Hyperparameter Tuning
7. Combining Different Models for Ensemble Learning
8. Applying Machine Learning to Sentiment Analysis
9. Predicting Continuous Target Variables with Regression Analysis
10. Working with Unlabeled Data – Clustering Analysis
11. Implementing a Multilayer Artificial Neural Network from Scratch
12. Parallelizing Neural Network Training with PyTorch
13. Going Deeper – The Mechanics of PyTorch
14. Classifying Images with Deep Convolutional Neural Networks
15. Modeling Sequential Data Using Recurrent Neural Networks
16. Transformers – Improving Natural Language Processing with Attention Mechanisms
17. Generative Adversarial Networks for Synthesizing New Data
18. Graph Neural Networks for Capturing Dependencies in Graph Structured Data
19. Reinforcement Learning for Decision Making in Complex Environments
20. Other Books You May Enjoy
21. Index

Transformers – Improving Natural Language Processing with Attention Mechanisms

In the previous chapter, we learned about recurrent neural networks (RNNs) and their applications in natural language processing (NLP) through a sentiment analysis project. However, a new architecture has recently emerged that has been shown to outperform the RNN-based sequence-to-sequence (seq2seq) models in several NLP tasks. This is the so-called transformer architecture.

Transformers have revolutionized natural language processing and have been at the forefront of many impressive applications ranging from automated language translation (https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html) and modeling fundamental properties of protein sequences (https://www.pnas.org/content/118/15/e2016239118.short) to creating an AI that helps people write code (https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer).

In this chapter, you will learn about...

Adding an attention mechanism to RNNs

In this section, we discuss the motivation behind developing an attention mechanism, which helps predictive models focus on certain parts of the input sequence more than others, and how it was originally used in the context of RNNs. Note that this section provides a historical perspective explaining why the attention mechanism was developed. If the individual mathematical details appear complicated, feel free to skip over them; they are not needed for the next section, which explains the self-attention mechanism for transformers, the focus of this chapter.

Attention helps RNNs with accessing information

To understand the development of an attention mechanism, consider the traditional RNN model for a seq2seq task like language translation, which parses the entire input sequence (for instance, one or more sentences) before producing the translation, as shown in Figure 16.1:

Figure 16.1: A traditional RNN encoder-decoder...
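To make the idea concrete, here is a minimal, self-contained sketch (not the book's code) of what attention adds to this picture: instead of relying only on the final encoder state, the decoder scores every encoder hidden state against its current state and forms a weighted context vector. The dot-product scoring and the toy tensor shapes are illustrative assumptions; the original RNN attention work uses learned (additive) scoring functions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: a decoder state attending over RNN encoder hidden states.
# Dot-product scoring and the toy shapes are illustrative assumptions.
torch.manual_seed(0)

seq_len, hidden_dim = 5, 8
encoder_states = torch.randn(seq_len, hidden_dim)  # one hidden state per input token
decoder_state = torch.randn(hidden_dim)            # current decoder hidden state

scores = encoder_states @ decoder_state            # similarity per input position, shape (seq_len,)
weights = F.softmax(scores, dim=0)                 # attention weights, sum to 1
context = weights @ encoder_states                 # weighted sum of encoder states, shape (hidden_dim,)
print(weights, context.shape)
```

The context vector is then fed into the decoder alongside its regular input, giving it direct access to every input position rather than only the last encoder state.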

Introducing the self-attention mechanism

In the previous section, we saw that attention mechanisms can help RNNs remember context when working with long sequences. As we will see in the next section, we can build an architecture entirely based on attention, without the recurrent parts of an RNN. This attention-based architecture is known as the transformer, and we will discuss it in more detail later.

Transformers can appear a bit complicated at first glance. So, before we discuss them in the next section, let us dive into the self-attention mechanism they use. As we will see, this self-attention mechanism is just a different flavor of the attention mechanism that we discussed in the previous section. We can think of the previously discussed attention mechanism as an operation that connects two different modules, that is, the encoder and decoder of the RNN. As we will see, self-attention focuses only on the input and captures only dependencies...
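As a rough illustration of this point, the following sketch computes basic (non-parameterized) self-attention for a single toy input sequence: every position is compared with every other position of the same input, and each output is a weighted mixture of all input elements. The tensor sizes and random inputs are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

# Basic (non-parameterized) self-attention over one toy sequence:
# all positions attend to all positions of the *same* input.
torch.manual_seed(1)

seq_len, embed_dim = 4, 6
x = torch.randn(seq_len, embed_dim)        # embedded input sequence (illustrative)

attn_scores = x @ x.T                      # pairwise similarities, shape (seq_len, seq_len)
attn_weights = F.softmax(attn_scores, dim=1)

context = attn_weights @ x                 # each row is a weighted mixture of all inputs
print(attn_weights.sum(dim=1))             # every row of weights sums to 1
```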

Attention is all we need: introducing the original transformer architecture

Interestingly, the original transformer architecture is based on an attention mechanism that was first used in an RNN. Originally, the intention behind using an attention mechanism was to improve the text generation capabilities of RNNs when working with long sentences. However, only a few years after experimenting with attention mechanisms for RNNs, researchers found that an attention-based language model was even more powerful when the recurrent layers were removed. This led to the development of the transformer architecture, which is the main topic of this chapter and the remaining sections.

The transformer architecture was first proposed in the NeurIPS 2017 paper Attention Is All You Need by A. Vaswani and colleagues (https://arxiv.org/abs/1706.03762). Thanks to the self-attention mechanism, a transformer model can capture long-range dependencies among the elements in an input sequence—in an...
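The core operation of that paper is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where the queries, keys, and values are learned projections of the input. The sketch below implements this formula with toy dimensions; the random matrices stand in for learned nn.Linear projections and are assumptions for illustration.

```python
import math
import torch
import torch.nn.functional as F

# Scaled dot-product attention from "Attention Is All You Need":
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
# Toy dimensions and random projection matrices are illustrative only.
torch.manual_seed(2)

seq_len, d_model, d_k = 4, 16, 8
x = torch.randn(seq_len, d_model)

W_q = torch.randn(d_model, d_k)            # stand-ins for learned nn.Linear weights
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

queries, keys, values = x @ W_q, x @ W_k, x @ W_v

scores = queries @ keys.T / math.sqrt(d_k) # scaling keeps the softmax well-behaved for larger d_k
weights = F.softmax(scores, dim=-1)
output = weights @ values                  # shape: (seq_len, d_k)
print(output.shape)
```

PyTorch also ships a ready-made torch.nn.MultiheadAttention module, which runs several such attention operations ("heads") in parallel and concatenates their outputs, as done in the original transformer.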

Building large-scale language models by leveraging unlabeled data

In this section, we will discuss popular large-scale transformer models that emerged from the original transformer. One common theme among these transformers is that they are pre-trained on very large, unlabeled datasets and then fine-tuned for their respective target tasks. First, we will introduce the common training procedure of transformer-based models and explain how it is different from the original transformer. Then, we will focus on popular large-scale language models including Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), and Bidirectional and Auto-Regressive Transformers (BART).
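As a hedged illustration of this pre-train/fine-tune split (not the chapter's code), the snippet below uses the Hugging Face transformers package to load the same pre-trained backbone with two different heads: a masked-language-modeling head of the kind used during BERT-style pre-training, and a classification head of the kind added for a downstream target task. The model name and label count are assumptions chosen for the example.

```python
# Illustrative sketch using the Hugging Face transformers package:
# the same pre-trained backbone can be loaded with different heads.
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification

# Head matching the BERT-style pre-training objective (masked language modeling)
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Classification head added on top of the same backbone for fine-tuning
# on a labeled target task (here: 2 classes, an illustrative assumption)
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
```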

Pre-training and fine-tuning transformer models

In an earlier section, Attention is all we need: introducing the original transformer architecture, we discussed how the original transformer architecture can be used for language translation. Language translation is...

Fine-tuning a BERT model in PyTorch

Now that we have introduced and discussed all the necessary concepts and the theory behind the original transformer and popular transformer-based models, it’s time to take a look at the more practical part! In this section, you will learn how to fine-tune a BERT model for sentiment classification in PyTorch.

Note that although there are many other transformer-based models to choose from, BERT provides a nice balance between model popularity and having a manageable model size so that it can be fine-tuned on a single GPU. Note also that pre-training a BERT model from scratch is painful and quite unnecessary considering the availability of the transformers Python package provided by Hugging Face, which includes many pre-trained models that are ready for fine-tuning.
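Before diving into the details, here is a condensed, hedged sketch of what such a fine-tuning setup can look like, assuming the Hugging Face transformers package is installed; the two toy reviews below merely stand in for the tokenized IMDb dataset used in the chapter.

```python
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# Condensed, illustrative fine-tuning sketch; toy reviews stand in for IMDb data.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
).to(device)

texts = ["A wonderful, moving film.", "Dull plot and wooden acting."]
labels = torch.tensor([1, 0]).to(device)              # 1 = positive, 0 = negative
batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt").to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):                                 # a real run iterates over DataLoader batches
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)            # returns the cross-entropy loss directly
    outputs.loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)       # predicted sentiment per review
print(preds)
```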

In the following sections, you’ll see how to prepare and tokenize the IMDb movie review dataset and fine-tune the distilled BERT model to perform sentiment classification...

Summary

In this chapter, we introduced the transformer architecture, a whole new model architecture for natural language processing. The transformer architecture is built on a concept called self-attention, which we introduced step by step. First, we looked at an RNN outfitted with attention in order to improve its translation capabilities for long sentences. Then, we gently introduced the concept of self-attention and explained how it is used in the multi-head attention module within the transformer.

Many different derivatives of the transformer architecture have emerged and evolved since the original transformer was published in 2017. In this chapter, we focused on a selection of some of the most popular ones: the GPT model family, BERT, and BART. GPT is a unidirectional model that is particularly good at generating new text. BERT takes a bidirectional approach, which is better suited for other types of tasks, for example, classification. Lastly, BART combines...
