In Chapter 7, Understanding Recurrent Networks, we outlined several types of recurrent models, depending on the input-output combinations. One of them is indirect many-to-many, or sequence-to-sequence (seq2seq), where an input sequence is transformed into another, different output sequence, not necessarily of the same length as the input. Machine translation is the most popular type of seq2seq task. The input sequences are the words of a sentence in one language, and the output sequences are the words of the same sentence translated into another language. For example, we can translate the English sequence tourist attraction into the German Touristenattraktion. Not only is the output sentence a different length, but there is no direct correspondence between the elements of the input and output sequences. In particular, one output element...
Introducing seq2seq models
Seq2seq, or encoder-decoder (see Sequence to Sequence Learning with Neural Networks at https://arxiv.org/abs/1409.3215), models use RNNs in a way that's especially suited for solving tasks with indirect many-to-many relationships between the input and the output. A similar model was also proposed in another pioneering paper, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (go to https://arxiv.org/abs/1406.1078 for more information). The following is a diagram of the seq2seq model. The input sequence [A, B, C, <EOS>] is decoded into the output sequence [W, X, Y, Z, <EOS>]:
The model consists of two parts: an encoder and a decoder. Here's how the inference part works:
- The encoder is an RNN. The original paper uses LSTM, but GRU...
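The inference loop described above can be sketched in NumPy with toy, randomly initialized parameters (a real model would learn them, and the paper uses LSTM cells rather than the plain tanh RNN used here; the names `encode` and `decode` are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab = 8, 5            # toy sizes; token 0 plays the role of <EOS>
EOS = 0

# Randomly initialized toy parameters (learned in a real model).
W_enc = rng.normal(size=(hidden, hidden)); U_enc = rng.normal(size=(hidden, vocab))
W_dec = rng.normal(size=(hidden, hidden)); U_dec = rng.normal(size=(hidden, vocab))
V_out = rng.normal(size=(vocab, hidden))

def one_hot(i):
    v = np.zeros(vocab); v[i] = 1.0
    return v

def encode(tokens):
    """Run a simple tanh RNN over the input; the final hidden state
    is the 'thought vector' that summarizes the whole sequence."""
    h = np.zeros(hidden)
    for t in tokens:
        h = np.tanh(W_enc @ h + U_enc @ one_hot(t))
    return h

def decode(thought, max_len=10):
    """Greedy decoding: starting from the thought vector, feed each
    output token back in as the next input, until <EOS> or max_len."""
    h, token, out = thought, EOS, []
    for _ in range(max_len):
        h = np.tanh(W_dec @ h + U_dec @ one_hot(token))
        token = int(np.argmax(V_out @ h))
        if token == EOS:
            break
        out.append(token)
    return out

thought = encode([1, 2, 3, EOS])   # input sequence [A, B, C, <EOS>]
print(decode(thought))             # untrained weights, so the output is arbitrary
```

Note that the decoder sees the input sequence only through the fixed-size thought vector, which is exactly the bottleneck discussed in the next section.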
Seq2seq with attention
The decoder has to generate the entire output sequence based solely on the thought vector. For this to work, the thought vector has to encode all of the information of the input sequence; however, the encoder is an RNN, and we can expect its hidden state to carry more information about the most recent sequence elements than about the earliest ones. Using LSTM cells and reversing the input helps, but cannot eliminate this information loss entirely. Because of this, the thought vector becomes something of a bottleneck. As a result, the seq2seq model works well for short sentences, but its performance deteriorates for longer ones.
Bahdanau attention
We can solve this problem with the help of the attention mechanism (see Neural Machine...
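A NumPy sketch of the additive (Bahdanau-style) attention score: each encoder state is scored against the previous decoder state, the scores are normalized with softmax, and the context vector is the resulting weighted sum. The parameter names `W1`, `W2`, and `v` are illustrative, and the weights are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(1)
dec_dim, enc_dim, attn_dim = 4, 4, 6

# Toy parameters (learned jointly with the rest of the model in practice).
W1 = rng.normal(size=(attn_dim, dec_dim))   # projects the decoder state
W2 = rng.normal(size=(attn_dim, enc_dim))   # projects each encoder state
v  = rng.normal(size=(attn_dim,))           # scoring vector

def bahdanau_attention(s_prev, enc_states):
    """Additive attention: score each encoder state h_t against the
    previous decoder state s_prev, softmax the scores into weights,
    and return the weighted sum of encoder states (the context vector)."""
    scores = np.array([v @ np.tanh(W1 @ s_prev + W2 @ h) for h in enc_states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over input positions
    context = weights @ enc_states          # convex combination of the h_t
    return context, weights

enc_states = rng.normal(size=(5, enc_dim))  # one hidden state per input position
s_prev = rng.normal(size=(dec_dim,))
context, weights = bahdanau_attention(s_prev, enc_states)
print(weights.sum())  # ~1.0: the weights form a distribution over input positions
```

Because the decoder receives a fresh context vector at every step, it is no longer limited to the single fixed-size thought vector.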
Understanding transformers
We spent the better part of this chapter touting the advantages of the attention mechanism. But we have still used attention in the context of RNNs—in that sense, it works as an addition on top of the core recurrent nature of these models. Since attention is so good, is there a way to use it on its own, without the RNN part? It turns out that there is. The paper Attention Is All You Need (https://arxiv.org/abs/1706.03762) introduces a new encoder-decoder architecture called the transformer, which relies solely on the attention mechanism. First, we'll focus our attention on the transformer attention (pun intended) mechanism.
The transformer attention
Before focusing on the entire model, let...
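A minimal NumPy sketch of the scaled dot-product attention at the heart of the transformer, computing Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. The shapes here are arbitrary toy values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; the sqrt(d_k) scaling keeps
    the dot products from growing with the key dimension."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(2)
Q = rng.normal(size=(3, 8))   # 3 queries, d_k = 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)     # (3, 8) (3, 5)
```

Multi-head attention runs several such computations in parallel over different learned projections of Q, K, and V, then concatenates the results.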
Transformer language models
In Chapter 6, Language Modeling, we introduced several different language models (word2vec, GloVe, and fastText) that use the context of a word (its surrounding words) to create word vectors (embeddings). These models share some common properties:
- They are context-free (I know it contradicts the previous statement) because they create a single global word vector for each word, based on all of its occurrences in the training text. For example, lead can have completely different meanings in the phrases lead the way and lead atom, yet the model will try to embed both meanings in the same word vector.
- They are position-free because they don't take into account the order of the contextual words when training for the embedding vectors.
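The context-free property above can be illustrated with a word2vec-style static lookup table (the table here is random and purely for illustration): the word lead gets the same vector in both phrases, regardless of its sense.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy static embedding table: one global vector per word (word2vec-style).
vocab = ["lead", "the", "way", "atom"]
emb = {w: rng.normal(size=4) for w in vocab}

def embed(sentence):
    """Look up a fixed vector for each word, ignoring its context."""
    return [emb[w] for w in sentence.split()]

v1 = embed("lead the way")[0]
v2 = embed("lead atom")[0]
print(np.array_equal(v1, v2))  # True: same vector, despite different senses
```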
In contrast, it's possible to create transformer-based language models, which are both context- and position-dependent...
Summary
In this chapter, we focused on seq2seq models and the attention mechanism. First, we discussed and implemented a regular recurrent encoder-decoder seq2seq model and learned how to complement it with the attention mechanism. Then, we talked about and implemented a purely attention-based type of model called a transformer. We also defined multi-head attention in the context of transformers. Next, we discussed transformer language models (such as BERT, Transformer-XL, and XLNet). Finally, we implemented a simple text-generation example using the transformers library.
This chapter concludes our series of chapters focused on natural language processing. In the next chapter, we'll talk about some new trends in deep learning that aren't fully mature yet but hold great potential for the future.