Attention and Transformers for Time Series

In the previous chapter, we rolled up our sleeves and implemented a few deep learning (DL) systems for time series forecasting. We used the common building blocks we discussed in Chapter 12, Building Blocks of Deep Learning for Time Series, put them together in an encoder-decoder architecture, and trained them to produce the forecast we desired.

Now, let’s talk about another key concept in DL that has taken the field by storm over the past few years—attention. Attention has a long-standing history, which has culminated in it being one of the most sought-after tools in the DL toolkit. This chapter takes you on a journey to understand attention and transformer models from the ground up from a theoretical perspective and solidify that understanding with practical examples.

In this chapter, we will be covering these main topics:

  • What is attention?
  • Generalized attention model
  • Forecasting with sequence-to-sequence...

Technical requirements

If you have not set up the Anaconda environment following the instructions in the Preface, please do that in order to get a working environment with all the packages and datasets required for the code in this book.

You need to run the following notebooks for this chapter:

  • 02 - Preprocessing London Smart Meter Dataset.ipynb in Chapter02
  • 01-Setting up Experiment Harness.ipynb in Chapter04
  • 01-Feature Engineering.ipynb in Chapter06
  • 02-One-Step RNN.ipynb and 03-Seq2Seq RNN.ipynb in Chapter13 (for benchmarking)
  • 00-Single Step Backtesting Baselines.ipynb and 01-Forecasting with ML.ipynb in Chapter08

The associated code for the chapter can be found at https://github.com/PacktPublishing/Modern-Time-Series-Forecasting-with-Python-/tree/main/notebooks/Chapter14.

What is attention?

The idea of attention was inspired by human cognitive function. At any moment, the optic nerves in our eyes, the olfactory nerves in our noses, and the auditory nerves in our ears send a massive amount of sensory input to the brain. This is way too much information, definitely more than the brain can handle. But our brains have developed a mechanism that helps us pay attention only to the stimuli that matter—such as a sound or a smell that doesn’t belong. Years of evolution have trained our brains to pick out anomalous sounds or smells because that was key to surviving in the wild, where predators roamed free.

Apart from this kind of instinctive attention, we are also able to control our attention by what we call focusing on something. You are doing it right now by choosing to ignore all the other stimuli that you are getting and focusing your attention on the contents of this book. While you are reading, your mobile phone pings you, and the...

The generalized attention model

Over the years, researchers have come up with different ways of calculating attention weights and using attention in DL models. Sneha Chaudhari et al. published a survey paper on attention models that proposes a generalized attention model, trying to incorporate all these variations in a single framework. Let’s structure our discussion around this generalized framework.

We can think of an attention model as learning an attention distribution (α) over a set of keys, K, using a set of queries, q. In the example we discussed in the last section, the query would be the hidden state from the last timestep during decoding, and the keys would be all the hidden states generated from the input sequence. In some cases, the generated attention distribution is applied to another set of inputs called values, V. In many cases, K and V are the same, but to maintain the general form of the framework, we consider these separately...
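To make the framework concrete, here is a minimal sketch of the generalized attention computation in PyTorch. The function name and tensor shapes are illustrative (they are not taken from the book's src code), and scaled dot product is used as the scoring function purely as an example; additive or general attention would swap in a different score.

```python
import torch
import torch.nn.functional as F


def generalized_attention(q, K, V):
    """A sketch of the generalized attention model (names are illustrative).

    q: queries of shape (batch, d)            e.g., the decoder hidden state
    K: keys of shape (batch, seq_len, d)      e.g., the encoder hidden states
    V: values of shape (batch, seq_len, d)    often identical to K
    """
    d = K.size(-1)
    # Alignment scores between the query and each key. Scaled dot product is
    # used here only as an example scoring function.
    scores = torch.bmm(K, q.unsqueeze(-1)).squeeze(-1) / d ** 0.5  # (batch, seq_len)
    # Attention distribution (alpha) over the keys; softmax makes it sum to 1.
    alpha = F.softmax(scores, dim=-1)
    # Context vector: the attention-weighted sum of the values.
    context = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)          # (batch, d)
    return context, alpha


# Usage with random tensors; passing K as the values reproduces the common
# case where K and V are the same.
q = torch.randn(32, 64)
K = torch.randn(32, 48, 64)
context, alpha = generalized_attention(q, K, K)
print(context.shape, alpha.shape)  # torch.Size([32, 64]) torch.Size([32, 48])
```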

Forecasting with sequence-to-sequence models and attention

Let’s pick up the thread from Chapter 13, Common Modeling Patterns for Time Series, where we used Seq2Seq models to forecast a sample household (if you have not read the previous chapter, I strongly suggest you do so now), and modify the Seq2SeqModel class to also include an attention mechanism.

Notebook alert

To follow along with the complete code, use the notebook named 01-Seq2Seq RNN with Attention.ipynb in the Chapter14 folder and the code in the src folder.

We are still going to inherit from the BaseModel class we defined in src/dl/models.py, and the overall structure is going to be very similar to the Seq2SeqModel class. The key difference is that in our new model, with attention, we do not accept a fully connected layer as the decoder. This is not because it is impossible, but for the convenience and brevity of the implementation. In fact, implementing a Seq2Seq model with a fully connected decoder can...
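To illustrate where attention slots into the decoding loop, here is a rough sketch of a single decoding step with attention. The class name AttnDecoderStep and its dimensions are hypothetical; this is not the book's Seq2Seq-with-attention class, just the general shape of the computation: the previous decoder hidden state acts as the query, the encoder outputs act as keys and values, and the resulting context vector is fed into the RNN cell alongside the current input.

```python
import torch
import torch.nn as nn


class AttnDecoderStep(nn.Module):
    """A hypothetical single decoding step with attention (not the book's class).

    The previous decoder hidden state is the query, the encoder outputs are the
    keys and values, and the context vector is concatenated with the current
    input before the RNN cell. A linear head emits the one-step forecast.
    """

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.rnn_cell = nn.GRUCell(input_size + hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x_t, h_prev, encoder_outputs):
        # x_t: (batch, input_size); h_prev: (batch, hidden);
        # encoder_outputs: (batch, seq_len, hidden)
        scores = torch.bmm(encoder_outputs, h_prev.unsqueeze(-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)               # attention weights
        context = torch.bmm(alpha.unsqueeze(1), encoder_outputs).squeeze(1)
        h_t = self.rnn_cell(torch.cat([x_t, context], dim=-1), h_prev)
        return self.out(h_t), h_t
```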

Transformers – Attention is all you need

While the introduction of attention was a shot in the arm for RNNs and Seq2Seq models, they still had one problem: the RNNs were recurrent, which meant they had to process each word in a sentence sequentially. For popular Seq2Seq applications such as language translation, this made processing long sequences of words really time-consuming. In short, it was difficult to scale them to a large corpus of data. In 2017, Vaswani et al. authored a seminal paper titled Attention Is All You Need. Just as the title implies, they explored an architecture that used attention (scaled dot product attention) and threw away recurrent networks altogether. To the surprise of NLP researchers around the world, these models (which were dubbed Transformers) outperformed the then state-of-the-art Seq2Seq models in language translation.
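As a quick illustration of how self-attention removes the recurrence, the snippet below runs PyTorch's built-in multi-head attention (which uses scaled dot product attention internally) over an entire window of embedded timesteps at once. The dimensions are illustrative, not taken from the book's experiments.

```python
import torch
import torch.nn as nn

# Self-attention processes every timestep of a window in parallel, with no
# recurrence over the sequence.
batch, seq_len, d_model = 32, 48, 64
x = torch.randn(batch, seq_len, d_model)        # an embedded input window

self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
out, attn_weights = self_attn(x, x, x)          # queries, keys, and values are all x
print(out.shape, attn_weights.shape)            # (32, 48, 64) and (32, 48, 48)
```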

This spurred a flurry of research activity around this new class of models...

Forecasting with Transformers

For continuity, we will stick with the same household we forecasted with RNNs and RNNs with attention.

Notebook alert

To follow along with the complete code, use the notebook named 03-Transformers.ipynb in the Chapter14 folder and the code in the src folder.

Although we learned about the vanilla Transformer as a model with an encoder-decoder architecture, it was really designed for language translation tasks. In language translation, the source sequence and the target sequence are quite different, and therefore the encoder-decoder architecture makes sense. But soon after, researchers figured out that using only the decoder part of the Transformer works well on its own; this is called a decoder-only Transformer in the literature. The naming is a bit confusing because, if you think about it, the decoder differs from the encoder in only two things—masked self-attention and encoder-decoder attention. So, in a decoder-only Transformer, how do we have...
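One way to picture the masked self-attention part is with a causal (upper-triangular) mask: position t is blocked from attending to anything after t. The sketch below assumes we simply reuse PyTorch's TransformerEncoderLayer as the decoder-only block; the layer sizes are illustrative and not the book's configuration.

```python
import torch
import torch.nn as nn

# Causal mask: True marks positions that a timestep is NOT allowed to attend to,
# i.e., everything strictly after it.
seq_len, d_model = 48, 64
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
block = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(8, seq_len, d_model)    # an embedded history window
y = block(x, mask=causal_mask)          # each position attends only to its past
print(y.shape)                          # torch.Size([8, 48, 64])
```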

Summary

We have been storming through the world of DL in the last few chapters. We started off with the basic premise of DL, what it is, and why it became so popular. Then, we saw a few common building blocks that are typically used in time series forecasting and got our hands dirty, learning how we can put what we have learned into practice using PyTorch. Although we talked about RNNs, LSTMs, GRUs, and so on, we purposefully left out attention and Transformers because they deserved a separate chapter.

We started the chapter by learning about the generalized attention model, which helps put a framework around all the different attention schemes out there, and then went into detail on a few common ones, such as scaled dot product, additive, and general attention. Right after incorporating attention into the Seq2Seq models we were playing with in Chapter 13, Common Modeling Patterns for Time Series, we moved on to the Transformer. We went into detail on all...

References

The following is a list of the references used in this chapter:

  1. Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations. https://arxiv.org/pdf/1409.0473.pdf
  2. Thang Luong, Hieu Pham, and Christopher D. Manning (2015). Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. https://aclanthology.org/D15-1166/
  3. André F. T. Martins and Ramón Fernandez Astudillo (2016). From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. In Proceedings of the 33rd International Conference on Machine Learning. http://proceedings.mlr.press/v48/martins16.html
  4. Ben Peters, Vlad Niculae, and André F. T. Martins (2019). Sparse Sequence-to-Sequence Models. In Proceedings of the 57th Annual Meeting of the Association...

Further reading

Here are a few resources for further reading:
