Understanding Neural Network Transformers

Not to be confused with the electrical devices of the same name, neural network transformers are the jack-of-all-trades variant of NNs. Transformers can process and capture patterns from data of any modality, including sequential data such as text and time series, as well as image, audio, and video data.

The transformer architecture was introduced in 2017 to replace RNN-based sequence-to-sequence architectures, primarily targeting the machine translation use case of converting text from one language to another. It outperformed the baseline RNN-based models and showed that the inherent inductive bias RNNs place on the sequential nature of the data is not necessary. Transformers then became the root of a family of neural network architectures, branching off into model variants capable of capturing patterns in other data modalities...

Exploring neural network transformers

Figure 6.1 provides an overview of the impact transformers have had, thanks to the plethora of transformer model variants.

Figure 6.1 – Transformers’ different modality and model branches

The transformer does not have inherent inductive bias structurally designed into its architecture. Inductive bias refers to the assumptions a learning algorithm makes about the data. This bias can be built into the model architecture or the learning process, and it helps guide the model toward learning specific patterns or structures in the data. Traditional models, such as RNNs, incorporate inductive bias through their design, for instance, by assuming that the data has a sequential structure and that the order of elements is important. Another example is CNN models, which are specifically designed for processing grid-like data, such as images, by incorporating inductive bias in the form of local connectivity and...
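
To make this contrast concrete, here is a minimal PyTorch sketch (hypothetical layer sizes, not taken from the book) that compares a convolution's built-in local connectivity with a self-attention layer, which is order-agnostic until positional information is added explicitly:

import torch
import torch.nn as nn

# A convolution hard-codes an inductive bias: each output position only sees a
# small local neighbourhood of the input (local connectivity), and the same
# weights slide across every position (weight sharing).
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

# Self-attention has no such structural bias: every token can attend to every
# other token, and shuffling the tokens merely shuffles the output.
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

tokens = torch.randn(1, 5, 16)          # (batch, sequence, embedding)
perm = torch.randperm(5)

out, _ = attn(tokens, tokens, tokens)
out_perm, _ = attn(tokens[:, perm], tokens[:, perm], tokens[:, perm])

# The output of the permuted input equals the permuted output of the original
# input, showing that attention by itself is order-agnostic; order must be
# injected separately, typically via positional encodings.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True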

Decoding the original transformer architecture holistically

Before we look into the structure of the model, let’s talk about the basic intent of transformers.

As we covered in the previous chapter, transformers are also a family of architectures that utilize the concepts of an encoder and a decoder. The encoder encodes data into what is known as the code, and the decoder decodes the code back into a format that resembles the raw, unprocessed data. The very first transformer used both the encoder and the decoder to build the entire architecture and demonstrated its application in text generation. Subsequent adaptations and improvements used either only the encoder or only the decoder to achieve different tasks. In a transformer, however, the encoder's goal is not to compress the data into a smaller, more compact representation, but mainly to serve as a feature extractor. Additionally, the decoder's goal for transformers is not...
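
As a rough illustration of this encoder-decoder split, the following sketch uses PyTorch's built-in nn.Transformer with hypothetical sizes (a real model would also embed the tokens and add positional encodings first); the encoder output acts as the set of extracted features, or code, that the decoder attends to while generating the target autoregressively:

import torch
import torch.nn as nn

# Minimal encoder-decoder transformer in the spirit of the original architecture
# (hypothetical sizes; inputs are assumed to be already embedded vectors).
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 10, 64)   # source sequence, e.g. the sentence to translate
tgt = torch.randn(1, 7, 64)    # target tokens generated so far

# The encoder turns the source into a set of feature vectors rather than a
# compressed bottleneck; the decoder attends to those features while predicting
# the target one step at a time, so it needs a causal mask on the target side.
causal_mask = model.generate_square_subsequent_mask(7)
out = model(src, tgt, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([1, 7, 64])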

Uncovering transformer improvements using only the encoder

The first type of architectural advancement based on transformers we will discuss is transformers that utilize only the encoder part of the original transformer, with the same multi-head attention layer. The encoder, rather than the decoder, is generally adopted because the masked multi-head attention layer is not needed when the next-token prediction training setup is not used. In this line of improvements, training goals and setups vary across data modalities and vary slightly between sequential improvements within the same data modality. However, one concept that stays pretty much constant across data modalities is that a semi-supervised learning method is used. In the case of transformers, this means that a form of unsupervised learning is executed first, followed by a straightforward supervised learning method. Unsupervised learning offers transformers a way to initialize their state...
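
The following is a conceptual PyTorch sketch of that two-stage recipe (hypothetical sizes and task, not the book's code): the same encoder is first trained without labels by predicting masked tokens, then reused with a small supervised head for a downstream task.

import torch
import torch.nn as nn

# Shared encoder-only backbone (BERT-style), built from standard encoder layers.
vocab_size, d_model = 1000, 64
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

# Stage 1 - unsupervised pre-training: predict randomly masked tokens from
# their context (masked language modelling), so no labels are required.
mlm_head = nn.Linear(d_model, vocab_size)
tokens = torch.randint(0, vocab_size, (8, 16))       # (batch, sequence)
features = encoder(embed(tokens))
mlm_logits = mlm_head(features)                      # per-token vocabulary scores

# Stage 2 - supervised fine-tuning: reuse the pre-trained encoder and attach a
# small task head, e.g. sentence classification on the first token's features.
cls_head = nn.Linear(d_model, 2)
cls_logits = cls_head(encoder(embed(tokens))[:, 0])  # (batch, num_classes)
print(mlm_logits.shape, cls_logits.shape)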

Uncovering transformer improvements using only the decoder

Recall that the decoder block of the transformer focuses on an autoregressive structure. For the decoder-only line of transformer models, the task of predicting tokens autoregressively remains the same. With the removal of the encoder, the architecture has to adapt its input to accept more than one sentence, similar to what BERT does. Start, end, and separator tokens are used to encode the input data sequentially. Masking is still performed, as in the original transformer, to prevent the model from attending to future tokens in the input when making predictions, and positional embeddings are likewise retained.
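
A minimal way to sketch this in PyTorch (a common pattern, assumed here rather than taken from the book) is to run a stack of self-attention layers over the input with a causal mask, so that each position can only attend to itself and earlier positions:

import torch
import torch.nn as nn

# Decoder-only sketch (hypothetical sizes): with no encoder output to attend to,
# the model is just a stack of self-attention layers plus a causal mask.
d_model, seq_len = 64, 6
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder_only = nn.TransformerEncoder(layer, num_layers=2)

# Upper-triangular boolean mask: True marks positions that must not be attended
# to, i.e. position i can only see positions <= i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(2, seq_len, d_model)   # already-embedded tokens plus positions
out = decoder_only(x, mask=causal_mask)
print(out.shape)  # torch.Size([2, 6, 64])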

Diving into the GPT model family

All of these architectural concepts were introduced by the GPT model in 2018; GPT is short for generative pre-training. As the name suggests, GPT also adopts unsupervised pre-training as the initial stage and subsequently moves into the supervised fine...

Summary

Transformers are versatile NNs capable of capturing relationships in data of any modality without explicit data-specific biases in the architecture. Rather than being an architecture that can ingest different data modalities directly, a transformer requires careful consideration of the input data structure, along with properly crafted training task objectives, to be built successfully and perform well. The benefits of pre-training still hold true even for the current SOTA architecture. Pre-training is part of a concept called transfer learning, which will be covered more extensively in the supervised and unsupervised learning chapters. Transformers can currently perform both data generation and supervised learning tasks in general, with more and more research experimenting with transformers on unexplored niche tasks and data modalities. Look forward to more deep learning innovations in the coming years, with transformers at the forefront of the advancement...
