Beyond Text: Vision Transformers in the Dawn of Revolutionary AI

Up to now, we have examined variations of the Original Transformer model with encoder and decoder layers. We have also explored other models with encoder-only or decoder-only stacks of layers. The size of the layers and the number of parameters have also increased. However, the fundamental architecture of the Transformer retains its original structure, with identical layers and the parallel computation of the attention heads.

In this chapter, we will explore innovative transformer models that respect the basic structure of the Original Transformer but make some significant changes. Scores of transformer models will appear, much like the many possibilities a box of LEGO© pieces offers: you can assemble those pieces in hundreds of ways. Transformer model sublayers and layers are the LEGO© pieces of advanced AI.

We will discover powerful computer vision transformers like ViT, CLIP, DALL-E, and GPT-4V. We can add...

From task-agnostic models to multimodal vision transformers

Foundation Models, as we saw in Chapter 1, What Are Transformers?, have two distinct and unique properties:

  • Emergence: Transformer models that qualify as Foundation Models can perform tasks they were not explicitly trained for. They are large models trained on supercomputers and, unlike many other models, they are not trained to learn specific tasks. Instead, Foundation Models learn how to understand sequences.
  • Homogenization: The same model can be used across many domains with the same fundamental architecture. Foundation Models can learn new skills through data faster and better than any other model.

OpenAI ChatGPT models (GPT-3 and GPT-4), Google PaLM, and Google BERT (only the BERT models trained by Google) are task-agnostic Foundation Models. These task-agnostic models lead directly to the ViT, CLIP, and DALL-E models.

The level of abstraction of transformer models leads to multimodal neurons. Multimodal neurons can process...

ViT – Vision Transformer

Dosovitskiy et al. (2021) summed up the essence of the vision transformer architecture they designed in the title of their paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

The paper’s title sums the process up: an image can be split into 16×16-pixel patches that the transformer then processes as if they were words. For example, a 224×224 image yields 14×14 = 196 patches, each of which is treated as a token.

Let’s first go through a high-level view of the architecture of ViT before looking into the code.

The basic architecture of ViT

ViT can process an image as patches of words. In this section, we will go through the process in three steps:

  1. Splitting the image into patches with a feature extractor.
  2. Building a vocabulary of image patches with the feature extractor.
  3. Feeding the patches to an encoder-only transformer model. The model embeds the input and produces raw logits that the pipeline functions convert into the final probabilities (see the sketch after this list).
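
Below is a minimal sketch of these three steps, assuming the Hugging Face transformers library, the Pillow imaging library, and the public google/vit-base-patch16-224 checkpoint; these choices are assumptions made for illustration, not necessarily the exact notebook used in this chapter:

```python
# A minimal sketch of the three steps above (assumed libraries: transformers,
# torch, Pillow; assumed checkpoint: google/vit-base-patch16-224).
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

image = Image.open("cat.jpg")  # any local RGB image

# Steps 1-2: the image processor (feature extractor) resizes and normalizes
# the image; the model's embedding layer then splits it into 16x16 patches.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = processor(images=image, return_tensors="pt")

# Step 3: the encoder-only model returns raw logits over its label set.
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
with torch.no_grad():
    logits = model(**inputs).logits

# Convert the logits into probabilities and read off the top class.
probs = logits.softmax(dim=-1)
predicted = probs.argmax(-1).item()
print(model.config.id2label[predicted], probs[0, predicted].item())
```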

The first step is to SPLIT the...

CLIP

Contrastive Language-Image Pre-Training (CLIP) is a multimodal transformer that can be used for image classification. CLIP’s process can be summed up as follows:

  • A feature extractor, like ViT, produces image tokens.
  • Text is also an input in the form of tokens, as in ViT.
  • The attention layer learns the relationships between the image tokens and the text tokens with some form of “cross-attention.”
  • The output is also raw logits, as in ViT.

We will first look into the basic architecture of CLIP before running CLIP in code.

The basic architecture of CLIP

The model is contrastive: it is trained to learn how images and captions fit together through their differences and similarities. Images and captions find their way toward each other through joint (text, image) pretraining. After pretraining, CLIP can learn new tasks.
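
The following code is a minimal sketch of how this image-caption matching looks at inference time, scoring one image against a few candidate captions. It assumes the Hugging Face transformers library, Pillow, and the public openai/clip-vit-base-patch32 checkpoint; these are illustrative choices, not necessarily the chapter’s notebook:

```python
# A minimal sketch of CLIP image-text matching (assumed libraries:
# transformers, torch, Pillow; assumed checkpoint: openai/clip-vit-base-patch32).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # any local RGB image
captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode the image and the candidate captions in one pass.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the raw image-text similarity scores;
# softmax turns them into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```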

CLIP is transferable because it can learn new visual concepts, like GPT models, such as action recognition in video...

DALL-E 2 and DALL-E 3

DALL-E, like CLIP, is a multimodal model. CLIP processes text-image pairs, whereas DALL-E processes the text and image tokens differently. DALL-E 1 takes a single stream of 1,280 text and image tokens as input: 256 tokens for the text and 1,024 tokens for the image.

DALL-E was named after Salvador Dalí and Pixar’s WALL-E. To use DALL-E, you enter a text prompt, and it produces an image. However, DALL-E must first learn how to generate images from text.

This transformer generates images from text descriptions using a dataset of text-image pairs.
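
As a minimal sketch of that prompt-to-image usage, the following call relies on the openai Python client (v1+) and the dall-e-3 model; the client, model name, and parameters are assumptions made for illustration and require an OpenAI API key:

```python
# A minimal sketch of generating an image from a text prompt (assumed library:
# openai v1+; assumed model: dall-e-3; requires an OPENAI_API_KEY variable).
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A cat reading a book about transformers, in watercolor",
    size="1024x1024",
    n=1,
)

# The API returns a URL pointing to the generated image.
print(response.data[0].url)
```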

We will go through the basic architecture of DALL-E to see how the model works.

The basic architecture of DALL-E

Unlike CLIP, DALL-E concatenates up to 256 BPE-encoded text tokens with 32×32 = 1,024 image tokens, as shown in Figure 16.11:


Figure 16.11: DALL-E concatenates text and image input

Figure 16.11 shows that, this time, our cat image is concatenated with...

GPT-4V, DALL-E 3, and divergent semantic association

How many humans are creative with a “Big C”? How many humans are outperformed by GPT-4 in “little C” everyday creativity? Chen et al. (2023) produced a groundbreaking experiment in their paper. The full title is worth meditating on: Probing the “Creativity” of Large Language Models: Can Models Produce Divergent Semantic Association?

Divergent semantic association is the ability of a creative mind to find symbols (words, images, and sounds) that diverge from a standard; this capacity for divergence is one way of defining creativity. It is a kind of creativity that GPT-4 can now perform.
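
One simple way to probe divergent semantic association through an API is to ask a chat model for a set of maximally unrelated words, loosely following the divergent association idea. The sketch below assumes the openai Python client (v1+) and the gpt-4o model name; both are illustrative assumptions, not the chapter’s exact notebook:

```python
# A minimal sketch of probing divergent semantic association (assumed library:
# openai v1+; assumed model: gpt-4o; requires an OPENAI_API_KEY variable).
from openai import OpenAI

client = OpenAI()

prompt = (
    "Give me 10 single nouns that are as semantically unrelated "
    "to one another as possible, as a comma-separated list."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# The further apart the returned words are in meaning, the more divergent
# the model's semantic associations.
print(response.choices[0].message.content)
```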

“Big C” creativity, such as that of Mozart, Einstein, or Picasso, is rare among humans: a one-in-a-million phenomenon.

However, “little C” creativity is relatively common when we adapt to new situations or produce new little artifacts. Chen et al. (2023) collected...

Summary

Natural language transformers have evolved into Foundation Models in a short time. Generative AI has reached new levels with ViT, CLIP, DALL-E, and GPT-4V.

We first explored the architecture of ViT, which breaks images down into words. We discovered that there is more than one way to implement models in real-world ML. Understanding the different approaches contributes to creating a personal toolbox to solve problems when implementing ML projects.

Then, we explored CLIP, which can associate words and images. Next, we looked into the architecture of DALL-E, going down to the tensor level to look under the hood of some of these innovative models. We then implemented the DALL-E 2 and DALL-E 3 APIs.

Finally, we built a GPT-4V notebook with DALL-E 3 images, implementing an example of divergent semantic association.

The paradigm shift resides in the tremendous resources few organizations have to train Generative AI models on petabytes...

Questions

  1. DALL-E 2 classifies images. (True/False)
  2. ViT classifies images. (True/False)
  3. BERT was initially designed to generate images. (True/False)
  4. CLIP is an image-clipping application. (True/False)
  5. BERT uses CLIP to identify images. (True/False)
  6. DALL-E 3 cannot be accessed with an API. (True/False)
  7. Gradio is a transformer model. (True/False)
  8. ViT can classify images that are not on its list of labels. (True/False)
  9. ViT requires a prompt to respond. (True/False)
  10. GPT-4V will most probably evolve into a more multimodal system. (True/False)

References

Further Reading

Join our community on Discord

Join our community’s Discord space for discussions with the authors and other readers:

https://www.packt.link/Transformers

