Beyond Text: Vision Transformers in the Dawn of Revolutionary AI

Up to now, we have examined variations of the Original Transformer model with encoder and decoder layers. We have also explored other models with encoder-only or decoder-only stacks of layers. The size of the layers and the number of parameters have also increased. However, the fundamental architecture of the Transformer retains its original structure, with identical layers and the parallel computation of the attention heads.

In this chapter, we will explore innovative transformer models that respect the basic structure of the Original Transformer but make some significant changes. Scores of transformer models will appear, much like the many possibilities a box of LEGO© pieces offers: you can assemble those pieces in hundreds of ways. Transformer model sublayers and layers are the LEGO© pieces of advanced AI.

We will discover powerful computer vision transformers like ViT, CLIP, DALL-E, and GPT-4V. We can add...

From task-agnostic models to multimodal vision transformers

Foundation Models, as we saw in Chapter 1, What Are Transformers?, have two distinct and unique properties:

  • Emergence: Transformer models that qualify as Foundation Models can perform tasks they were not explicitly trained for. They are large models trained on supercomputers and, unlike many other models, they are not trained to learn specific tasks. Instead, Foundation Models learn how to understand sequences.
  • Homogenization: The same model can be used across many domains with the same fundamental architecture. Foundation Models can learn new skills through data faster and better than any other model.

OpenAI ChatGPT models (GPT-3 and GPT-4), Google PaLM, and Google BERT (only the BERT models trained by Google) are task-agnostic Foundation Models. These task-agnostic models lead directly to the ViT, CLIP, and DALL-E models.

The level of abstraction of transformer models leads to multimodal neurons. Multimodal neurons can process...

ViT – Vision Transformer

Dosovitskiy et al. (2021) summed up the essence of the vision transformer architecture they designed in the title of their paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

The paper’s title sums the process up: an image can be split into 16×16-pixel patches that the transformer then processes as if they were words. For example, a 224×224 image yields 14×14 = 196 patches, each of which is treated as a token.

Let’s first go through a high-level view of the architecture of ViT before looking into the code.

The basic architecture of ViT

ViT can process an image as patches of words. In this section, we will go through the process in three steps:

  1. Splitting the image into patches with a feature extractor.
  2. Building a vocabulary of image patches with the feature extractor.
  3. Feeding the patches to an encoder-only transformer model. The model embeds the input and produces raw logits that the pipeline functions convert into the final probabilities (see the sketch after this list).
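
Below is a minimal sketch of these three steps, assuming the Hugging Face transformers library, the Pillow imaging library, and the public google/vit-base-patch16-224 checkpoint; these choices are assumptions made for illustration, not necessarily the exact notebook used in this chapter:

```python
# A minimal sketch of the three steps above (assumed libraries: transformers,
# torch, Pillow; assumed checkpoint: google/vit-base-patch16-224).
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

image = Image.open("cat.jpg")  # any local RGB image

# Steps 1-2: the image processor (feature extractor) resizes and normalizes
# the image; the model's embedding layer then splits it into 16x16 patches.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = processor(images=image, return_tensors="pt")

# Step 3: the encoder-only model returns raw logits over its label set.
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
with torch.no_grad():
    logits = model(**inputs).logits

# Convert the logits into probabilities and read off the top class.
probs = logits.softmax(dim=-1)
predicted = probs.argmax(-1).item()
print(model.config.id2label[predicted], probs[0, predicted].item())
```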

The first step is to SPLIT the...

CLIP

Contrastive Language-Image Pre-Training (CLIP) is a multimodal transformer that can be used for image classification. CLIP’s process can be summed up as follows:

  • A feature extractor, like ViT, produces image tokens.
  • Text is also an input in the form of tokens, as in ViT.
  • The attention layer learns the relationships between the image tokens and the text tokens with some form of “cross-attention.”
  • The output is also raw logits, as in ViT.

We will first look into the basic architecture of CLIP before running CLIP in code.

The basic architecture of CLIP

The model is contrastive: it is trained to learn how images and captions fit together through their differences and similarities. Images and captions find their way toward each other through joint (text, image) pretraining. After pretraining, CLIP can learn new tasks.
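
The following code is a minimal sketch of how this image-caption matching looks at inference time, scoring one image against a few candidate captions. It assumes the Hugging Face transformers library, Pillow, and the public openai/clip-vit-base-patch32 checkpoint; these are illustrative choices, not necessarily the chapter’s notebook:

```python
# A minimal sketch of CLIP image-text matching (assumed libraries:
# transformers, torch, Pillow; assumed checkpoint: openai/clip-vit-base-patch32).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # any local RGB image
captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode the image and the candidate captions in one pass.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the raw image-text similarity scores;
# softmax turns them into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```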

CLIP is transferable because it can learn new visual concepts, like GPT models, such as action recognition in video...

DALL-E 2 and DALL-E 3

DALL-E, like CLIP, is a multimodal model. CLIP processes text-image pairs, whereas DALL-E processes the text and image tokens differently. DALL-E 1 takes a single stream of 1,280 text and image tokens as input: 256 tokens for the text and 1,024 tokens for the image.

DALL-E was named after Salvador Dalí and Pixar’s WALL-E. To use DALL-E, you enter a text prompt, and it produces an image. However, DALL-E must first learn how to generate images from text.

This transformer generates images from text descriptions using a dataset of text-image pairs.
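
As a minimal sketch of that prompt-to-image usage, the following call relies on the openai Python client (v1+) and the dall-e-3 model; the client, model name, and parameters are assumptions made for illustration and require an OpenAI API key:

```python
# A minimal sketch of generating an image from a text prompt (assumed library:
# openai v1+; assumed model: dall-e-3; requires an OPENAI_API_KEY variable).
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A cat reading a book about transformers, in watercolor",
    size="1024x1024",
    n=1,
)

# The API returns a URL pointing to the generated image.
print(response.data[0].url)
```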

We will go through the basic architecture of DALL-E to see how the model works.

The basic architecture of DALL-E

Unlike CLIP, DALL-E concatenates up to 256 BPE-encoded text tokens with 32×32 = 1,024 image tokens, as shown in Figure 16.11:


Figure 16.11: DALL-E concatenates text and image input

Figure 16.11 shows that, this time, our cat image is concatenated with...

GPT-4V, DALL-E 3, and divergent semantic association

How many humans are creative with a “Big C”? How many humans are outperformed by GPT-4 in “little C” everyday creativity? Chen et al. (2023) produced a groundbreaking experiment in their paper. The full title is worth meditating on: Probing the “Creativity” of Large Language Models: Can Models Produce Divergent Semantic Association?

Divergent semantic association is the ability of a creative mind to find symbols (words, images, and sounds) that diverge from a standard; this capacity for divergence is one way of defining creativity. It is a kind of creativity that GPT-4 can now perform.
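
One simple way to probe divergent semantic association through an API is to ask a chat model for a set of maximally unrelated words, loosely following the divergent association idea. The sketch below assumes the openai Python client (v1+) and the gpt-4o model name; both are illustrative assumptions, not the chapter’s exact notebook:

```python
# A minimal sketch of probing divergent semantic association (assumed library:
# openai v1+; assumed model: gpt-4o; requires an OPENAI_API_KEY variable).
from openai import OpenAI

client = OpenAI()

prompt = (
    "Give me 10 single nouns that are as semantically unrelated "
    "to one another as possible, as a comma-separated list."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# The further apart the returned words are in meaning, the more divergent
# the model's semantic associations.
print(response.choices[0].message.content)
```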

“Big C” creativity, such as that of Mozart, Einstein, or Picasso, is rare among humans: a one-in-a-million phenomenon.

However, “little C” creativity is relatively common when we adapt to new situations or produce new little artifacts. Chen et al. (2023) collected...

Summary

Natural language transformers have evolved into Foundation Models in a short time. Generative AI has reached new levels with ViT, CLIP, DALL-E, and GPT-4V.

We first explored the architecture of ViT, which breaks images down into words. We discovered that there is more than one way to implement models in real-world ML. Understanding the different approaches contributes to creating a personal toolbox to solve problems when implementing ML projects.

Then, we explored CLIP, which can associate words and images. Next, we looked into the architecture of DALL-E, going down to the tensor level to look under the hood of some of these innovative models. We then implemented the DALL-E 2 and DALL-E 3 APIs.

Finally, we built a GPT-4V notebook with DALL-E 3 images, implementing an example of divergent semantic association.

The paradigm shift resides in the tremendous resources few organizations have to train Generative AI models on petabytes...

Questions

  1. DALL-E 2 classifies images. (True/False)
  2. ViT classifies images. (True/False)
  3. BERT was initially designed to generate images. (True/False)
  4. CLIP is an image-clipping application. (True/False)
  5. BERT uses CLIP to identify images. (True/False)
  6. DALL-E 3 cannot be accessed with an API. (True/False)
  7. Gradio is a transformer model. (True/False)
  8. ViT can classify images that are not on its list of labels. (True/False)
  9. ViT requires a prompt to respond. (True/False)
  10. GPT-4V will most probably evolve into a more multimodal system. (True/False)

References

Further Reading

Join our community on Discord

Join our community’s Discord space for discussions with the authors and other readers:

https://www.packt.link/Transformers

