Transcending the Image-Text Boundary with Stable Diffusion

The essence of a diffusion model lies in the freedom it has to invent pixels when generating an image. From that perspective, diffusion models have taken text-to-image generation to another level. Instead of trying to reproduce exactly the images they have learned, diffusion models can imagine pixels within the boundaries of the text provided.

Stability AI is a leader in Generative AI. They produced Stable Diffusion, one of the fastest-growing AI projects. From that concept, mind-blowing applications have begun to appear in every direction, including Midjourney’s application on Discord and Runway’s Gen-2, which we will encounter in Chapter 19, On the Road to Functional AGI with HuggingGPT and its Peers, and Chapter 20, Beyond Human-Designed Prompts with Generative Ideation.

The goal of this chapter is not to attempt to analyze the many Stable Diffusion architectures flowing into the...

Transcending image generation boundaries

Let’s begin with a thought experiment. Imagine an art teacher telling your class a story about visiting a wonderful house with a big garden, old trees, and beautiful flowers.

Now, the teacher gives you a strange piece of canvas covered with many dots (the pixels of a noisy image). This mysterious canvas is a potential (latent) space of hidden forms that you must find in your mental representation of the words (text) the teacher spoke. As you erase the dots and replace them with your ideas, you are dispersing them (diffusion). You obtain a small sketch of the objects you imagined. Your drawing is incomplete, and it is a smaller view of what you pictured: you represented only the main forms you saw. You downsampled your representation.

The fun now begins. You show each other your sketches. Although every drawing shows a house, not one is the same! Your teacher now provides incredible oil painting techniques to fill...

Part I: Defining text-to-image with Stable Diffusion

We will explore, at a very low level, the main Python files of the Keras version of Stable Diffusion, as shown in Figure 17.2. The complete code can be found at https://github.com/keras-team/keras-cv/tree/master/keras_cv/models/stable_diffusion:

Figure 17.2: Stable Diffusion, Keras implementation

Figure 17.2 shows the Stable Diffusion architecture of the code we will explore, which can be summed up in five phases:

  1. Text embedding.
  2. Random image creation.
  3. Stable Diffusion downsampling.
  4. Decoder upsampling.
  5. Output image.

The Keras Stable Diffusion code itself is only 500 lines long!

We will describe the role of each function, build a high-level mathematical representation of it, and identify the Python classes that execute the process.
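
As a preview of that high-level mathematical representation (the notation here is ours, not the notebook’s), the forward diffusion process gradually noises an image $x_0$:

$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$

The reverse process trains a network $\epsilon_\theta(x_t, t, c)$ to predict the noise $\epsilon$, conditioned on the text embedding $c$, and subtracts it step by step until a clean image emerges. Stable Diffusion runs this loop in a compressed latent space rather than in pixel space, which is part of what keeps the code so compact.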

We will end the analysis by running a Keras notebook that illustrates this remarkably compact code.
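
Before opening the files one by one, here is a minimal sketch, assuming the keras_cv package is installed, of how the five phases are driven end to end through its high-level API (the prompt is our own placeholder):

import keras_cv
from PIL import Image

# Phases 1-4 run inside text_to_image: the prompt is tokenized and
# embedded, a random latent image is created, the diffusion model
# progressively denoises it, and the decoder upsamples the result.
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
images = model.text_to_image(
    "a house with a big garden, old trees, and beautiful flowers",
    batch_size=1,
)

# Phase 5: the output image, a (batch, 512, 512, 3) uint8 array.
Image.fromarray(images[0]).save("output.png")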

1. Text embedding using a transformer...

Part II: Running text-to-image with Stable Diffusion

This section takes you to the forefront of Stable Diffusion with Stability AI’s API. To get started, you must sign up and obtain an API key: https://platform.stability.ai/docs/getting-started/python-sdk. Check the pricing policy before running the Stable Diffusion model.

Open Stable_Vision_Stability_AI.ipynb.

We first install the SDK:

!pip install stability-sdk

Then we clone the Stability SDK repository along with its submodules:

!git clone --recurse-submodules https://github.com/Stability-AI/stability-sdk

We define the Stability host, which is the address of the Stability API server, and our API key; the key can also be supplied when making the request:

!export STABILITY_HOST=grpc.stability.ai:443
#!export STABILITY_KEY=[YOUR_KEY]
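
Note that in a notebook, each ! command runs in its own subshell, so exported variables do not persist across cells. A common alternative, sketched here as an assumption about your environment, is to set them from Python instead:

import os

# These persist for the rest of the notebook session.
os.environ["STABILITY_HOST"] = "grpc.stability.ai:443"
# os.environ["STABILITY_KEY"] = "[YOUR_KEY]"  # keep your key out of shared notebooks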

We can now ask Stability to generate an image based on the following prompt:

!python -m stability_sdk generate "...
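
The CLI is one way in; the same request can be made from Python. The following is a minimal sketch based on the SDK’s documented gRPC client; the prompt is our own placeholder, and any extra parameters (engine, steps, dimensions) should be checked against the current Stability AI documentation:

import io
import os
from PIL import Image
from stability_sdk import client
import stability_sdk.interfaces.gooseai.generation.generation_pb2 as generation

# Connect with the key set earlier (STABILITY_KEY).
stability_api = client.StabilityInference(
    key=os.environ["STABILITY_KEY"],
    verbose=True,
)

# Placeholder prompt; replace it with your own.
answers = stability_api.generate(prompt="a house with a big garden")

# Each response may contain one or more image artifacts.
for resp in answers:
    for artifact in resp.artifacts:
        if artifact.type == generation.ARTIFACT_IMAGE:
            Image.open(io.BytesIO(artifact.binary)).save("generation.png")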

Part III: Video

Text-to-video opens new horizons for diffusion models. The models generate a sequence of n frames to produce incredible animations and videos.

Open Stable_Vision_Stability_AI_Animation.ipynb.

Text-to-video with Stability AI animation

First, make sure you have signed up on Stability AI and have your API key: https://platform.stability.ai/docs/features/animation.

We will now install the Stability SDK for animations:

!pip install "stability_sdk[anim_ui]"   # Install the Animation SDK
!git clone --recurse-submodules https://github.com/Stability-AI/stability-sdk

We import the API and initialize the host. We also set our API key:

from stability_sdk import api
STABILITY_HOST = "grpc.stability.ai:443"
STABILITY_KEY = "[ENTER YOUR KEY HERE]"  # replace with your API key
context = api.Context(STABILITY_HOST, STABILITY_KEY)

We now import the modules and configure the parameters. The following code uses the default Stability AI arguments:

from stability_sdk.animation...
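
The truncated import above comes from the SDK’s animation module. As a minimal sketch, assuming the default arguments and a placeholder prompt (verify the names against the SDK’s animation notebook), the rendering loop looks like this:

from stability_sdk.animation import AnimationArgs, Animator

args = AnimationArgs()  # default Stability AI animation arguments
animation_prompts = {0: "a house with a big garden"}  # prompt from frame 0

animator = Animator(
    api_context=context,  # the api.Context created above
    animation_prompts=animation_prompts,
    args=args,
)

# Render each frame and save it to disk.
for idx, frame in enumerate(animator.render()):
    frame.save(f"frame_{idx:05d}.png")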

Summary

Stable Diffusion has transcended the boundaries of classical AI imagery. Introducing creative freedom (“noise”) through diffusion in a latent space has opened the doors to huge generative computer vision possibilities.

We began the chapter by going through the Stable Diffusion process, first with a thought experiment and then with the elegant Keras implementation. We went through encoding a contextualized input text, introducing a “noisy” (open to creativity) image, applying diffusion to reduce (downsample) it to a lower dimension, and then upsampling it to a 512x512 image. The output was astonishing for such compact source code.

We then ran a Stability AI text-to-image notebook that also generated surprising images. We once again saw that diffusion is taking us to levels we never would have imagined, including divergent association tasks.

Stability AI also provided a text-to-animation API to transform one...

Questions

  1. Stable Diffusion requires a text encoder. (True/False)
  2. Stable Diffusion requires diffusion layers. (True/False)
  3. A Keras Stable Diffusion model reduces a noisy image to a lower dimensionality. (True/False)
  4. A Keras Stable Diffusion model upsamples an image once it is downsampled. (True/False)
  5. The final output of a diffusion model is a “noisy” image. (True/False)
  6. OpenAI CLIP cannot produce a text-to-video model yet. (True/False)
  7. Stability AI cannot convert one image to another in a video. (True/False)
  8. Meta’s TimeSformer is a scheduling algorithm, not a computer vision model. (True/False)
  9. It will never be possible to create a complete movie automatically. (True/False)
  10. There is a hardware limit to generating videos automatically beyond 10 seconds. (True/False)

Further reading

Join our community on Discord

Join our community’s Discord space for discussions with the authors and other readers:

https://www.packt.link/Transformers


Author

Denis Rothman

Denis Rothman graduated from Sorbonne University and Paris-Diderot University, designing one of the very first patented word2matrix embeddings and patented AI conversational agents. He began his career authoring one of the first AI cognitive Natural Language Processing (NLP) chatbots, applied as an automated language teacher for Moët et Chandon and other companies. He authored an AI resource optimizer for IBM and apparel producers. He then authored an Advanced Planning and Scheduling (APS) solution used worldwide.