Optimization solution 3 – enabling Xformers or using PyTorch 2.0
When we provide a text or prompt to generate an image, the encoded text embedding will be fed to the Transformer multi-header attention component of the diffusion UNet.
Inside the Transformer block, the self-attention and cross-attention headers will try to compute the attention score (via the QKV operation). This is computation-heavy and will also use a lot of memory.
The open source Xformers [2] package from Meta Research is built to optimize the process. In short, the main differences between Xformers and standard Transformers are as follows:
- Hierarchical attention mechanism: Xformers use a hierarchical attention mechanism, which consists of two layers of attention: a coarse layer and a fine layer. The coarse layer attends to the input sequence at a high level, while the fine layer attends to the input sequence at a low level. This allows Xformers to learn long-range dependencies in the input sequence...
 
                                             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
             
     
         
                 
                 
                 
                 
                 
                 
                 
                 
                