CLIP is trained to understand the relationship between text and images by learning to place matching images and captions near each other in a shared embedding space. When evaluating a generated image, CLIP measures how closely the image aligns with the textual description provided. A higher score indicates a better match, meaning the image accurately represents the text; a lower score signals a deviation from the prompt. This gives us a quantitative measure of how faithfully the generated image adheres to the intended description.
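To make the idea concrete, here is a minimal, self-contained sketch of that shared-space comparison: the text and the image are embedded separately, and their cosine similarity serves as the alignment score. The image path is a hypothetical placeholder, and the full evaluation pipeline we build below uses the model's batched logits instead of this single-pair comparison.
# Minimal sketch: embed a prompt and an image in CLIP's shared space
# and score their alignment with cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_image.png")  # placeholder path
prompt = "a photograph of an astronaut riding a horse"

with torch.no_grad():
    text_emb = clip_model.get_text_features(
        **clip_processor(text=[prompt], return_tensors="pt", padding=True))
    image_emb = clip_model.get_image_features(
        **clip_processor(images=[image], return_tensors="pt"))

# Higher cosine similarity means the image matches the prompt more closely
score = torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
print(f"CLIP similarity: {score:.3f}")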
Again, we will import the necessary libraries:
from typing import List, Tuple
from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel
import torch
# display() renders images inline; outside a notebook it needs this import
from IPython.display import display
We begin by defining the model checkpoint and loading the CLIP model and processor:
# Constants
CLIP_REPO = "openai/clip-vit-base-patch32"
def load_model_and_processor(
    model_name: str
) -> Tuple[CLIPModel, CLIPProcessor]:
    """
    Loads the CLIP model and processor.
    """
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    return model, processor
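As an optional usage sketch, the helper can be combined with a device check so that inference runs on a GPU when one is available; the rest of this walkthrough keeps everything on the CPU, so this step is not required:
# Optional sketch: put the model in evaluation mode and move it to a GPU if present
model, processor = load_model_and_processor(CLIP_REPO)
model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# If you do this, remember to move the processed inputs to the same device later,
# e.g. inputs = {k: v.to(device) for k, v in inputs.items()}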
Next, we define a processing function that prepares the textual prompts and images, ensuring that they are in the correct format for CLIP inference:
def process_inputs(
    processor: CLIPProcessor, prompts: List[str],
    images: List[Image.Image]) -> dict:
    """
    Processes the inputs using the CLIP processor.
    """
    return processor(text=prompts, images=images,
                     return_tensors="pt", padding=True)
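To get a feel for what this returns, here is a small sketch (runnable once the processor has been loaded) with two illustrative prompts and a blank placeholder image; the resulting batch contains the tokenized text and the preprocessed pixel values:
# Sketch: inspecting the processed batch (prompts and image are placeholders)
sample_prompts = ["a red sports car", "a bowl of fruit"]
sample_images = [Image.new("RGB", (512, 512))]
batch = process_inputs(processor, sample_prompts, sample_images)
print(batch.keys())                 # input_ids, attention_mask, pixel_values
print(batch["input_ids"].shape)     # (number of prompts, sequence length)
print(batch["pixel_values"].shape)  # (number of images, 3, 224, 224)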
In this step, we run the evaluation by feeding the images and textual prompts into the CLIP model in a single batched forward pass. The model computes a similarity score, known as a logit, for each image-text pair; these scores indicate how well each image corresponds to each prompt. To interpret them more intuitively, we convert the logits into probabilities with a temperature-scaled softmax, which indicate the likelihood that an image aligns with each of the given prompts:
def get_probabilities(
    model: CLIPModel, inputs: dict) -> torch.Tensor:
    """
    Computes the probabilities using the CLIP model.
    """
    outputs = model(**inputs)
    logits = outputs.logits_per_image
    # Define temperature - higher temperature will make the distribution more uniform.
    T = 10
    # Apply temperature to the logits
    temp_adjusted_logits = logits / T
    probs = torch.nn.functional.softmax(
        temp_adjusted_logits, dim=1)
    return probs
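To see what the temperature does, the following standalone sketch compares the softmax of some example logits at T = 1 and at T = 10; the higher temperature keeps the same ranking but spreads the probability mass more evenly, which makes the scores easier to compare across images:
# Sketch: effect of temperature scaling on the softmax of example logits
example_logits = torch.tensor([[30.0, 25.0, 20.0]])
for T in (1, 10):
    probs = torch.nn.functional.softmax(example_logits / T, dim=1)
    print(f"T={T}: {probs}")
# T=1 puts nearly all of the probability mass on the highest logit;
# T=10 preserves the ordering but softens the gaps between scores.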
Lastly, we display the images along with their scores, visually representing how well each image adheres to the provided prompts:
def display_images_with_scores(
    images: List[Image.Image], scores: torch.Tensor) -> None:
    """
    Displays the images alongside their scores.
    """
    # Set print options for readability
    torch.set_printoptions(precision=2, sci_mode=False)
    for i, image in enumerate(images):
        print(f"Image {i + 1}:")
        display(image)
        print(f"Scores: {scores[i, :]}")
        print()
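The execution step below assumes that prompts and images already exist from the image-generation step earlier in the chapter. For a standalone run, they could be assembled along these lines (the prompts and file names here are purely illustrative placeholders):
# Sketch: assembling inputs for a standalone run (paths and prompts are placeholders)
prompts = [
    "a photograph of an astronaut riding a horse",
    "a watercolor painting of a lighthouse at sunset",
]
images = [
    Image.open("generated_astronaut.png"),
    Image.open("generated_lighthouse.png"),
]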
With everything detailed, let’s execute the pipeline as follows:
# Load CLIP model
model, processor = load_model_and_processor(CLIP_REPO)
# Process image and text inputs together
inputs = process_inputs(processor, prompts, images)
# Extract the probabilities
probs = get_probabilities(model, inputs)
# Display each image with corresponding scores
display_images_with_scores(images, probs)
We now have scores for each of our synthetic images that quantify their fidelity to the text provided. These scores come from the CLIP model, which maps both image and text data into one shared mathematical representation (a common embedding space) in which their similarity can be measured.
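If a single verdict per image is needed, for instance to flag low-fidelity generations automatically, the probability matrix can be reduced further; the sketch below picks the best-matching prompt for each image (the 0.5 threshold is an arbitrary illustration, not a recommendation):
# Sketch: reducing the probability matrix to one verdict per image
best_prompt_idx = probs.argmax(dim=1)        # index of the best-matching prompt
best_prompt_prob = probs.max(dim=1).values   # probability assigned to that prompt
for i in range(len(images)):
    p = best_prompt_prob[i].item()
    flag = "looks faithful" if p > 0.5 else "review manually"
    print(f"Image {i + 1}: best prompt = {best_prompt_idx[i].item()}, "
          f"probability = {p:.2f} ({flag})")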