Text-to-Image Generation Via Energy-Based CLIP

Authors: Roy Ganz, Michael Elad

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We train different variants of CLIP using Algorithm 1, including ViT-B/32 and ConvNext in base, large, and XXL configurations, on the extensive image-caption DataComp dataset (Gadre et al., 2024) for 20,000 steps. Throughout the training process, we keep the text encoder frozen and update solely the vision encoder. Implementation and training details are provided in the supplementary materials. To analyze the performance of CLIP-JEM, we first evaluate it in the text-to-image generation setting (Section 4.1). Next, we demonstrate its effectiveness as a guiding model (Section 4.2). Lastly, we show that CLIP-JEM can serve as an improved evaluation metric compared to the vanilla CLIP, attributed to its robustness and awareness of perceptual quality.
Researcher Affiliation Academia Roy Ganz (Electrical Engineering Department, Technion), Michael Elad (Computer Science Department, Technion)
Pseudocode Yes CLIP-JEM's objective is the combination of contrastive adversarial and contrastive energy losses, and the overall training protocol is described in the supplementary materials (Algorithm 1). ...

Algorithm 1: CLIP-JEM Training. Given CLIP image and text encoders f_I^θ(·) and f_T^θ(·), image-text dataset D, adversarial budget ε, adversarial and energy step-sizes α1, α2, energy loss coefficient γ, and number of adversarial and generation iterations T_adv, T_JEM:

while not converged do
    Sample (I, T) from dataset D
    /* Contrastive adversarial loss */
    δ_0 ← 0
    for t from 0 to T_adv do
        δ_{t+1} = Π_ε(δ_t + α1 · ∇_δ ClipLoss(f_I^θ(I + δ_t), f_T^θ(T)))
    end
    I_adv = I + δ_{T_adv}
    L_adv = ClipLoss(f_I^θ(I_adv), f_T^θ(T))
    /* Contrastive energy loss */
    Sample initial negative sample Ĩ
    Optimizer ← AdamW(params = Ĩ, lr = α2)
    for t from 0 to T_JEM do
        L_JEM = ClipLoss(f_I^θ(Ĩ + βn), f_T^θ(T))   /* n ~ N(0, I), β is a small scalar */
        Calculate ∂L_JEM/∂Ĩ and perform an optimizer step
    end
    L_JEM = ClipLoss(f_I^θ(Concat(I, Ĩ)), f_T^θ(T))
    /* Update the vision encoder */
    L = L_adv + γ · L_JEM
    Calculate ∂L/∂θ and update CLIP image encoder f_I^θ(·)
end
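The training loop above can be sketched in PyTorch. This is a toy illustration, not the authors' implementation: the linear "encoders", the L∞ clamp as the Π_ε projection, the duplicated captions in the contrastive energy batch, and all hyperparameter values are our simplifying assumptions.

```python
# Toy sketch of Algorithm 1 (CLIP-JEM training). Linear layers stand in for
# the CLIP vision/text encoders; hyperparameters are illustrative only.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 16                                   # toy embedding dimension
img_enc = torch.nn.Linear(D, D)          # stand-in for vision encoder f_I
txt_enc = torch.nn.Linear(D, D)          # stand-in for frozen text encoder f_T
for p in txt_enc.parameters():
    p.requires_grad_(False)

def clip_loss(img_feat, txt_feat):
    # Symmetric InfoNCE over the batch, as in CLIP.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / 0.07
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Toy hyperparameters (the paper uses eps=3.0, T_adv=5, T_JEM=50, gamma=0.1, ...).
eps, alpha1, alpha2, gamma = 0.5, 0.1, 0.05, 0.1
T_adv, T_jem, beta = 3, 5, 0.01

opt_theta = torch.optim.AdamW(img_enc.parameters(), lr=1e-4)
I = torch.randn(4, D)                    # batch of "images" (flattened, toy)
T = torch.randn(4, D)                    # matching "captions"
txt_feat = txt_enc(T)

# --- Contrastive adversarial loss: gradient ascent on the CLIP loss ---
delta = torch.zeros_like(I, requires_grad=True)
for _ in range(T_adv):
    loss = clip_loss(img_enc(I + delta), txt_feat)
    grad, = torch.autograd.grad(loss, delta)
    # Pi_eps taken here as an L-infinity clamp (our assumption).
    delta = (delta + alpha1 * grad).clamp(-eps, eps).detach().requires_grad_(True)
L_adv = clip_loss(img_enc(I + delta.detach()), txt_feat)

# --- Contrastive energy loss: synthesize negatives by descending the loss ---
I_neg = torch.randn(4, D, requires_grad=True)
opt_gen = torch.optim.AdamW([I_neg], lr=alpha2)
for _ in range(T_jem):
    opt_gen.zero_grad()
    noise = beta * torch.randn_like(I_neg)       # the small-noise term beta*n
    clip_loss(img_enc(I_neg + noise), txt_feat).backward()
    opt_gen.step()
# Real and generated images compete in one contrastive batch
# (captions duplicated so the batch stays square -- a simplification).
L_jem = clip_loss(img_enc(torch.cat([I, I_neg.detach()], dim=0)),
                  txt_enc(torch.cat([T, T], dim=0)))

# --- Update only the vision encoder ---
opt_theta.zero_grad()
(L_adv + gamma * L_jem).backward()
opt_theta.step()
print(float(L_adv), float(L_jem))
```

Note that the generation loop differentiates with respect to the negative samples (an AdamW step on Ĩ), while only the final combined loss updates the vision encoder's parameters, matching the "frozen text encoder, vision-only update" setup.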
Open Source Code No We aim to make our code and pretrained models publicly available upon acceptance.
Open Datasets Yes We train different variants of CLIP using Algorithm 1, including ViT-B/32 and ConvNext in base, large, and XXL configurations, on the extensive image-caption DataComp dataset (Gadre et al., 2024) for 20,000 steps. ... We use CLIP-JEM to generate 30,000 samples from the MS-COCO dataset (Lin et al., 2015) and report the results in FID and CLIPSIM using ViT-B/32 in Table 1.
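CLIPSIM, mentioned above, is the mean cosine similarity between paired image and caption CLIP embeddings. A minimal sketch of the metric (random features stand in for real ViT-B/32 embeddings; the function name is ours):

```python
import numpy as np

def clipsim(img_embs: np.ndarray, txt_embs: np.ndarray) -> float:
    """Mean cosine similarity between paired image/caption CLIP embeddings."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    return float((img * txt).sum(axis=1).mean())

# Toy usage: identical embeddings give a CLIPSIM of 1.0.
rng = np.random.default_rng(0)
feats = rng.standard_normal((30, 512))  # stand-in for ViT-B/32 features
print(clipsim(feats, feats))
```

In practice the embeddings would come from the frozen CLIP ViT-B/32 encoders applied to the 30,000 generated samples and their MS-COCO captions.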
Dataset Splits Yes Following the CompBench procedure, we generate 10 samples per prompt and average the results on the validation sets of each category, using the same evaluation metrics as in the original paper (B-VQA, UniDet, CLIP, and 3-in-1 for the attribute binding, spatial, non-spatial, and complex categories).
Hardware Specification Yes To train our models, we use 8 A40 GPUs. ... We report the time of our generation process using a batch size of 1 on an Nvidia A40 GPU in Table 5.
Software Dependencies No We implement our method upon the OpenCLIP codebase. ... The text mentions a codebase but does not provide specific version numbers for any software components.
Experiment Setup Yes Table 4: Implementation details. We provide the training hyperparameters of CLIP-JEM for the different architectures (BS disc. and BS gen. stand for the discriminative and generative batch sizes, respectively):

Arch.          BS disc.   BS gen.   LR
ViT-B/32       256        32        1e-4
ConvNext-B     128        32        2e-5
ConvNext-L     128        16        2e-5
ConvNext-XXL   32         8         2e-6

Shared across architectures: Sched. Cosine, Warmup 200, Adv. ε 3.0, T_adv 5, T_JEM 50, γ 0.1, α1 1.5, α2 0.025.
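The per-architecture and shared hyperparameters above can be collected into a single config lookup. A small sketch (key names are our own convention; values are transcribed from Table 4 and the 20,000-step training description):

```python
# CLIP-JEM training hyperparameters, restructured as a config dict.
# Values from Table 4 and the training description; key names are ours.
SHARED = {
    "steps": 20_000,    # training steps
    "sched": "cosine",  # LR schedule
    "warmup": 200,      # warmup steps
    "adv_eps": 3.0,     # adversarial budget epsilon
    "t_adv": 5,         # adversarial iterations
    "t_jem": 50,        # generation iterations
    "gamma": 0.1,       # energy loss coefficient
    "alpha1": 1.5,      # adversarial step-size
    "alpha2": 0.025,    # energy (generation) step-size
}

PER_ARCH = {
    "ViT-B/32":     {"bs_disc": 256, "bs_gen": 32, "lr": 1e-4},
    "ConvNext-B":   {"bs_disc": 128, "bs_gen": 32, "lr": 2e-5},
    "ConvNext-L":   {"bs_disc": 128, "bs_gen": 16, "lr": 2e-5},
    "ConvNext-XXL": {"bs_disc": 32,  "bs_gen": 8,  "lr": 2e-6},
}

def config(arch: str) -> dict:
    """Merge shared and per-architecture hyperparameters."""
    return {**SHARED, **PER_ARCH[arch]}

print(config("ViT-B/32"))
```

Note that the larger ConvNext variants trade batch size for model capacity (discriminative batch shrinks from 256 down to 32) with a correspondingly smaller learning rate.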