Dual Caption Preference Optimization for Diffusion Models

Authors: Amir Saeidi, Yiran Lawrence Luo, Agneet Chatterjee, Shamanthak Hegde, Bimsara Pathiraja, Yezhou Yang, Chitta Baral

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFTChosen, Diffusion-DPO, and MaPO across multiple metrics, including PickScore, HPSv2.1, GenEval, CLIPScore, and ImageReward, fine-tuned with SD 2.1 as the backbone.
Researcher Affiliation | Academia | Amir Saeidi EMAIL, School of Computing and Augmented Intelligence, Arizona State University
Pseudocode | No | The paper includes formulas and proofs but no clearly labeled 'Pseudocode' or 'Algorithm' block for the described methodology. Figure 12 provides a 'Sample source code' snippet, but it is actual code, not pseudocode for the main algorithm.
Open Source Code | No | The paper neither states that source code for the described methodology is released nor provides a direct link to a code repository.
Open Datasets | Yes | We also construct Pick-Double Caption, a modified version of Pick-a-Pic v2 with separate captions for each image, and propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. ... To optimize a diffusion model using DCPO, a dataset D = {z^w, z^l, x_0^w, x_0^l} is required, where captions are paired with the images. However, the current preference dataset only contains prompts c and image pairs without captions. To address this, we propose three methods for generating captions z and introduce a new high-quality dataset, Pick-Double Caption, which provides specific captions for each image, based on Pick-a-Pic v2 (Kirstain et al., 2023).
Dataset Splits | No | The paper mentions sampling 20,000 instances from Pick-a-Pic v2 and evaluating on 2,500 unique prompts from Pick-a-Pic v2 and 3,200 prompts from HPSv2. However, it does not provide specific training, validation, and test splits (e.g., percentages or exact counts for each split) for its own Pick-Double Caption dataset.
Hardware Specification | Yes | Captioning 20,000 images using LLaVA requires less than 12 hours on a single A100 80 GB GPU. ... Fine-tuning runs and hyperparameter searches were conducted under a unified SD 2.1 setup using eight A100 80 GB GPUs.
Software Dependencies | No | The paper mentions models and frameworks such as DIPPER (T5-XXL), LLaVA (Liu et al., 2024a), Emu2 (Sun et al., 2024), and Stable Diffusion (SD) 2.1. Figure 12 shows a code snippet using 'transformers', 'T5Tokenizer', and 'T5ForConditionalGeneration'. However, it does not provide version numbers for these software components or for any other libraries required for full reproducibility.
Experiment Setup | Yes | We fine-tune SD 2.1 for 2,000 training steps with each method. The strongest configuration uses batch size 64 with learning rate 10^-9. ... We train SD 2.1 on triples {c, x^w, x^l} with batch size fixed to 128 and learning rate fixed to 10^-8. We sweep the regularization coefficient β over {500, 1000, 2000, 2500, 5000}.
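The Open Datasets row describes a dataset format D = {z^w, z^l, x_0^w, x_0^l} in which each image keeps its own caption rather than sharing the single prompt c. A minimal sketch of such a record is below; the class, field, and strategy names are illustrative, not taken from the paper's code:

```python
from dataclasses import dataclass

@dataclass
class DoubleCaptionExample:
    """One Pick-Double Caption record: unlike Pick-a-Pic v2, where the
    preferred and dispreferred images share the prompt c, each image
    here carries its own generated caption."""
    prompt: str     # original Pick-a-Pic v2 prompt c
    caption_w: str  # z^w: caption generated for the preferred image
    caption_l: str  # z^l: caption generated for the dispreferred image
    image_w: str    # x_0^w: identifier of the preferred image
    image_l: str    # x_0^l: identifier of the dispreferred image

# The paper's three caption-generation strategies, named here for clarity.
STRATEGIES = ("captioning", "perturbation", "hybrid")
```

Separate captions per image are the point of the dataset: a preference pair where both images are described by the same prompt conflates "better image" with "better match to the caption", which the dual-caption format disentangles.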
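For context on the β sweep in the Experiment Setup row: Diffusion-DPO, the baseline DCPO is built on and compared against, optimizes a pairwise preference loss over denoising errors. The sketch below is a simplified scalar version under the assumption of the standard denoising-error formulation; the function and argument names are illustrative, not the authors' code:

```python
import math

def diffusion_dpo_loss(err_theta_w, err_ref_w, err_theta_l, err_ref_l, beta):
    """Pairwise preference loss -log sigmoid(-beta * delta), where delta
    measures how much better the fine-tuned model (theta) denoises the
    preferred image (w) than the dispreferred one (l), each relative to
    the frozen reference model (ref).

    Each err_* argument is a scalar squared denoising error
    ||eps - eps_model(x_t, t)||^2, already averaged over the batch.
    """
    delta = (err_theta_w - err_ref_w) - (err_theta_l - err_ref_l)
    # -log sigmoid(-beta * delta) == log(1 + exp(beta * delta))
    return math.log1p(math.exp(beta * delta))
```

When the fine-tuned model reconstructs the preferred image better than the reference does (delta < 0), the loss drops below log 2; large β values like those swept in the paper (500 to 5000) amplify small differences in denoising error into a strong preference signal.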