Dual Caption Preference Optimization for Diffusion Models
Authors: Amir Saeidi, Yiran Lawrence Luo, Agneet Chatterjee, Shamanthak Hegde, Bimsara Pathiraja, Yezhou Yang, Chitta Baral
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT-Chosen, Diffusion-DPO, and MaPO across multiple metrics, including PickScore, HPSv2.1, GenEval, CLIPScore, and ImageReward, with SD 2.1 as the backbone. |
| Researcher Affiliation | Academia | Amir Saeidi (EMAIL), School of Computing and Augmented Intelligence, Arizona State University |
| Pseudocode | No | The paper includes formulas and proofs but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block for the methodology described. Figure 12 provides a 'Sample source code' snippet but it is actual code, not pseudocode for the main algorithm. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We also construct Pick-Double Caption, a modified version of Pick-a-Pic v2 with separate captions for each image, and propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. ... To optimize a diffusion model using DCPO, a dataset D = {z^w, z^l, x^w_0, x^l_0} is required, where captions are paired with the images. However, the current preference dataset only contains prompts c and image pairs without captions. To address this, we propose three methods for generating captions z and introduce a new high-quality dataset, Pick-Double Caption, which provides specific captions for each image, based on Pick-a-Pic v2 (Kirstain et al., 2023). |
| Dataset Splits | No | The paper mentions sampling 20,000 instances from Pick-a-Pic v2 and evaluating on 2,500 unique prompts from Pick-a-Pic v2 and 3,200 prompts from HPSv2. However, it does not provide specific training, validation, and test dataset splits (e.g., percentages or exact counts for each split) for its own 'Pick-Double Caption' dataset. |
| Hardware Specification | Yes | Captioning 20,000 images using LLaVA requires less than 12 hours on a single A100 80 GB GPU. ... All fine-tuning methods and hyperparameter searches were conducted under a unified SD 2.1 setup using eight A100 80 GB GPUs. |
| Software Dependencies | No | The paper mentions models and frameworks like 'DIPPER (T5-XXL)', 'LLaVA (Liu et al., 2024a)', 'Emu2 (Sun et al., 2024)', and 'Stable Diffusion (SD) 2.1'. Figure 12 shows a code snippet using 'transformers', 'T5Tokenizer', and 'T5ForConditionalGeneration'. However, it does not provide specific version numbers for these software components or any other ancillary libraries required for full reproducibility. |
| Experiment Setup | Yes | We fine-tune SD 2.1 for 2,000 training steps with each method. The strongest configuration uses batch size 64 with learning rate 10^-9. ... We train SD 2.1 on triples {c, x^w, x^l} with batch size fixed to 128 and learning rate fixed to 10^-8. We sweep the regularization coefficient β over {500, 1000, 2000, 2500, 5000}. |
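To make the quoted setup concrete, below is a minimal sketch of the Diffusion-DPO-style preference loss that DCPO builds on, where (per the paper's dataset D = {z^w, z^l, x^w_0, x^l_0}) each image is conditioned on its own caption rather than a shared prompt. This is an illustrative toy on scalar denoising errors, not the authors' implementation; the function name and inputs are assumptions.

```python
import math

def dcpo_style_loss(err_w_theta, err_w_ref, err_l_theta, err_l_ref, beta):
    """Preference loss on per-sample denoising (noise-prediction) errors.

    err_w_theta / err_l_theta: fine-tuned model's MSE on the preferred (w)
    and dispreferred (l) image, each conditioned on its own caption z^w / z^l
    (the dual-caption idea). err_w_ref / err_l_ref: the same errors under the
    frozen reference model. beta is the regularization coefficient swept over
    {500, 1000, 2000, 2500, 5000} in the quoted setup.
    """
    # Model improves the loss by lowering its error on the preferred image
    # (relative to the reference) more than on the dispreferred one.
    margin = (err_w_theta - err_w_ref) - (err_l_theta - err_l_ref)
    # -log sigmoid(-beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(beta * margin)))
```

At the no-learning point (model errors equal the reference errors) the loss is log 2, and it falls below log 2 as the model fits the preferred image comparatively better, mirroring the Diffusion-DPO formulation.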