Improving Long-Text Alignment for Text-to-Image Diffusion Models
Authors: Luping Liu, Chao Du, Tianyu Pang, Zehan Wang, Chongxuan Li, Dong Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that segment-level encoding and training enable preference models to effectively handle long-text inputs and generate segment-level scores. Additionally, our preference decomposition method allows these models to produce T2I alignment scores alongside general preference scores. After fine-tuning the 512×512 Stable Diffusion v1.5 (Rombach et al., 2022) using LongAlign for about 20 hours on 6 A100 GPUs, the obtained long Stable Diffusion (longSD) significantly improves alignment (see Figure 1), outperforming stronger foundation models in long-text alignment, such as PixArt-α (Chen et al., 2023a) and Kandinsky v2.2 (Razzhigaev et al., 2023). Our contributions are as follows: We propose a segment-level encoding method that enables encoding models with limited input lengths to effectively process long-text inputs. We propose preference decomposition that enables preference models to produce T2I alignment scores alongside general preference, enhancing text-alignment fine-tuning in generative models. After about 20 hours of fine-tuning, our longSD surpasses stronger foundation models in long-text alignment, demonstrating significant improvement potential beyond the model architecture. |
| Researcher Affiliation | Collaboration | Luping Liu 1,2; Chao Du 2; Tianyu Pang 2; Zehan Wang 2,4; Chongxuan Li 3,5; Dong Xu 1. 1 The University of Hong Kong; 2 Sea AI Lab, Singapore; 3 Renmin University of China; 4 Zhejiang University; 5 Beijing Key Laboratory of Big Data Management and Analysis Methods. EMAIL; EMAIL; EMAIL; EMAIL; EMAIL; EMAIL |
| Pseudocode | Yes | C DECOMPOSED PREFERENCE OPTIMIZATION C.1 PSEUDOCODE Here, we provide the pseudocode in Algorithm 1 for the entire decomposed preference optimization pipeline discussed in this paper. Algorithm 1 Decomposed Preference Optimization for T2I Diffusion Models |
| Open Source Code | Yes | After fine-tuning 512×512 Stable Diffusion (SD) v1.5 for about 20 hours using our method, the fine-tuned SD outperforms stronger foundation models in T2I alignment, such as PixArt-α and Kandinsky v2.2. The code is available at https://github.com/luping-liu/LongAlign. |
| Open Datasets | Yes | For training the Unet, we utilize a dataset of approximately 2 million images, including 500k from SAM (Kirillov et al., 2023), 100k from COCO2017 (Lin et al., 2014), 500k from LLaVA (a subset of the LAION/CC/SBU dataset), and 1 million from JourneyDB (Sun et al., 2024). |
| Dataset Splits | Yes | We randomly reserve 5k images for evaluation. All images are recaptioned using LLaVA-Next (Liu et al., 2023) or ShareCaptioner (Chen et al., 2023b) and resized to 512×512 pixels. We optimize the model using the AdamW optimizer with a learning rate of 3×10⁻⁵, a 2k-step warmup, and a total batch size of 192. Training is conducted on 6 A100-40G GPUs for 30k steps over 12 hours. |
| Hardware Specification | Yes | Training is conducted on 6 A100-40G GPUs for 30k steps over 12 hours. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'UniPC' for sampling, and specific model versions such as 'Stable Diffusion v1.5', 'pretrained CLIP and T5 models', and 'PickScore and HPSv2'. However, it does not provide version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages (like Python) used for implementation. |
| Experiment Setup | Yes | We optimize the model using the AdamW optimizer with a learning rate of 3×10⁻⁵, a 2k-step warmup, and a total batch size of 192. Training is conducted on 6 A100-40G GPUs for 30k steps over 12 hours. (2) For the reward fine-tuning (RFT) stage of the Unet, we use the same settings as before but with a batch size of 96 and 4k total training steps over 8 hours. (3) For training the segment preference model, we use the same settings as for PickScore (Kirstain et al., 2023), employing CLIP-H on PickScore's training data, along with LLaVA-Next captions and our new segment-level loss function. More details about training the preference model can be found in Appendix B.1. |
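The segment-level encoding idea quoted above can be illustrated with a minimal sketch: a long caption is split into sentence-level segments so each segment fits a fixed-window text encoder (e.g. CLIP's 77-token context), each segment is encoded independently, and the per-segment embeddings are merged into one sequence. This is not the paper's implementation; `segment_text`, `toy_encode`, and `encode_long_text` are hypothetical names, and `toy_encode` is a deterministic stand-in for a real encoder.

```python
import math
import re

MAX_TOKENS = 77  # CLIP-style context window, assumed here for illustration


def segment_text(caption: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split a caption into sentence segments, greedily packing sentences
    so each segment stays under the token budget (tokens ~ whitespace words)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", caption) if s.strip()]
    segments, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments


def toy_encode(segment: str, dim: int = 8) -> list[float]:
    """Stand-in encoder: bag-of-words hash embedding, L2-normalized.
    A real system would call a pretrained text encoder here instead."""
    vec = [0.0] * dim
    for word in segment.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def encode_long_text(caption: str) -> list[list[float]]:
    """Encode each segment independently, then return the per-segment
    embeddings as one sequence a diffusion model could cross-attend to."""
    return [toy_encode(seg) for seg in segment_text(caption)]
```

The greedy packing keeps each segment within the encoder's window while preserving sentence boundaries, which is what lets a limited-context encoder process arbitrarily long prompts and what makes segment-level preference scores possible.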