Improving Long-Text Alignment for Text-to-Image Diffusion Models
Authors: Luping Liu, Chao Du, Tianyu Pang, Zehan Wang, Chongxuan Li, Dong Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that segment-level encoding and training enable preference models to effectively handle long-text inputs and generate segment-level scores. Additionally, our preference decomposition method allows these models to produce T2I alignment scores alongside general preference scores. After fine-tuning the 512×512 Stable Diffusion v1.5 (Rombach et al., 2022) using LongAlign for about 20 hours on 6 A100 GPUs, the obtained long Stable Diffusion (longSD) significantly improves alignment (see Figure 1), outperforming stronger foundation models in long-text alignment, such as PixArt-α (Chen et al., 2023a) and Kandinsky v2.2 (Razzhigaev et al., 2023). Our contributions are as follows: We propose a segment-level encoding method that enables encoding models with limited input lengths to effectively process long-text inputs. We propose preference decomposition that enables preference models to produce T2I alignment scores alongside general preference, enhancing text-alignment fine-tuning in generative models. After about 20 hours of fine-tuning, our longSD surpasses stronger foundation models in long-text alignment, demonstrating significant improvement potential beyond the model architecture. |
| Researcher Affiliation | Collaboration | Luping Liu 1,2; Chao Du 2; Tianyu Pang 2; Zehan Wang 2,4; Chongxuan Li 3,5; Dong Xu 1. 1 The University of Hong Kong; 2 Sea AI Lab, Singapore; 3 Renmin University of China; 4 Zhejiang University; 5 Beijing Key Laboratory of Big Data Management and Analysis Methods. EMAIL; EMAIL; EMAIL; EMAIL; EMAIL; EMAIL |
| Pseudocode | Yes | C DECOMPOSED PREFERENCE OPTIMIZATION C.1 PSEUDOCODE Here, we provide the pseudocode in Algorithm 1 for the entire decomposed preference optimization pipeline discussed in this paper. Algorithm 1 Decomposed Preference Optimization for T2I Diffusion Models |
| Open Source Code | Yes | After fine-tuning 512×512 Stable Diffusion (SD) v1.5 for about 20 hours using our method, the fine-tuned SD outperforms stronger foundation models in T2I alignment, such as PixArt-α and Kandinsky v2.2. The code is available at https://github.com/luping-liu/LongAlign. |
| Open Datasets | Yes | For training the Unet, we utilize a dataset of approximately 2 million images, including 500k from SAM (Kirillov et al., 2023), 100k from COCO2017 (Lin et al., 2014), 500k from LLaVA (a subset of the LAION/CC/SBU dataset), and 1 million from JourneyDB (Sun et al., 2024). |
| Dataset Splits | Yes | We randomly reserve 5k images for evaluation. All images are recaptioned using LLaVA-Next (Liu et al., 2023) or ShareCaptioner (Chen et al., 2023b) and resized to 512×512 pixels. We optimize the model using the AdamW optimizer with a learning rate of 3×10⁻⁵, a 2k-step warmup, and a total batch size of 192. Training is conducted on 6 A100-40G GPUs for 30k steps over 12 hours. |
| Hardware Specification | Yes | Training is conducted on 6 A100-40G GPUs for 30k steps over 12 hours. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'UniPC' for sampling, and specific model versions such as 'Stable Diffusion v1.5', 'pretrained CLIP and T5 models', and 'PickScore and HPSv2'. However, it does not provide version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages (like Python) used for implementation. |
| Experiment Setup | Yes | We optimize the model using the AdamW optimizer with a learning rate of 3×10⁻⁵, a 2k-step warmup, and a total batch size of 192. Training is conducted on 6 A100-40G GPUs for 30k steps over 12 hours. (2) For the reward fine-tuning (RFT) stage of the Unet, we use the same settings as before but with a batch size of 96 and 4k total training steps over 8 hours. (3) For training the segment preference model, we use the same settings as for PickScore (Kirstain et al., 2023), employing CLIP-H on PickScore's training data, along with LLaVA-Next captions and our new segment-level loss function. More details about training the preference model can be found in Appendix B.1. |
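The segment-level encoding idea quoted above can be illustrated with a minimal sketch: a long caption is split into sentence-level segments so each segment fits a fixed-window text encoder (e.g. CLIP's 77-token context), each segment is encoded independently, and the per-segment embeddings are merged into one sequence. This is not the paper's implementation; `segment_text`, `toy_encode`, and `encode_long_text` are hypothetical names, and `toy_encode` is a deterministic stand-in for a real encoder.

```python
import math
import re

MAX_TOKENS = 77  # CLIP-style context window, assumed here for illustration


def segment_text(caption: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Split a caption into sentence segments, greedily packing sentences
    so each segment stays under the token budget (tokens ~ whitespace words)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", caption) if s.strip()]
    segments, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments


def toy_encode(segment: str, dim: int = 8) -> list[float]:
    """Stand-in encoder: bag-of-words hash embedding, L2-normalized.
    A real system would call a pretrained text encoder here instead."""
    vec = [0.0] * dim
    for word in segment.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def encode_long_text(caption: str) -> list[list[float]]:
    """Encode each segment independently, then return the per-segment
    embeddings as one sequence a diffusion model could cross-attend to."""
    return [toy_encode(seg) for seg in segment_text(caption)]
```

The greedy packing keeps each segment within the encoder's window while preserving sentence boundaries, which is what lets a limited-context encoder process arbitrarily long prompts and what makes segment-level preference scores possible.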