Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds
Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present quantitative evaluation results for numerical and spatial composition in Tables 1 and 2. We calculate the output accuracy, the ratio of generated images correctly aligning with the text prompts. For numerical composition evaluation, we also compute the Mean Absolute Error (MAE) between the actual generated quantity and the specified quantity. We employ GPT-4o to determine the actual quantities or spatial relations in generated images (details in Appendix A.3.2). |
| Researcher Affiliation | Academia | Shuangqi Li1, Hieu Le1, Jingyi Xu2, Mathieu Salzmann1 1EPFL, Switzerland 2Stony Brook University, USA EMAIL EMAIL |
| Pseudocode | No | The paper describes the methodology in narrative text and figures but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Our code is available at https://github.com/doub7e/Reliable-Random-Seeds. |
| Open Datasets | No | The paper introduces a new dataset called 'Comp90 dataset' which it creates for its experiments. While the paper describes the composition of this dataset in detail in Appendix A.1, it does not provide a specific link, DOI, or repository for public access to the dataset itself. |
| Dataset Splits | Yes | We randomly divided them into a training set consisting of 60 categories and 8 settings and a test set of 30 categories and 4 settings. This yields a total of 2,400 prompts for training and 600 prompts for testing. In the end, we obtained 2,560 text prompts for training and 640 prompts for testing. |
| Hardware Specification | Yes | We fine-tuned Stable Diffusion 2.1 for 5,000 iterations using two NVIDIA A100 GPUs (each with 80 GB VRAM), which took 8 hours per run. We fine-tuned PixArt-α for 2,000 iterations using a single A100 GPU, with each procedure taking 2 hours. |
| Software Dependencies | No | The paper mentions using the 'open-source package Diffusers' and models like Stable Diffusion 2.1, PixArt-α, GPT-4o, and CogVLM2, but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | For fine-tuning Stable Diffusion 2.1, we set the batch size to 16 per GPU and the number of gradient accumulation steps to 4, resulting in an effective batch size of 128. Consequently, we used a scaled learning rate of 1.28e-4 = 1e-6 * 2 (GPUs) * 16 (batch size) * 4 (accumulation steps). During fine-tuning, all parameters were frozen except for those in the Q, K projection layers of attention modules, excluding those in the first down-sampling block and the last up-sampling block in the U-Net. For fine-tuning PixArt-α, we set the batch size to 64 and the learning rate to 2e-5, with gradient clipping set to 0.01. |
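The evaluation metrics quoted in the Research Type row (output accuracy and Mean Absolute Error for numerical composition) can be sketched as below. This is a minimal illustration, not the authors' code; it assumes the quantities extracted from generated images (e.g., by GPT-4o, as the paper describes) are available as plain integer counts.

```python
def evaluate_numerical(predicted, specified):
    """Compute output accuracy and MAE for numerical composition.

    predicted: object counts detected in each generated image
    specified: object counts requested by each text prompt
    """
    assert len(predicted) == len(specified)
    n = len(predicted)
    # Accuracy: fraction of images whose count exactly matches the prompt.
    accuracy = sum(p == s for p, s in zip(predicted, specified)) / n
    # MAE: mean absolute gap between generated and specified quantities.
    mae = sum(abs(p - s) for p, s in zip(predicted, specified)) / n
    return accuracy, mae


acc, mae = evaluate_numerical([3, 4, 2, 5], [3, 4, 3, 7])
# acc = 0.5 (2 of 4 match), mae = (0 + 0 + 1 + 2) / 4 = 0.75
```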
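The parameter-freezing rule in the Experiment Setup row (train only the Q, K attention projections, excluding the first down-sampling and last up-sampling U-Net blocks) could be selected roughly as follows. This is a sketch assuming the Hugging Face Diffusers naming convention for Stable Diffusion 2.1's U-Net, where Q/K projections are named `to_q`/`to_k` and the last up-sampling block is `up_blocks.3`; those names are assumptions, not confirmed by the paper.

```python
def select_trainable(named_parameters):
    """Return the names of parameters to leave unfrozen.

    named_parameters: iterable of (name, param) pairs, as yielded by
    a PyTorch module's named_parameters().
    """
    trainable = []
    for name, _ in named_parameters:
        # Only Q, K projection layers of attention modules are fine-tuned.
        if ".to_q." not in name and ".to_k." not in name:
            continue
        # Excluded: first down-sampling and last up-sampling block
        # (assumed indices under Diffusers' naming for SD 2.1).
        if name.startswith("down_blocks.0.") or name.startswith("up_blocks.3."):
            continue
        trainable.append(name)
    return trainable
```

In practice one would freeze everything (`param.requires_grad = False`) and re-enable gradients only for the names this function returns.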