Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds
Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present quantitative evaluation results for numerical and spatial composition in Tables 1 and 2. We calculate the output accuracy, the ratio of generated images correctly aligning with the text prompts. For numerical composition evaluation, we also compute the Mean Absolute Error (MAE) between the actual generated quantity and the specified quantity. We employ GPT-4o to determine the actual quantities or spatial relations in generated images (details in Appendix A.3.2). |
| Researcher Affiliation | Academia | Shuangqi Li1, Hieu Le1, Jingyi Xu2, Mathieu Salzmann1 1EPFL, Switzerland 2Stony Brook University, USA EMAIL EMAIL |
| Pseudocode | No | The paper describes the methodology in narrative text and figures but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | Our code is available at https://github.com/doub7e/Reliable-Random-Seeds. |
| Open Datasets | No | The paper introduces a new dataset called 'Comp90 dataset' which it creates for its experiments. While the paper describes the composition of this dataset in detail in Appendix A.1, it does not provide a specific link, DOI, or repository for public access to the dataset itself. |
| Dataset Splits | Yes | We randomly divided them into a training set consisting of 60 categories and 8 settings and a test set of 30 categories and 4 settings. This yields a total of 2,400 prompts for training and 600 prompts for testing. In the end, we obtained 2,560 text prompts for training and 640 prompts for testing. |
| Hardware Specification | Yes | We fine-tuned Stable Diffusion 2.1 for 5,000 iterations using two NVIDIA A100 GPUs (each with 80 GB VRAM), which took 8 hours per run. We fine-tuned PixArt-α for 2,000 iterations using a single A100 GPU, with each procedure taking 2 hours. |
| Software Dependencies | No | The paper mentions using the 'open-source package Diffusers' and models like Stable Diffusion 2.1, PixArt-α, GPT-4o, and CogVLM2, but it does not specify version numbers for any of these software dependencies. |
| Experiment Setup | Yes | For fine-tuning Stable Diffusion 2.1, we set the batch size to 16 per GPU and the number of gradient accumulation steps to 4, resulting in an effective batch size of 128. Consequently, we used a scaled learning rate of 1.28e-4 = 1e-6 * 2 (GPUs) * 16 (batch size) * 4 (accumulation steps). During fine-tuning, all parameters were frozen except for those in the Q, K projection layers of attention modules, excluding those in the first down-sampling block and the last up-sampling block in the U-Net. For fine-tuning PixArt-α, we set the batch size to 64 and the learning rate to 2e-5, with gradient clipping set to 0.01. |
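The evaluation metrics quoted in the Research Type row (output accuracy and Mean Absolute Error for numerical composition) can be sketched as below. This is a minimal illustration, not the authors' code; it assumes the quantities extracted from generated images (e.g., by GPT-4o, as the paper describes) are available as plain integer counts.

```python
def evaluate_numerical(predicted, specified):
    """Compute output accuracy and MAE for numerical composition.

    predicted: object counts detected in each generated image
    specified: object counts requested by each text prompt
    """
    assert len(predicted) == len(specified)
    n = len(predicted)
    # Accuracy: fraction of images whose count exactly matches the prompt.
    accuracy = sum(p == s for p, s in zip(predicted, specified)) / n
    # MAE: mean absolute gap between generated and specified quantities.
    mae = sum(abs(p - s) for p, s in zip(predicted, specified)) / n
    return accuracy, mae


acc, mae = evaluate_numerical([3, 4, 2, 5], [3, 4, 3, 7])
# acc = 0.5 (2 of 4 match), mae = (0 + 0 + 1 + 2) / 4 = 0.75
```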
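The parameter-freezing rule in the Experiment Setup row (train only the Q, K attention projections, excluding the first down-sampling and last up-sampling U-Net blocks) could be selected roughly as follows. This is a sketch assuming the Hugging Face Diffusers naming convention for Stable Diffusion 2.1's U-Net, where Q/K projections are named `to_q`/`to_k` and the last up-sampling block is `up_blocks.3`; those names are assumptions, not confirmed by the paper.

```python
def select_trainable(named_parameters):
    """Return the names of parameters to leave unfrozen.

    named_parameters: iterable of (name, param) pairs, as yielded by
    a PyTorch module's named_parameters().
    """
    trainable = []
    for name, _ in named_parameters:
        # Only Q, K projection layers of attention modules are fine-tuned.
        if ".to_q." not in name and ".to_k." not in name:
            continue
        # Excluded: first down-sampling and last up-sampling block
        # (assumed indices under Diffusers' naming for SD 2.1).
        if name.startswith("down_blocks.0.") or name.startswith("up_blocks.3."):
            continue
        trainable.append(name)
    return trainable
```

In practice one would freeze everything (`param.requires_grad = False`) and re-enable gradients only for the names this function returns.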