Control and Realism: Best of Both Worlds in Layout-to-Image without Training

Authors: Bonan Li, Yinhan Hu, Songhua Liu, Xinchao Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
Researcher Affiliation | Academia | 1 University of Chinese Academy of Sciences, Beijing, China; 2 National University of Singapore, Singapore.
Pseudocode | No | The paper includes mathematical equations and descriptions of the method, but it does not contain a clearly labeled pseudocode block or algorithm figure.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code, nor does it include a link to a code repository.
Open Datasets | Yes | Akin to prior work (Chen et al., 2024d), we quantitatively evaluate our WinWinLay on COCO2014 (Lin et al., 2014) and Flickr30K (Plummer et al., 2015).
Dataset Splits | No | The paper states that evaluation is done on COCO2014 and Flickr30K akin to prior work, but it does not explicitly provide specific training/test/validation split percentages, sample counts, or detailed splitting methodology within the text.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions using "Stable Diffusion 1.5" and tools like "YOLOv7" and "CLIP-s", but it does not list specific software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or their respective version numbers, which are essential for reproducibility.
Experiment Setup | Yes | We adopt Stable Diffusion 1.5 (Rombach et al., 2022), pre-trained on LAION-5B (Schuhmann et al., 2022a), as our base Text-to-Image model. During generation, we employ the DDIM sampler with 50 steps and set the guidance scale to 7.5. Since layout construction typically occurs during the early stages of denoising, we apply the layout constraint only within the initial 10 steps. The hyperparameter ρ of the non-local attention prior is set to 5/0 for max/min, respectively. For the adaptive update, the number of Langevin dynamics steps O is set to 4 and the signal-to-noise ratio r to 0.06.
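The reported sampling hyperparameters can be sketched as a simple schedule. This is a hypothetical illustration, not the authors' code: the function and field names are invented, and it only encodes the numbers quoted above (50 DDIM steps, guidance scale 7.5, layout constraint in the first 10 steps, O = 4 Langevin updates, SNR r = 0.06), not the actual attention-based layout optimization.

```python
# Illustrative sketch of the sampling schedule reported in the paper.
# All names here are hypothetical; only the numeric settings come from the text.

NUM_STEPS = 50        # DDIM sampling steps
GUIDANCE_SCALE = 7.5  # classifier-free guidance scale
LAYOUT_STEPS = 10     # layout constraint applied only in these early steps
LANGEVIN_STEPS = 4    # Langevin dynamics updates O per constrained step
SNR = 0.06            # signal-to-noise ratio r for the adaptive update

def sampling_schedule(num_steps=NUM_STEPS):
    """Return per-step settings: Langevin updates run only while the
    layout constraint is active (the first LAYOUT_STEPS steps)."""
    schedule = []
    for t in range(num_steps):
        constrained = t < LAYOUT_STEPS
        schedule.append({
            "step": t,
            "guidance": GUIDANCE_SCALE,
            "langevin_updates": LANGEVIN_STEPS if constrained else 0,
            "snr": SNR if constrained else None,
        })
    return schedule

schedule = sampling_schedule()
```

Under these settings, only the first 10 of the 50 denoising steps incur the extra Langevin updates, which matches the paper's observation that layout is fixed early in denoising.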