Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

Authors: Abdelrahman Eldesokey, Peter Wonka

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that our approach can generate complicated scenes based on 3D layouts, outperforming the standard depth-conditioned T2I methods by two-folds on object generation success rate. Moreover, it outperforms all methods in comparison on preserving objects under layout changes."
Researcher Affiliation | Academia | Abdelrahman Eldesokey & Peter Wonka, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, {first.last}@kaust.edu.sa
Pseudocode | No | The paper describes its methods using equations and textual descriptions, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks, nor structured code-like procedures.
Open Source Code | Yes | Project Page: https://abdo-eldesokey.github.io/build-a-scene/ ... "The source code and the evaluation protocol are publicly available." https://github.com/abdo-eldesokey/build-a-scene
Open Datasets | Yes | "We define a set of 16 objects from the MS COCO dataset (Lin et al., 2014) and their corresponding aspect ratios."
Dataset Splits | Yes | "We sampled 100 random layouts and ran each layout with 5 different seeds for fairness."
Hardware Specification | No | The paper does not report hardware details such as GPU models, CPU types, or memory used for the experiments. It names software components such as ControlNet and Stable Diffusion v1.5, but not the underlying hardware.
Software Dependencies | Yes | "LC is based on ControlNet with Stable Diffusion v1.5 (Rombach et al., 2022)... We use a general object detector, YOLOv8 (Reis et al., 2023)... as an input to SAM (Kirillov et al., 2023)... monocular depth estimation model, i.e. Depth-Anything (Yang et al., 2024)... we employ the Omni3D (Brazil et al., 2023) detector"
Experiment Setup | Yes | "We perform T = 20 denoising steps in the quantitative comparison for efficiency and T = 40 for the qualitative results for better quality. ... We sampled 100 random layouts and ran each layout with 5 different seeds for fairness. ... We also experiment with varying τ in Section 3.3 for blending the latents. ... By applying Equation (6) for τ = 0.4T, the sofa is seamlessly integrated into the scene... When τ = 0.8T, the sofa is seamlessly blended into the scene..."
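The latent-blending schedule quoted above (object latents are blended into the scene latent for a fraction τ of the T denoising steps) can be sketched as follows. This is a minimal illustration only: the function name, the mask convention, and the hard cutoff at τ·T are assumptions, not the paper's exact Equation (6).

```python
import numpy as np

def blend_latents(z_obj, z_scene, mask, step, tau, T=20):
    """Hypothetical sketch of per-step latent blending.

    For the first tau*T denoising steps, the object latent z_obj
    overwrites the scene latent z_scene inside the object mask;
    after that cutoff, the scene latent denoises freely so the
    object integrates with its surroundings.
    """
    if step < tau * T:
        return mask * z_obj + (1.0 - mask) * z_scene
    return z_scene

# Toy latents: object latent of ones, scene latent of zeros,
# and a mask covering the left half of an 8x8 latent grid.
z_obj = np.ones((4, 8, 8))
z_scene = np.zeros((4, 8, 8))
mask = np.zeros((4, 8, 8))
mask[:, :, :4] = 1.0

early = blend_latents(z_obj, z_scene, mask, step=3, tau=0.4)   # 3 < 0.4 * 20
late = blend_latents(z_obj, z_scene, mask, step=10, tau=0.4)   # 10 >= 0.4 * 20
```

With T = 20 (the quoted efficiency setting) and τ = 0.4, blending would be active only for the first 8 denoising steps; raising τ to 0.8 would extend it to 16 steps.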