Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation
Authors: Abdelrahman Eldesokey, Peter Wonka
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our approach can generate complicated scenes based on 3D layouts, outperforming the standard depth-conditioned T2I methods by two-folds on object generation success rate. Moreover, it outperforms all methods in comparison on preserving objects under layout changes. |
| Researcher Affiliation | Academia | Abdelrahman Eldesokey & Peter Wonka King Abdullah University of Science and Technology (KAUST) Thuwal, Saudi Arabia {first.last}@kaust.edu.sa |
| Pseudocode | No | The paper describes methods using equations and textual descriptions, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks, nor structured code-like procedures. |
| Open Source Code | Yes | Project Page: https://abdo-eldesokey.github.io/build-a-scene/ ... The source code and the evaluation protocol are publicly available. 1https://github.com/abdo-eldesokey/build-a-scene |
| Open Datasets | Yes | We define a set of 16 objects from the MS COCO dataset (Lin et al., 2014) and their corresponding aspect ratios. |
| Dataset Splits | Yes | We sampled 100 random layouts and ran each layout with 5 different seeds for fairness. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running experiments. It mentions software frameworks such as ControlNet and Stable Diffusion v1.5, but not the underlying hardware. |
| Software Dependencies | Yes | LC is based on ControlNet with Stable Diffusion v1.5 Rombach et al. (2022)... We use a general object detector, YOLOv8 (Reis et al., 2023)... as an input to SAM (Kirillov et al., 2023)... monocular depth estimation model, i.e. Depth-Anything Yang et al. (2024)... we employ the Omni3D Brazil et al. (2023) detector |
| Experiment Setup | Yes | We perform T = 20 denoising steps in the quantitative comparison for efficiency and T = 40 for the qualitative results for better quality. ... We sampled 100 random layouts and ran each layout with 5 different seeds for fairness. ... We also experiment with varying τ in Section 3.3 for blending the latents. ... By applying Equation (6) for τ = 0.4T, the sofa is seamlessly integrated into the scene... When τ = 0.8T, the sofa is seamlessly blended into the scene... |