3DIS: Depth-Driven Decoupled Image Synthesis for Universal Multi-Instance Generation

Authors: Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the COCO-Position and COCO-MIG benchmarks demonstrate that 3DIS significantly outperforms existing methods in both layout precision and attribute rendering. Notably, 3DIS offers seamless compatibility with diverse foundational models, providing a robust, adaptable solution for advanced multi-instance generation.
Researcher Affiliation | Academia | Dewei Zhou¹, Ji Xie¹, Zongxin Yang², Yi Yang¹. ¹RELER, CCAI, Zhejiang University; ²DBMI, HMS, Harvard University. Email: {Zongxin Yang}@hms.harvard.edu
Pseudocode | No | The paper describes the 3DIS framework and its three key components (Scene Depth Map Generation, Layout Control, and Detail Rendering) in prose and with mathematical equations (e.g., for cross-attention and filtering), but does not include a distinct pseudocode block or algorithm section.
Open Source Code | Yes | The code is available at: https://github.com/limuloo/3DIS.
Open Datasets | Yes | We conducted extensive experiments on two benchmarks to evaluate the performance of 3DIS: (i) COCO-Position (Lin et al., 2015; Zhou et al., 2024a), which evaluated the layout accuracy and coarse-grained category attributes of the scene depth maps, and (ii) COCO-MIG (Zhou et al., 2024a), which assessed fine-grained rendering capabilities. ... In alignment with this approach, we utilized the COCO dataset (Lin et al., 2015) for training.
Dataset Splits | No | We utilized a training set comprising 5,878 images from the LAION-art dataset (Schuhmann et al., 2021), selecting only those with a resolution exceeding 512x512 pixels and an aesthetic score of 8.0. ... We utilized the COCO dataset (Lin et al., 2015) for training. ... For a comprehensive evaluation, each model generated 750 images across both benchmarks. The paper reports the training-set size for LAION-art and the number of evaluation images for COCO-MIG/Position, but specifies no training/validation/test splits for either dataset.
Hardware Specification | No | The paper does not explicitly provide hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions using LDM3D, the AdamW optimizer, Stanza (Qi et al., 2020), and Grounding-DINO (Liu et al., 2023) as tools and models, but it does not provide version numbers for any of these software dependencies.
Experiment Setup | Yes | We employed pyramid noise (Kasiopy, 2023) to fine-tune the LDM3D model for 2,000 steps, utilizing the AdamW (Kingma & Ba, 2017) optimizer with a constant learning rate of 1e-4, a weight decay of 1e-2, and a batch size of 320.
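The quoted setup names pyramid noise (Kasiopy, 2023) as the fine-tuning perturbation. Below is a minimal NumPy sketch of pyramid (multi-resolution) noise as it is commonly implemented: Gaussian noise is drawn at progressively coarser resolutions, upsampled, and summed with geometrically decaying weights. The `discount` factor, the nearest-neighbour upsampling, and the unit-variance renormalization are illustrative assumptions, not details taken from the paper or the 3DIS codebase.

```python
import numpy as np

def pyramid_noise(h, w, discount=0.8, rng=None):
    """Sum Gaussian noise over a resolution pyramid (illustrative sketch).

    Each coarser level is upsampled back to (h, w) and added with a
    geometrically decaying weight; the result is renormalized so the
    diffusion training target keeps roughly unit variance.
    """
    rng = rng or np.random.default_rng(0)  # fixed seed for reproducibility
    noise = rng.standard_normal((h, w))    # finest level, weight 1.0
    scale, level = 1.0, 1
    while min(h, w) >> level >= 1:
        ch, cw = h >> level, w >> level    # coarser resolution
        coarse = rng.standard_normal((ch, cw))
        # nearest-neighbour upsample back to (h, w), cropping any overshoot
        up = np.repeat(np.repeat(coarse, h // ch + 1, axis=0),
                       w // cw + 1, axis=1)[:h, :w]
        scale *= discount                  # decay weight per level
        noise += scale * up
        level += 1
    return noise / noise.std()             # renormalize to unit variance
```

The coarse levels inject long-range spatial correlations into the noise, which is the property pyramid-noise fine-tuning exploits; the decay factor controls how strongly low frequencies dominate.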