Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

Authors: Slava Elizarov, Ciara Rowles, Simon Donné

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train our model on the Objaverse dataset (Deitke et al., 2022). We curate this dataset to include only objects with both high-quality structures and semantically meaningful UV maps by filtering out 3D scans and low-poly models. The final dataset contains approximately 100,000 objects. Each entry is accompanied by captions provided by Cap3D (Luo et al., 2023) and Hong et al. (2024). We used T3Bench (He et al., 2023) for automatic evaluation, measuring both generation quality and prompt alignment of the resulting meshes. As shown in Table 1, our method achieves competitive results compared to state-of-the-art models, while providing superior editability via separable parts and producing near-artistic UV maps. In this section, we validate our method by ablating key design choices and evaluating their impact. Specifically, we ablate three aspects of our approach: the absence of cross-attention layers in the geometry image branch, the use of the Collaborative Control mechanism, and the cylindrical coordinate transform.
Researcher Affiliation | Industry | Slava Elizarov, Ciara Rowles, Simon Donné (Unity Technologies)
Pseudocode | No | The paper describes the methodology in detail but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not explicitly state that the authors are releasing the code for GIMDiffusion. It mentions using the codebase of Stable Diffusion and refers to other works with available source code, but not their own.
Open Datasets | Yes | We train our model on the Objaverse dataset (Deitke et al., 2022). We used T3Bench (He et al., 2023) for automatic evaluation, measuring both generation quality and prompt alignment of the resulting meshes. T3Bench includes three prompt sets: Single Object, Single Object with Surroundings, and Multiple Objects.
Dataset Splits | No | The paper mentions the total size of the curated Objaverse dataset (approximately 100,000 objects) and that 100 prompts from T3Bench were used for evaluation, but it does not provide specific training, validation, and test splits (e.g., percentages or counts) for the main dataset used for model training.
Hardware Specification | Yes | The entire pre-processing was performed on consumer-grade PC hardware (AMD Ryzen 9 7950X, GeForce RTX 3090, 64 GB RAM) and took approximately 20 hours. All stages of training were conducted with a learning rate of 3e-5 on 8 A100 GPUs. All evaluations were performed on a single A100 GPU. The model was trained on 8 A100 GPUs with a batch size of 128 for 100k steps.
Software Dependencies | No | We trained our VAE following the procedure and codebase of Stable Diffusion (Rombach et al., 2021), leaving out only the GAN and LPIPS losses (see details in appendix A). For the frozen base model, we used a zero-terminal-SNR (Lin et al., 2024) version fine-tuned from Stable Diffusion v2.1 (Rombach et al., 2021) as the base text-to-image model. We used PyVista (Sullivan & Kaszynski, 2019) for all feed-forward methods. We constructed an index of binary multi-chart masks extracted from all geometry images in our dataset using the Faiss library (Douze et al., 2024). In cases where only a partial UV mapping is available, we use XAtlas (Young, 2022) to UV-unwrap the missing regions. The paper lists several software components and libraries (Stable Diffusion v2.1, PyVista, Faiss, XAtlas) but does not provide specific version numbers for them, except for the base model being Stable Diffusion v2.1.
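The paper's use of Faiss to index binary multi-chart masks amounts to nearest-neighbour retrieval under Hamming distance over bit-packed masks. As a minimal sketch of what such a binary index computes (a NumPy stand-in for `faiss.IndexBinaryFlat`; the mask sizes and function names here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def pack_masks(masks):
    # masks: (n, H, W) boolean arrays -> (n, H*W/8) uint8 rows,
    # the bit-packed layout Faiss binary indexes expect.
    n = masks.shape[0]
    return np.packbits(masks.reshape(n, -1), axis=1)

def hamming_search(index_codes, query_codes, k=1):
    # Hamming distance = popcount of XOR between each (query, database)
    # pair of bit-packed rows; return the k nearest database entries.
    dists = np.unpackbits(
        index_codes[None, :, :] ^ query_codes[:, None, :], axis=2
    ).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return dists[np.arange(len(query_codes))[:, None], nearest], nearest

rng = np.random.default_rng(0)
db_masks = rng.random((100, 16, 16)) > 0.5  # 100 hypothetical binary masks
codes = pack_masks(db_masks)
dist, idx = hamming_search(codes, codes[:3], k=1)
# Each query mask retrieves itself at Hamming distance 0.
```

With Faiss itself, `pack_masks` output would be fed to `faiss.IndexBinaryFlat(d).add(...)` and queried via `.search(...)`, which performs the same popcount-of-XOR computation with optimized kernels.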
Experiment Setup | Yes | Initially, we trained the model at 256×256 resolution for 250,000 steps with a batch size of 384, and then at the final output resolution of 768×768 for a total of 100,000 steps with a batch size of 64. All stages of training were conducted with a learning rate of 3e-5 on 8 A100 GPUs. We trained our VAE following the procedure and codebase of Stable Diffusion (Rombach et al., 2021), leaving out only the GAN and LPIPS losses (see details in appendix A). The model was trained on 8 A100 GPUs with a batch size of 128 for 100k steps.
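The reported two-stage schedule can be summarized as a config sketch; the per-GPU batch sizes below are derived under the assumption that the global batch is sharded evenly across the 8 A100s (the paper does not state the sharding):

```python
# Hypothetical restatement of the reported training schedule.
STAGES = [
    {"name": "low-res", "resolution": 256, "steps": 250_000, "global_batch": 384},
    {"name": "high-res", "resolution": 768, "steps": 100_000, "global_batch": 64},
]
NUM_GPUS = 8
LEARNING_RATE = 3e-5  # constant across both stages, per the paper

def per_gpu_batch(stage, num_gpus=NUM_GPUS):
    # Effective per-device batch if the global batch splits evenly.
    assert stage["global_batch"] % num_gpus == 0
    return stage["global_batch"] // num_gpus

# per_gpu_batch(STAGES[0]) -> 48; per_gpu_batch(STAGES[1]) -> 8
```

The separately quoted "batch size of 128 for 100k steps" refers to the VAE training run, not the diffusion stages above.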