Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

Authors: Slava Elizarov, Ciara Rowles, Simon Donné

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train our model on the Objaverse dataset (Deitke et al., 2022). We curate this dataset to include only objects with both high-quality structures and semantically meaningful UV maps by filtering out 3D scans and low-poly models. The final dataset contains approximately 100,000 objects. Each entry is accompanied by captions provided by Cap3D (Luo et al., 2023) and Hong et al. (2024). We used T3Bench (He et al., 2023) for automatic evaluation, measuring both generation quality and prompt alignment of the resulting meshes. As shown in Table 1, our method achieves competitive results compared to state-of-the-art models, while providing superior editability via separable parts and producing near-artistic UV maps. In this section, we validate our method by ablating key design choices and evaluating their impact. Specifically, we ablate three aspects of our approach: the absence of cross-attention layers in the geometry image branch, the use of the Collaborative Control mechanism, and the cylindrical coordinate transform.
Researcher Affiliation | Industry | Slava Elizarov, Ciara Rowles, Simon Donné (Unity Technologies)
Pseudocode | No | The paper describes the methodology in detail but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not explicitly state that the authors are releasing the code for GIMDiffusion. It mentions using the codebase of Stable Diffusion and refers to other works with available source code, but not their own.
Open Datasets | Yes | We train our model on the Objaverse dataset (Deitke et al., 2022). We used T3Bench (He et al., 2023) for automatic evaluation, measuring both generation quality and prompt alignment of the resulting meshes. T3Bench includes three prompt sets: Single Object, Single Object with Surroundings, and Multiple Objects.
Dataset Splits | No | The paper mentions the total size of the curated Objaverse dataset (approximately 100,000 objects) and that 100 prompts from T3Bench were used for evaluation, but it does not provide specific training, validation, and test splits (e.g., percentages or counts) for the main dataset used for model training.
Hardware Specification | Yes | The entire pre-processing was performed on consumer-grade PC hardware (AMD Ryzen 9 7950X, GeForce RTX 3090, 64 GB RAM) and took approximately 20 hours. All stages of training were conducted with a learning rate of 3e-5 on 8 A100 GPUs. All evaluations were performed on a single A100 GPU. The model was trained on 8 A100 GPUs with a batch size of 128 for 100k steps.
Software Dependencies | No | We trained our VAE following the procedure and codebase of Stable Diffusion (Rombach et al., 2021), leaving out only the GAN and LPIPS losses (see details in appendix A). For the frozen base model, we used a zero-terminal-SNR (Lin et al., 2024) version fine-tuned from Stable Diffusion v2.1 (Rombach et al., 2021) as the base text-to-image model. We used PyVista (Sullivan & Kaszynski, 2019) for all feed-forward methods. We constructed an index of binary multi-chart masks extracted from all geometry images in our dataset using the Faiss library (Douze et al., 2024). In cases where only a partial UV mapping is available, we use XAtlas (Young, 2022) to UV-unwrap the missing regions. The paper lists several software components and libraries (Stable Diffusion v2.1, PyVista, Faiss, XAtlas) but does not provide specific version numbers for them, except for the base model being Stable Diffusion v2.1.
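The paper's use of Faiss to index binary multi-chart masks amounts to nearest-neighbour retrieval under Hamming distance over bit-packed masks. As a minimal sketch of what such a binary index computes (a NumPy stand-in for `faiss.IndexBinaryFlat`; the mask sizes and function names here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def pack_masks(masks):
    # masks: (n, H, W) boolean arrays -> (n, H*W/8) uint8 rows,
    # the bit-packed layout Faiss binary indexes expect.
    n = masks.shape[0]
    return np.packbits(masks.reshape(n, -1), axis=1)

def hamming_search(index_codes, query_codes, k=1):
    # Hamming distance = popcount of XOR between each (query, database)
    # pair of bit-packed rows; return the k nearest database entries.
    dists = np.unpackbits(
        index_codes[None, :, :] ^ query_codes[:, None, :], axis=2
    ).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return dists[np.arange(len(query_codes))[:, None], nearest], nearest

rng = np.random.default_rng(0)
db_masks = rng.random((100, 16, 16)) > 0.5  # 100 hypothetical binary masks
codes = pack_masks(db_masks)
dist, idx = hamming_search(codes, codes[:3], k=1)
# Each query mask retrieves itself at Hamming distance 0.
```

With Faiss itself, `pack_masks` output would be fed to `faiss.IndexBinaryFlat(d).add(...)` and queried via `.search(...)`, which performs the same popcount-of-XOR computation with optimized kernels.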
Experiment Setup | Yes | Initially, we trained the model at 256×256 resolution for 250,000 steps with a batch size of 384, and then at the final output resolution of 768×768 for a total of 100,000 steps with a batch size of 64. All stages of training were conducted with a learning rate of 3e-5 on 8 A100 GPUs. We trained our VAE following the procedure and codebase of Stable Diffusion (Rombach et al., 2021), leaving out only the GAN and LPIPS losses (see details in appendix A). The model was trained on 8 A100 GPUs with a batch size of 128 for 100k steps.
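The reported two-stage schedule can be summarized as a config sketch; the per-GPU batch sizes below are derived under the assumption that the global batch is sharded evenly across the 8 A100s (the paper does not state the sharding):

```python
# Hypothetical restatement of the reported training schedule.
STAGES = [
    {"name": "low-res", "resolution": 256, "steps": 250_000, "global_batch": 384},
    {"name": "high-res", "resolution": 768, "steps": 100_000, "global_batch": 64},
]
NUM_GPUS = 8
LEARNING_RATE = 3e-5  # constant across both stages, per the paper

def per_gpu_batch(stage, num_gpus=NUM_GPUS):
    # Effective per-device batch if the global batch splits evenly.
    assert stage["global_batch"] % num_gpus == 0
    return stage["global_batch"] // num_gpus

# per_gpu_batch(STAGES[0]) -> 48; per_gpu_batch(STAGES[1]) -> 8
```

The separately quoted "batch size of 128 for 100k steps" refers to the VAE training run, not the diffusion stages above.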