A3D: Does Diffusion Dream about 3D Alignment?

Authors: Savva Ignatyev, Nina Konovalova, Daniil Selikhanovych, Oleg Voinov, Nikolay Patakin, Ilya Olkov, Dmitry Senushkin, Alexey Artemov, Anton Konushin, Alexander Filippov, Peter Wonka, Evgeny Burnaev

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We compare our method with alternatives quantitatively for the generation of pairs of aligned 3D objects and for the structure-preserving transformation of 3D models. We discuss the results of the hybridization of aligned objects generated with our method from the qualitative perspective. We implement of our method based on MVDream text-to-3D generation model (Shi et al., 2024), that uses an efficient version of Ne RF Instant-NGP (M uller et al., 2022) to represent the 3D scene. In Appendix A, we show that our method is also effective in combination with a different model Rich Dreamer (Qiu et al., 2024) that represents 3D objects using DMTet (Shen et al., 2021). We show additional applications of our method in Appendix E. We refer the reader to the complete set of animated results of our experiments on the project page for a more complete picture. In Appendix B.4, we provide the computational costs and hardware details. Metrics. We quantify three aspects of the generated pairs of aligned objects and the results of the structure-preserving transformation. The first one is the degree of alignment between the corresponding structural parts of the objects in a generated pair, or of a source 3D model and its transformed version. Measuring such alignment directly would require explicit detection of corresponding structural parts for an arbitrary pair of objects, which is a hard task by itself, even if the objects have similar structure. Recently, Tang et al. (2023) have proposed a method for finding corresponding points in pairs of images of arbitrary similar objects by matching features extracted from pretrained 2D diffusion models. Based on this method, called DIFT, we define DIFT distance that we use to measure the structural alignment. To compute this distance for a pair of objects, we render them from the same viewpoint. We densely sample points on one of the renders and find the corresponding points on the other one with DIFT. 
For an ideally aligned pair of objects, a sample and its corresponding point have identical image coordinates. We therefore define the DIFT distance as the average distance between these coordinates across all samples, normalized by the size of the objects in image space for better interpretability. We report the value averaged across multiple viewpoints around the objects and over the points sampled on each object in the pair. Second, we measure the semantic coherence between a generated object and the respective text prompt, following the methodology of GPTEval3D (Mao et al., 2023), which was shown to align well with human perception. Specifically, we ask a large multimodal model, GPT-4o (OpenAI, 2024), to compare the 3D objects generated with two methods for the same text prompt and choose the one that is more consistent with the prompt, based on Text-Asset Alignment and Text-Geometry Alignment. We compare our method against each alternative and report the percentage of comparisons in which our method is preferred. Additionally, we measure the coherence between the generated object and the prompt using CLIP similarity (Jain et al., 2022), defined as the cosine similarity between the CLIP (Radford et al., 2021) embeddings of a render of the object and the respective text prompt. Finally, we evaluate the visual quality of the generated objects and the quality of their surfaces. For this, we compare the objects generated with two methods using GPTEval3D based on 3D Plausibility, Texture Details, Geometry Details, and Overall quality.
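The DIFT-distance computation quoted above can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: the function names, array shapes, and the per-view averaging layout are assumptions, and the DIFT point matching itself is treated as a given input rather than reimplemented.

```python
import numpy as np

def dift_distance_one_view(samples, matches, object_size):
    """Normalized DIFT distance for a single viewpoint (sketch).

    samples:     (N, 2) image coordinates densely sampled on render A.
    matches:     (N, 2) DIFT-matched coordinates on render B.
    object_size: extent of the objects in image space (pixels),
                 used to normalize the distance for interpretability.
    """
    # Average Euclidean distance between each sample and its match;
    # zero for an ideally aligned pair of objects.
    pixel_dist = np.linalg.norm(samples - matches, axis=1).mean()
    return pixel_dist / object_size

def dift_distance(per_view_matches, object_size):
    """Average the per-view distances over all viewpoints (and over
    the points sampled on each object of the pair)."""
    return float(np.mean([dift_distance_one_view(s, m, object_size)
                          for s, m in per_view_matches]))
```

As a sanity check, identical sample and match coordinates give a distance of exactly zero, and a uniform 5-pixel offset with `object_size=100` gives 0.05.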
Researcher Affiliation Collaboration (1) Skoltech, Russia; (2) AIRI, Russia; (3) Medida AI, Israel; (4) AI Foundation and Algorithm Lab, Russia; (5) KAUST, Saudi Arabia
Pseudocode No The paper describes methods using equations and prose, but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code No The paper references source code for several third-party tools and baselines (e.g., MVDream, MVEdit, GaussianEditor, LucidDreamer, Stable Diffusion 2.1, Zero123++, Gaussian Splatting, InstructPix2Pix, Language Segment-Anything, DreamShaper 8) but does not provide an explicit statement or link for the code implementation of their own method, A3D.
Open Datasets Yes MVDream fine-tuned the open-source Stable Diffusion 2.1 model (Rombach & Esser) on the Objaverse (Deitke et al., 2023) and LAION (Schuhmann et al., 2022) datasets.
Dataset Splits No The paper mentions using Objaverse and LAION datasets for fine-tuning the base MVDream model, and using 120 renders from 120 camera positions for evaluation and generating Gaussian Splatting representations, but it does not provide specific training/validation/test splits for its own experiments or for the models it fine-tunes.
Hardware Specification Yes We run all experiments on a single Nvidia A100 GPU. The authors acknowledge the use of the Skoltech supercomputer Zhores (Zacharov et al., 2019) to obtain the results presented in this paper.
Software Dependencies Yes We use the same NeRF architecture and the majority of hyperparameters for training as MVDream (Shi et al., 2024). We build our method based on the official MVDream implementation using the threestudio framework (Shi et al., 2023c). We utilize SDS with Stable Diffusion 2.1 and a depth-normal diffusion model to improve the accuracy of depth predictions.
Experiment Setup Yes We use the same NeRF architecture and the majority of hyperparameters for training as MVDream (Shi et al., 2024). We gradually increase the weight of the orientation penalty from 100 to 1000. We set the weight of the normal smoothness loss to the value of 10. We initialize the geometry representation using a uniform sphere with a radius of one, and utilize SDS with Stable Diffusion 2.1 and a depth-normal diffusion model to improve the accuracy of depth predictions. For smoother interpolation between prompts, we incorporate normal consistency loss, and after experimentation, we found that setting the loss coefficient between 3 and 5 yields better results than the original configuration.
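The quoted setup says the orientation-penalty weight is "gradually increased" from 100 to 1000 but does not specify the schedule. A linear ramp over the optimization steps is one plausible reading, sketched below with hypothetical function and parameter names:

```python
def orientation_penalty_weight(step, total_steps,
                               w_start=100.0, w_end=1000.0):
    """Linearly ramp the orientation-penalty weight from w_start to
    w_end over optimization (assumed schedule; the paper only states
    the endpoints 100 and 1000, not the interpolation)."""
    t = min(max(step / total_steps, 0.0), 1.0)  # progress in [0, 1]
    return w_start + t * (w_end - w_start)
```

For example, this gives 100 at step 0, 550 halfway through, and 1000 at the final step; any monotone schedule (e.g., exponential) would be an equally valid reading of the paper's description.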