Optimized View and Geometry Distillation from Multi-view Diffuser

Authors: Youjia Zhang, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical evaluations demonstrate that our optimized geometry and view distillation technique generates comparable results to the state-of-the-art models trained on extensive datasets, all while maintaining freedom in camera positioning. ... We conduct extensive experiments, both qualitatively and quantitatively, to demonstrate the effectiveness of our method."
Researcher Affiliation | Academia | "Youjia Zhang¹, Zikai Song¹, Junqing Yu¹, Yawei Luo² and Wei Yang¹ — ¹Huazhong University of Science and Technology, ²Zhejiang University"
Pseudocode | No | The paper describes the methodology using prose and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Source code of our work is publicly available at: https://youjiazhang.github.io/USD/."
Open Datasets | Yes | "Following prior research [Liu et al., 2023b; Liu et al., 2024; Long et al., 2024], we adopt the Google Scanned Object dataset [Downs et al., 2022] for our evaluation, which includes a wide variety of common everyday objects."
Dataset Splits | No | The paper states it uses the Google Scanned Object dataset and that its evaluation set matches SyncDreamer's, consisting of 30 objects. However, it does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts).
Hardware Specification | Yes | "The USD (NeRF) process takes about 1.5 hours on an NVIDIA Tesla V100 (32GB) GPU."
Software Dependencies | Yes | "We adopt the Stable Diffusion [Takagi and Nishimoto, 2023] model of V2.1. The DreamBooth backbone is implemented using Stable Diffusion V2.1."
Experiment Setup | Yes | "The NeRF is optimized for 10,000 steps with an Adam optimizer at a learning rate of 0.01, weight decay of 0.05, and betas of (0.9, 0.95). For USD, the maximum and minimum time steps are decreased from 0.98 to 0.5 and 0.02, respectively, over the first 5,000 steps. We adopt the Stable Diffusion [Takagi and Nishimoto, 2023] model of V2.1. The classifier-free guidance (CFG) scale of the USD is set to 7.5 following [Wang et al., 2023b]. The DreamBooth backbone is implemented using Stable Diffusion V2.1. In the first stage, we use Stable Diffusion to generate 200 images as negative samples. Additionally, we utilize 6 positive sample images with 360° surrounding camera poses (at 60° intervals) for training. The USD (NeRF) process takes about 1.5 hours on an NVIDIA Tesla V100 (32GB) GPU. To achieve reduced running time, we provide additional discussions and experimental results in Appendix C. For DreamBooth fine-tuning, we train the model for around 600 steps with a learning rate of 2e-6, weight decay of 0.01, and a batch size of 2."
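The hyperparameters quoted above can be summarized in a small configuration sketch. This is a minimal, dependency-free illustration assuming the time-step bounds are annealed linearly (the paper only states the start and end values over the first 5,000 steps); the function and constant names are hypothetical, not taken from the authors' released code.

```python
# Hyperparameters reported in the paper's experiment setup (quoted values).
ADAM_CFG = dict(lr=0.01, weight_decay=0.05, betas=(0.9, 0.95))  # NeRF optimizer
NERF_STEPS = 10_000
CFG_SCALE = 7.5        # classifier-free guidance scale for USD
WARMUP_STEPS = 5_000   # anneal window for the diffusion time-step bounds

def timestep_bounds(step: int,
                    warmup: int = WARMUP_STEPS,
                    t_max_start: float = 0.98, t_max_end: float = 0.5,
                    t_min_start: float = 0.98, t_min_end: float = 0.02):
    """Return (t_min, t_max) for the current optimization step.

    Linear annealing is an assumption; the paper states only that the
    maximum and minimum time steps decrease from 0.98 to 0.5 and 0.02,
    respectively, over the first 5,000 steps, then stay fixed.
    """
    frac = min(step / warmup, 1.0)
    t_max = t_max_start + frac * (t_max_end - t_max_start)
    t_min = t_min_start + frac * (t_min_end - t_min_start)
    return t_min, t_max
```

Under this linear-schedule assumption, `timestep_bounds(0)` gives `(0.98, 0.98)` and any step at or beyond 5,000 gives `(0.02, 0.5)`, matching the quoted start and end values.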