Optimized View and Geometry Distillation from Multi-view Diffuser
Authors: Youjia Zhang, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations demonstrate that our optimized geometry and view distillation technique generates comparable results to the state-of-the-art models trained on extensive datasets, all while maintaining freedom in camera positioning. ... We conduct extensive experiments, both qualitatively and quantitatively, to demonstrate the effectiveness of our method. |
| Researcher Affiliation | Academia | Youjia Zhang¹, Zikai Song¹, Junqing Yu¹, Yawei Luo² and Wei Yang¹ — ¹Huazhong University of Science and Technology, ²Zhejiang University |
| Pseudocode | No | The paper describes the methodology using prose and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code of our work is publicly available at: https://youjiazhang.github.io/USD/. |
| Open Datasets | Yes | Following prior research [Liu et al., 2023b; Liu et al., 2024; Long et al., 2024], we adopt the Google Scanned Object dataset [Downs et al., 2022] for our evaluation, which includes a wide variety of common everyday objects. |
| Dataset Splits | No | The paper states it uses the Google Scanned Object dataset and that its evaluation dataset matches SyncDreamer's, consisting of 30 objects. However, it does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts). |
| Hardware Specification | Yes | The USD (NeRF) process takes about 1.5 hours on an NVIDIA Tesla V100 (32GB) GPU. |
| Software Dependencies | Yes | We adopt the Stable Diffusion [Takagi and Nishimoto, 2023] model of V2.1. The DreamBooth backbone is implemented using Stable Diffusion V2.1. |
| Experiment Setup | Yes | The NeRF is optimized for 10,000 steps with an Adam optimizer at a learning rate of 0.01, weight decay of 0.05, and betas of (0.9, 0.95). For USD, the maximum and minimum time steps are decreased from 0.98 to 0.5 and 0.02, respectively, over the first 5,000 steps. We adopt the Stable Diffusion [Takagi and Nishimoto, 2023] model of V2.1. The classifier-free guidance (CFG) scale of the USD is set to 7.5 following [Wang et al., 2023b]. The DreamBooth backbone is implemented using Stable Diffusion V2.1. In the first stage, we use Stable Diffusion to generate 200 images as negative samples. Additionally, we utilize 6 positive sample images with 360° surrounding camera poses (at 60° intervals) for training. The USD (NeRF) process takes about 1.5 hours on an NVIDIA Tesla V100 (32GB) GPU. To achieve reduced running time, we provide additional discussions and experimental results in Appendix C. For DreamBooth fine-tuning, we train the model for around 600 steps with a learning rate of 2e-6, weight decay of 0.01, and a batch size of 2. |
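The timestep annealing quoted in the setup row can be sketched as a small helper: the maximum and minimum diffusion timesteps both start at 0.98 and decrease to 0.5 and 0.02, respectively, over the first 5,000 of the 10,000 optimization steps. The linear schedule and the function name `usd_timestep_bounds` are assumptions for illustration; the report gives only the endpoints, not the exact annealing curve.

```python
import random

def usd_timestep_bounds(step, anneal_steps=5000,
                        t_start=0.98, t_max_end=0.5, t_min_end=0.02):
    """Return (t_min, t_max) diffusion timestep bounds at a given step.

    Assumes a linear decay (an illustrative choice; the paper only
    states the endpoints): both bounds start at 0.98 and reach
    t_max_end=0.5 and t_min_end=0.02 after `anneal_steps`, then stay fixed.
    """
    frac = min(step / anneal_steps, 1.0)
    t_max = t_start + frac * (t_max_end - t_start)
    t_min = t_start + frac * (t_min_end - t_start)
    return t_min, t_max

def sample_timestep(step, rng=random):
    """Uniformly sample a timestep within the current annealed bounds."""
    lo, hi = usd_timestep_bounds(step)
    return rng.uniform(lo, hi)
```

With this schedule, early optimization steps draw high-noise timesteps near 0.98 (coarse structure), while later steps sample from the wider [0.02, 0.5] range for finer detail.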