Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors
Authors: Soumava Paul, Prakhar Kaushik, Alan Yuille
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on the MipNeRF360 and DL3DV-10K benchmark datasets demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed (precomputed camera parameters are given) reconstruction methods in complex 360° scenes. Our project page provides additional results, videos, and code. We compare GScenes with state-of-the-art pose-free and posed sparse-view reconstruction methods in Figs. 9, 10 and Tables 1, 2. We also ablate the different components and design choices of our diffusion model. |
| Researcher Affiliation | Academia | Soumava Paul, Prakhar Kaushik, Alan Yuille CCVL, Johns Hopkins University EMAIL |
| Pseudocode | Yes | Algorithm 1 Gaussian Scenes Training |
| Open Source Code | Yes | Our project page (https://gaussianscenes.github.io) provides additional results, videos, and code. ... An open-source low-cost solution with lower data and compute requirements compared to state-of-the-art posed reconstruction methods. |
| Open Datasets | Yes | Evaluations on the MipNeRF360 and DL3DV-10K benchmark datasets demonstrate that our method surpasses existing pose-free techniques... We evaluate GScenes on the 9 scenes of the MipNeRF360 dataset (Barron et al., 2022), and 15 scenes (out of 140) of the DL3DV-10K benchmark dataset. ... We fine-tune our diffusion model on a mix of 1043 scenes encompassing Tanks and Temples (Knapitsch et al., 2017), CO3D (Reizenstein et al., 2021), Deep Blending (Hedman et al., 2018), and the 1k subset of DL3DV-10K (Ling et al., 2024) to obtain a total of 171,461 data samples. |
| Dataset Splits | Yes | For MipNeRF360, we pick the M-view splits as proposed by ReconFusion and CAT3D and evaluate all baselines on the official test views, where every 8th image is held out for testing. For DL3DV-10K scenes, we create M-view splits using a greedy view-selection heuristic that maximizes scene coverage given a set of dense training views, similar to the heuristic proposed in Wu et al. (2024). For test views, we hold out every 8th image as in MipNeRF360. For a given scene, we fit sparse models for M ∈ {3, 6, 9, 18} views. |
| Hardware Specification | Yes | GScenes is implemented in PyTorch 2.3.1 on single A5000/A6000 GPUs. Finetuning this model takes about 4 days on a single A6000 GPU. GScenes completes full 3D reconstruction in approximately 5 minutes on a single A6000 GPU. |
| Software Dependencies | Yes | GScenes is implemented in PyTorch 2.3.1 on single A5000/A6000 GPUs. |
| Experiment Setup | Yes | The diffusion model is finetuned for 100k iterations (batch size 16, learning rate 1e-4) with conditioning-element dropout probability of 0.05 for CFG. Following InstantSplat, we fit 3D Gaussians to sparse inputs and MASt3R point clouds for 1k iterations to obtain G. We use classifier-free guidance scales s_I = s_C = 3.0 and sample with k = 20 DDIM steps. We linearly decay w_d from 1 to 0.01 and the L_sample weight from 1 to 0.1 over 10k iterations. We finetune this VAE on a subset of our dataset for 5000 training steps with batch size 16 and learning rate 1e-5. |
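The split convention quoted above (every 8th image held out for testing, with M training views drawn from the remainder) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_splits` is hypothetical, and even subsampling of the training pool is a simple stand-in for the paper's greedy coverage heuristic, which requires camera/point-cloud information not modeled here.

```python
def make_splits(image_paths, num_train_views):
    """Hold out every 8th image for testing (the MipNeRF360 convention
    described above); subsample M training views from the rest."""
    # Every 8th image (indices 0, 8, 16, ...) is reserved for evaluation.
    test = [p for i, p in enumerate(image_paths) if i % 8 == 0]
    pool = [p for i, p in enumerate(image_paths) if i % 8 != 0]
    # Evenly spread M picks over the pool -- a placeholder for the
    # greedy view-selection heuristic used for DL3DV-10K in the paper.
    idx = [round(j * (len(pool) - 1) / max(num_train_views - 1, 1))
           for j in range(num_train_views)]
    train = [pool[i] for i in idx]
    return train, test
```

For example, with 24 images and M = 3, the test set is images 0, 8, and 16, and the three training views are spread across the remaining 21 frames.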
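The experiment setup decays two loss weights linearly (w_d from 1 to 0.01, the L_sample weight from 1 to 0.1) over 10k iterations. A generic linear-decay schedule matching that description might look like the sketch below; the function name and the hold-at-end behavior after 10k steps are assumptions, not confirmed details from the paper.

```python
def linear_decay(step, total_steps=10_000, start=1.0, end=0.01):
    """Linearly anneal a loss weight from `start` to `end` over
    `total_steps` optimization steps, then hold it at `end`."""
    t = min(step / total_steps, 1.0)  # fraction of the decay completed
    return start + t * (end - start)
```

With the defaults this reproduces the quoted w_d schedule (1.0 at step 0, 0.01 at step 10k); passing `end=0.1` gives the L_sample weight schedule.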