DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Authors: Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments reveal the superiority of DIFFSPLAT in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism. ... Quantitative evaluations on T3Bench prompts for text-conditioned generation. ... Quantitative evaluations on GSO for image-conditioned generation. |
| Researcher Affiliation | Collaboration | Chenguo Lin1, Panwang Pan2, Bangbang Yang2, Zeming Li2, Yadong Mu1; 1Peking University, 2ByteDance |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are publicly available at https://chenguolin.github.io/projects/DiffSplat. |
| Open Datasets | Yes | All our models in this work are trained on G-Objaverse (Qiu et al., 2024), a high-quality subset of Objaverse (Deitke et al., 2023) comprising images from 38 different views of around 265K 3D objects. Captions of these 3D objects are provided by Cap3D (Luo et al., 2023; 2024). To quantitatively evaluate the performance of text-conditioned generation, 300 text prompts from T3Bench (He et al., 2023), describing a single object, a single object with surroundings, and multiple objects, are employed as conditions. ... 300 objects from the unseen GSO (Downs et al., 2022) dataset are randomly selected and rendered to serve as ground-truth images |
| Dataset Splits | No | The paper mentions using 300 text prompts from T3Bench and 300 objects from the GSO dataset for evaluation, but it does not specify the training/validation splits for the main G-Objaverse dataset, only its total size and number of views. |
| Hardware Specification | Yes | Training batch size for reconstruction and auto-encoding is 64 in total across up to 16 A100 GPUs with gradient accumulation and a peak learning rate of 4e-4. For diffusion models, the batch size and peak learning rate are 128 and 1e-4 respectively. ... Notably, with 2D generative priors, DIFFSPLAT only takes about 3 days on 8 A100 GPUs to generate decent results with fp16 mixed precision |
| Software Dependencies | No | The paper names specific diffusion models (SD1.5, SDXL, PixArt-α, PixArt-Σ, SD3), the AdamW optimizer, and the DPM-Solver++ ODE solver, but it does not provide version numbers for underlying software libraries such as PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | For Gaussian splat grid reconstruction, we train a lightweight 12-layer and 8-head Transformer encoder (Vaswani et al., 2017) with 512 attention dimensions and a patch size of 8, whose parameter size is only 42M... s_min and s_max are set to 5e-4 and 2e-2 respectively... The input views V_in = 4 are evenly distributed and rendering views V = 8 include 4 other random viewpoints. All weighting terms are set to 1. ... All experiments are conducted at the 256×256 resolution in this work. Training batch size for reconstruction and auto-encoding is 64 in total across up to 16 A100 GPUs with gradient accumulation and a peak learning rate of 4e-4. For diffusion models, the batch size and peak learning rate are 128 and 1e-4 respectively. The AdamW optimizer (Loshchilov & Hutter, 2018) with weight decay and a cosine learning rate scheduler (Loshchilov & Hutter, 2016) with linear warm-up are adopted for parameter optimization. ... The DPM-Solver++ (Lu et al., 2022a;b) ODE solver with 20 inference steps is adopted... The flow-based model, i.e., SD3 (Esser et al., 2024), uses the original flow matching Euler ODE solver (Lipman et al., 2023) with 28 steps... classifier-free guidance (Ho & Salimans, 2021) scales for each model are the same as their default values: 7.5 for SD1.5, 5 for SDXL, 4.5 for PixArt-α and PixArt-Σ, and 7 for SD3. In the image-conditioned generation, all models are fine-tuned to predict velocity... and their guidance scales are all set to 2. |
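The optimization recipe quoted in the table (AdamW with a cosine learning-rate schedule and linear warm-up, peak learning rates of 4e-4 for reconstruction and 1e-4 for diffusion) can be sketched as a plain schedule function. The warm-up and total step counts below are illustrative placeholders, not values reported in the paper:

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Cosine learning-rate schedule with linear warm-up.

    Matches the schedule described in the paper; warmup_steps and
    total_steps here are assumed values for illustration only.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: reconstruction training with the paper's peak LR of 4e-4.
lr_start = lr_at_step(0, peak_lr=4e-4, warmup_steps=100, total_steps=1000)
lr_peak = lr_at_step(100, peak_lr=4e-4, warmup_steps=100, total_steps=1000)
lr_end = lr_at_step(1000, peak_lr=4e-4, warmup_steps=100, total_steps=1000)
```

In a PyTorch training loop the same curve would typically be supplied through `torch.optim.lr_scheduler.LambdaLR` wrapping an `AdamW` optimizer.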