TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers
Authors: Chuanrui Zhang, Yingshuang Zou, Zhuoling Li, Minmin Yi, Haoqian Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TranSplat on two large-scale benchmarks: RealEstate10K (Zhou et al. 2018) and ACID (Liu et al. 2021a). Extensive experiments are conducted and demonstrate that TranSplat achieves the best results in G-3DGS. Notably, compared to existing counterparts, TranSplat presents strong cross-dataset generalization ability. Comprehensively, our main contributions are as follows: We propose to utilize the depth confidence map to enhance matching between various views and correspondingly significantly improve the reconstruction precision in regions with insufficient texture or repetitive patterns. We propose a strategy that encodes the priors of monocular depth estimators into the prediction of Gaussian parameters, ensuring precise 3D Gaussian centers are estimated even in non-overlapping areas. The derived method TranSplat achieves the best results on two large-scale benchmarks and presents strong cross-dataset generalization ability. |
| Researcher Affiliation | Collaboration | Chuanrui Zhang1*, Yingshuang Zou1*, Zhuoling Li2, Minmin Yi3, Haoqian Wang1 1Tsinghua University, 2The University of Hong Kong, 3E-surfing Vision Technology Co., Ltd EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes methods through figures (Figure 2, Figure 3, Figure 4) and textual explanations but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project Page: https://xingyoujun.github.io/transplat/ (Visiting this project page explicitly states: 'Code will be released soon.') |
| Open Datasets | Yes | We evaluate TranSplat on two large-scale benchmarks: RealEstate10K (Zhou et al. 2018) and ACID (Liu et al. 2021a) datasets. Additionally, to assess cross-dataset generalization, we evaluate all methods on the multi-view DTU dataset (Jensen et al. 2014). |
| Dataset Splits | Yes | RealEstate10K comprises home walkthrough videos from YouTube, with 67,477 scenes for training and 7,289 scenes for testing. The ACID dataset, featuring aerial landscape videos, includes 11,075 training scenes and 1,972 testing scenes. |
| Hardware Specification | Yes | All models are trained with a batch size of 14 on 7 RTX 3090 GPUs for 300,000 iterations using the Adam (Kingma 2014) optimizer. During inference, we measure speed and memory cost with one RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions models, optimizers (Adam), and frameworks (Swin Transformer, Depth Anything V2, diffusion models) but does not provide specific version numbers for any software libraries or dependencies used for implementation. |
| Experiment Setup | Yes | Input images are resized to 256×256, following the method outlined in (Chen et al. 2024). In all experiments, the number of depth candidates is set to 128. We sample P = 4 deformable points in the Depth-Aware Deformable Matching Transformer for the main results. For the Depth Anything V2 (Yang et al. 2024) module, we use the base size to balance training cost and result quality. All models are trained with a batch size of 14 on 7 RTX 3090 GPUs for 300,000 iterations using the Adam (Kingma 2014) optimizer. |
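As a quick reference, the hyperparameters reported in the Experiment Setup row can be collected into a single configuration sketch. This is a minimal illustration only: the authors' code is not yet released, so all field and function names below are assumptions, not identifiers from the TranSplat codebase.

```python
# Hedged sketch: reported TranSplat training hyperparameters gathered into a
# plain config dict. Field names are illustrative, not from the (unreleased)
# official implementation.
TRAIN_CONFIG = {
    "image_size": (256, 256),        # input images resized to 256x256
    "depth_candidates": 128,         # number of depth candidates
    "deformable_points": 4,          # P in the Depth-Aware Deformable Matching Transformer
    "depth_anything_v2_size": "base",  # monocular depth prior module size
    "batch_size": 14,                # total batch size across GPUs
    "num_gpus": 7,                   # RTX 3090 GPUs
    "iterations": 300_000,
    "optimizer": "adam",
}


def per_gpu_batch_size(cfg: dict) -> int:
    """Per-GPU batch size under naive data parallelism (assumption:
    the total batch of 14 is split evenly across the 7 GPUs)."""
    return cfg["batch_size"] // cfg["num_gpus"]
```

Under that even-split assumption, each GPU would process 2 samples per iteration; the paper itself does not state how the batch is distributed.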