PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting for Novel View Synthesis

Authors: Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, Seungryong Kim

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on large-scale real-world datasets confirm that PF3plat achieves state-of-the-art performance across all benchmarks, with ablation studies validating our design choices.
Researcher Affiliation Collaboration KAIST AI and Microsoft Research Asia. Correspondence to: Jiaolong Yang <EMAIL>, Chong Luo <EMAIL>, Seungryong Kim <EMAIL>.
Pseudocode No The paper includes figures (Figure 1 and Figure 2) illustrating the overall framework and modules. However, these are diagrams and do not present structured pseudocode or algorithm blocks with numbered steps.
Open Source Code No The code and pretrained weights will be made publicly available.
Open Datasets Yes Our extensive evaluations on large-scale real-world indoor and outdoor datasets (Liu et al., 2021; Zhou et al., 2018; Ling et al., 2024) demonstrate that PF3plat sets a new state-of-the-art across all benchmarks. ... We train and evaluate our method on three large-scale datasets: RealEstate10K (Zhou et al., 2018), a collection of both indoor and outdoor scenes; ACID (Liu et al., 2021), a dataset focusing on outdoor coastal scenes; and DL3DV (Ling et al., 2024), which includes diverse real-world indoor and outdoor environments.
Dataset Splits Yes For RealEstate10K, due to some unavailable videos on YouTube, we use a subset of the full dataset, comprising a training set of 21,618 scenes and a test set of 7,200 scenes. For ACID, we train on 10,935 scenes and evaluate on 1,893 scenes. Lastly, for DL3DV, we train on 10,510 different scenes and evaluate on the standard benchmark set of 140 scenes for testing (Ling et al., 2024). ... The test set is divided into three groups, small, middle, and large, based on the extent of overlap between I1 and I2. ... For each scene, we select two context images, I1 and I2, by skipping frames at intervals of 5 and 10, creating two groups per scene, each representing small and large overlap cases. We then randomly select three target images from the sequence between the context images.
Hardware Specification Yes Our model is trained on 4 NVIDIA A100 GPUs for 50,000 iterations using the Adam optimizer (Kingma, 2014)... Specifically, we train MVSplat for 200,000 iterations using a batch size of 8 on a single A6000 GPU. All the hyperparameters are set to the authors' default settings. For CoPoNeRF, we train it for 50,000 iterations using 8 A6000 GPUs with an effective batch size of 64, following the authors' original implementation and hyperparameters. Finally, for InstantSplat (Fan et al., 2024), we train and evaluate on a single A6000 GPU with a batch size of 1 by following the official code.
Software Dependencies No The paper mentions the use of 'FlashAttention (Dao et al., 2022)' and the 'Adam optimizer (Kingma, 2014)', but these are algorithms/methods; no specific software libraries with version numbers are provided.
Experiment Setup Yes Our model is trained on 4 NVIDIA A100 GPUs for 50,000 iterations using the Adam optimizer (Kingma, 2014), with a learning rate of 8×10^-4 and a batch size of 9 per GPU, which takes approximately two days. For training on the RealEstate10K and ACID datasets, we gradually increase the frame distance between I1 and I2 as training progresses, initially setting the frame distance to 15 and gradually increasing it to 75. For the DL3DV dataset, we start with a frame distance of 5 and increase it to 10. The target view is randomly sampled within this range. ... We combine photometric loss, defined as the L2 loss between the rendered and target images, with the SSIM (Wang et al., 2004) loss L_SSIM and the LPIPS (Zhang et al., 2018) loss L_LPIPS to form our reconstruction loss L_img. ... we define our final objective function: L = L_img + L_2D-3D + λ_3D-3D · L_3D-3D, where we set λ_3D-3D = 0.05.
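The training recipe quoted above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the relative SSIM/LPIPS weights inside L_img, and the linear frame-distance schedule are assumptions, since the excerpt only states the loss terms, the λ_3D-3D = 0.05 weight, and that the frame distance grows from 15 to 75 over training.

```python
# Hedged sketch of PF3plat's reported training setup (names/weights assumed).

def reconstruction_loss(l2, l_ssim, l_lpips, w_ssim=1.0, w_lpips=1.0):
    """L_img: photometric L2 plus SSIM and LPIPS terms (unit weights assumed)."""
    return l2 + w_ssim * l_ssim + w_lpips * l_lpips

def final_objective(l_img, l_2d3d, l_3d3d, lambda_3d3d=0.05):
    """Final objective: L_img + L_2D-3D + lambda_3D-3D * L_3D-3D (lambda = 0.05)."""
    return l_img + l_2d3d + lambda_3d3d * l_3d3d

def frame_distance(step, total_steps=50_000, d_start=15, d_end=75):
    """Hypothetical linear curriculum for the context-frame distance."""
    frac = min(step, total_steps) / total_steps
    return round(d_start + (d_end - d_start) * frac)

# Example with placeholder per-term loss values:
l_img = reconstruction_loss(0.10, 0.05, 0.02)   # 0.17
total = final_objective(l_img, 0.30, 0.20)      # 0.17 + 0.30 + 0.05 * 0.20
print(round(total, 4), frame_distance(25_000))  # 0.48 45
```

The curriculum matters because widening the gap between I1 and I2 progressively exposes the model to harder, lower-overlap pose estimation cases as training stabilizes.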