LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias
Authors: Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, Zexiang Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more results: https://haian-jin.github.io/projects/LVSM/. 4 EXPERIMENTS: We first describe the datasets we use and baseline methods we compare to, then present the results of LVSM for both object-level and scene-level novel view synthesis. In this section, we describe our experimental setup and datasets (Sec. 4.1), introduce our model training details (Sec. 4.2), report evaluation results (Sec. 4.3), and perform an ablation study (Sec. 4.4). Object-Level Results: We compare with Instant3D's Triplane-LRM (Li et al., 2023) and GS-LRM (Zhang et al., 2024) at a resolution of 512. As shown on the left side of Table 1, our LVSM method achieves the best performance. In particular, at 512 resolution, our decoder-only LVSM achieves a 3 dB and 2.8 dB PSNR gain against the best prior method GS-LRM on ABO and GSO, respectively; our encoder-decoder LVSM achieves performance comparable to GS-LRM. |
| Researcher Affiliation | Collaboration | Haian Jin1 Hanwen Jiang2 Hao Tan3 Kai Zhang3 Sai Bi3 Tianyuan Zhang4 Fujun Luan3 Noah Snavely1 Zexiang Xu3 1Cornell University 2The University of Texas at Austin 3Adobe Research 4Massachusetts Institute of Technology |
| Pseudocode | No | The paper describes the model architecture in Section 3.2 and provides a detailed architectural diagram in Figure 8, along with mathematical formulations (e.g., Equations 1-8). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured, step-by-step instructions typical of pseudocode. |
| Open Source Code | No | Please see our website for more results: https://haian-jin.github.io/projects/LVSM/. This is a project page demonstrating results, not an explicit statement of code release or a link to a code repository for the methodology described in the paper. |
| Open Datasets | Yes | We use the Objaverse dataset (Deitke et al., 2023) to train LVSM. ... We test on two object-level datasets, Google Scanned Objects (Downs et al., 2022) (GSO) and Amazon Berkeley Objects (Collins et al., 2022b) (ABO). ... We use the RealEstate10K dataset (Zhou et al., 2018), which contains 80K video clips curated from 10K YouTube videos of both indoor and outdoor scenes. |
| Dataset Splits | Yes | We follow the rendering settings in GS-LRM (Zhang et al., 2024) and render 32 random views of 730K objects. ... Following Instant3D (Li et al., 2023) and GS-LRM (Zhang et al., 2024), we use 4 sparse views as test inputs and another 10 views as target images. ... We follow the train/test data split used in pixelSplat (Charatan et al., 2024). |
| Hardware Specification | Yes | Our final models were trained on 64 A100 GPUs for 3-7 days, depending on the data type and model architecture, but we found that even with just 1-2 A100 GPUs for training, our model (with a decreased model and batch size) still outperforms all previous methods trained with equal or even more compute resources. ... with only a single A100 80G GPU for 7 days. |
| Software Dependencies | Yes | We use FlashAttention-v2 (Dao, 2023) in xFormers (Lefaudeux et al., 2022), gradient checkpointing (Chen et al., 2016), and mixed-precision training with the Bfloat16 data type to accelerate training. ... We train our model with the AdamW optimizer (Kingma, 2014). |
| Experiment Setup | Yes | We train LVSM with 64 A100 GPUs with a batch size of 8 per GPU. We use a cosine learning rate schedule with a peak learning rate of 4e-4 and a warmup of 2500 iterations. We train LVSM for 80k iterations on the object and 100k on scene data. LVSM uses an image patch size of p = 8 and token dimension d = 768. The details of the transformer layers follow GS-LRM (Zhang et al., 2024) with an additional QK-Norm. Unless noted, all models have 24 transformer layers, the same as GS-LRM. The encoder-decoder LVSM has 12 encoder layers and 12 decoder layers, with 3072 latent tokens. ... For object-level experiments, we use 4 input views and 8 target views for each training example by default. ... For scene-level experiments, we use 2 input views and 6 target views for each training example. ... We use a perceptual loss weight λ of 0.5 and 1.0 on scene-level and object-level experiments, respectively. We do not use bias terms in our model, for both Linear and LayerNorm layers. We initialize the model weights with a normal distribution of zero mean and standard deviation 0.02/(2(idx+1))^0.5, where idx denotes the transformer layer index. ... The β1 and β2 are set to 0.9 and 0.95 respectively, following GS-LRM. We use a weight decay of 0.05 on all parameters except the weights of LayerNorm layers. |
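The quoted setup pins down two reproducible numerical details: the learning-rate schedule (peak 4e-4, 2500 warmup iterations, cosine decay over 80k/100k iterations) and the depth-dependent weight-init standard deviation 0.02/(2(idx+1))^0.5. The sketch below encodes both; note the linear-warmup shape and decay-to-zero endpoint are assumptions, as the paper only states "cosine learning rate schedule with a peak learning rate of 4e-4 and a warmup of 2500 iterations".

```python
import math

# Reported hyperparameters (object-level run; scene-level uses 100k iters).
PEAK_LR = 4e-4
WARMUP_ITERS = 2500
TOTAL_ITERS = 80_000

def lr_at(step: int) -> float:
    """Learning rate at a given iteration.

    Assumed shape: linear warmup to PEAK_LR, then cosine decay to zero.
    """
    if step < WARMUP_ITERS:
        return PEAK_LR * step / WARMUP_ITERS
    progress = (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

def init_std(layer_idx: int) -> float:
    """Per-layer weight-init std: 0.02 / sqrt(2 * (idx + 1)), idx zero-based."""
    return 0.02 / math.sqrt(2 * (layer_idx + 1))
```

For example, `lr_at(WARMUP_ITERS)` returns the peak 4e-4, and `init_std` shrinks from 0.02/√2 at the first layer to 0.02/√48 at layer 24, a depth-scaled initialization in the spirit of GPT-2-style residual scaling.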