ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis
Authors: Bernard Spiegl, Andrea Perin, Stephane Deny, Alexander Ilin
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed approach on the Neural 3D Mesh Renderer (NMR) dataset (Kato et al., 2018; Chang et al., 2015), consisting of a wide variety of classes and input view poses. Through quantitative evaluation we show improved performance compared to relevant methods. We also qualitatively explore intermediate outputs of the model and confirm the soundness of our pixel-weighting mechanism to infer and adaptively adjust the importance of each input view: the inferred weighting scheme aligns with the human intuition that input views closer to the target view should be more informative than ones farther away. |
| Researcher Affiliation | Collaboration | Bernard Spiegl (Aalto University); Andrea Perin (Aalto University); Stéphane Deny (Aalto University); Alexander Ilin (System 2 AI, Aalto University) |
| Pseudocode | Yes | Algorithm 1 (Composing View Contributions: Training); Algorithm 2 (Composing View Contributions: Inference). Listing 1 shows PyTorch pseudocode for aggregating the view contributions at each diffusion step, given an arbitrary, unordered and pose-free collection of input views. |
| Open Source Code | Yes | Code is available at https://github.com/bronemos/view-fusion. |
| Open Datasets | Yes | We evaluate our method on a relatively small, but diverse dataset, NMR, consisting of a variety of scenes and spanning multiple classes. We show that our model is capable of handling a wide variety of settings, while offering performances near or above the current comparable methods. Neural 3D Mesh Renderer Dataset (NMR). NMR has been used extensively in the previous works (Lin et al., 2023; Sajjadi et al., 2022b; Yu et al., 2021; Sitzmann et al., 2021) and serves as a good benchmark while keeping the computational footprint relatively low. The dataset is based on 3D renderings provided in Kato et al. (2018) and consists of 13 classes (sofa, airplane, lamp, telephone, vessel, loudspeaker, chair, cabinet, table, display, car, bench, rifle) from Shape Net Core (Chang et al., 2015) that were rendered from 24 azimuth angles (rotated around the vertical axis) at a fixed elevation angle using the same camera and lighting conditions. |
| Dataset Splits | Yes | In total, there are 44k different objects, split across training, validation and testing sets as follows: 31k, 4k, 9k. There are no overlaps in individual objects between the sets. |
| Hardware Specification | Yes | The model is conditioned on one to six input views and trained for 710k steps using a batch size of 112 and 4 V100 GPUs. The total training time using this setup amounts to approximately 6.5 days. At inference time, we run the model for 2000 timesteps, which takes around 2 minutes and does not depend on the number of views used for conditioning (as long as they fit in memory), since all the streams are treated as a batch. Using a single 32GB V100, we are able to process a batch size of 28 with up to six conditional input views, meaning that our model is able to process up to 168 images of size 64×64 at a time. |
| Software Dependencies | No | No specific software dependencies with version numbers were mentioned in the paper. |
| Experiment Setup | Yes | We base our U-Net architecture on Saharia et al. (2022) with modifications listed in Section 3.1. Following Karras et al. (2022), a linear noise schedule is applied for the diffusion process, spanning (1e-6, 0.01) over 2000 timesteps, both for training and inference. We train the model using an L2 loss computed between the noise prediction and the true noise. Furthermore, a learning rate scheduler is used in combination with the Adam optimizer. The learning rate starts at 5e-5 with 10k warm-up steps, after which it peaks at 1e-4. The model is conditioned on one to six input views and trained for 710k steps using a batch size of 112 and 4 V100 GPUs. |
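The pixel-weighting mechanism described above (Listing 1 in the paper aggregates per-view contributions at each diffusion step) can be sketched as a pixel-wise softmax over per-view weight maps. This is a minimal NumPy illustration, not the paper's actual implementation; the function name and the assumption that weights are normalized across the view axis with a softmax are ours.

```python
import numpy as np

def compose_view_contributions(eps_per_view, weight_logits):
    """Aggregate per-view noise predictions with per-pixel weights.

    eps_per_view: (N, C, H, W) noise predictions, one per input view.
    weight_logits: (N, 1, H, W) unnormalized per-pixel weight maps.
    Returns a single (C, H, W) prediction as a per-pixel convex
    combination of the N view contributions.
    """
    # Softmax across the view axis (subtract max for numerical stability).
    logits = weight_logits - weight_logits.max(axis=0, keepdims=True)
    w = np.exp(logits)
    w = w / w.sum(axis=0, keepdims=True)          # (N, 1, H, W), sums to 1 per pixel
    return (w * eps_per_view).sum(axis=0)         # broadcast over channels -> (C, H, W)
```

Because the weights sum to one at every pixel, views inferred to be closer to the target pose can dominate the composition without any explicit pose input, matching the adaptive weighting behavior the paper reports.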
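The reported training hyperparameters (linear noise schedule over (1e-6, 0.01) for 2000 timesteps; learning rate warming up from 5e-5 to a 1e-4 peak over 10k steps) can be written down directly. The schedules below are a sketch under stated assumptions: the warm-up ramp is assumed linear, and the behavior after the peak is not specified in the excerpt, so we simply hold the peak value.

```python
import numpy as np

def linear_beta_schedule(num_steps=2000, beta_start=1e-6, beta_end=0.01):
    # Linear noise schedule spanning (1e-6, 0.01) over 2000 timesteps,
    # as stated in the reported setup.
    return np.linspace(beta_start, beta_end, num_steps)

def warmup_lr(step, warmup_steps=10_000, start_lr=5e-5, peak_lr=1e-4):
    # Assumed linear ramp from 5e-5 to the 1e-4 peak over 10k warm-up
    # steps; post-peak behavior is unspecified, so the peak is held.
    if step >= warmup_steps:
        return peak_lr
    return start_lr + (peak_lr - start_lr) * step / warmup_steps
```

These values match the excerpt's stated setup and make the training configuration easy to reproduce or audit at a glance.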