Controlling Space and Time with Diffusion Models

Authors: Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David Fleet

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. We include extensive evaluation and comparisons to prior work, including fixes to SfM-based metrics (He et al., 2024), and a novel keypoint distance metric to detect the presence of dynamics.
Researcher Affiliation | Industry | Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David Fleet (Google DeepMind)
Pseudocode | No | The paper contains architectural diagrams (Fig. 2) and mathematical formulations (e.g., Equation 1), but no explicitly labeled pseudocode or algorithm blocks describing a method or procedure in a structured, code-like format.
Open Source Code | No | The paper mentions 'See https://4d-diffusion.github.io for video samples.' multiple times (abstract, Figure 3, Figure 6, Section J). This URL points to video samples, not to source code for the described method, and there is no explicit statement or link indicating that the code is publicly available.
Open Datasets | Yes | Figure 1: Zero-shot samples from 4DiM on LLFF (Mildenhall et al., 2019) and DAVIS (Pont-Tuset et al., 2017)... and The 3D datasets used to train 4DiM include ScanNet++ (Yeshwanth et al., 2023) and Matterport3D (Chang et al., 2017)... and One particularly rich dataset for training 3D models is RealEstate10K (Zhou et al., 2018) (RE10K).
Dataset Splits | Yes | We use 1% of the dataset as our validation split and compute metrics on all baselines ourselves for this split, noting they might instead be advantaged as our test data may exist in their training data (for PNVS, all our test data is in fact part of their training dataset). For the OOD case, we use the LLFF dataset (Mildenhall et al., 2019). We present our main results on 3D novel view synthesis conditioned on a single image in Tab. 1 and Fig. 3... Quantitative metrics are computed on 128 scenes from our RealEstate10K test split, and on all 44 scenes in LLFF.
Hardware Specification | Yes | We train 8-frame 4DiM models for 1M steps. Using 64 TPU v5e chips, we achieve a throughput of approximately 1 step per second with a batch size of 128.
Software Dependencies | No | The paper mentions the use of the Adam optimizer, bfloat16 activations, and techniques like FSDP and Ring Attention. However, it does not provide specific version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) that would be necessary to replicate the experimental environment.
Experiment Setup | Yes | We train 8-frame 4DiM models for 1M steps. Using 64 TPU v5e chips, we achieve a throughput of approximately 1 step per second with a batch size of 128. All 8-frame models we train (i.e. base and super-resolution models) use this batch size, and we train all models with an Adam (Kingma & Ba, 2014) learning rate of 0.0001, linearly warmed up for the first 10,000 training steps, which we found was the best peak value in early sweeps. No learning rate decay is used. We follow Ho et al. (2020) and keep an exponential moving average of the parameters with a decay rate of 0.9999 to use at inference time for improved sample quality. The model is finetuned with the same number of chips, albeit only for 50,000 steps and at a batch size of 32.
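The quoted experiment setup pins down a concrete training schedule: linear warmup to a peak Adam learning rate of 0.0001 over 10,000 steps with no decay, and a parameter EMA with decay 0.9999 following Ho et al. (2020). Since no code is released, the sketch below is only an illustration of those quoted numbers; all function and variable names are our own, not from the paper.

```python
# Illustrative sketch of the quoted 4DiM training schedule. Constants are
# taken verbatim from the paper's text; everything else is assumed.

PEAK_LR = 1e-4         # "learning rate of 0.0001"
WARMUP_STEPS = 10_000  # "linearly warmed up for the first 10,000 training steps"
EMA_DECAY = 0.9999     # "exponential moving average ... decay rate of 0.9999"

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then constant (the paper uses no decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR

def ema_update(ema_params, params, decay=EMA_DECAY):
    """One EMA step over a flat list of scalar parameters (hypothetical helper)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Rough wall-clock estimate from the quoted throughput: 1M steps at roughly
# 1 step/s on 64 TPU v5e chips is on the order of 11-12 days of training.
train_days = 1_000_000 / 1.0 / 86_400
```

The warmup slope implied by these numbers is small (the first step runs at 1e-8), which is consistent with the paper's report that 0.0001 was the best peak value found in early sweeps rather than a value reached immediately.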