Controlling Space and Time with Diffusion Models

Authors: Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David Fleet

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. We include extensive evaluation and comparisons to prior work, including fixes to SfM-based metrics (He et al., 2024), and a novel keypoint distance metric to detect the presence of dynamics.
Researcher Affiliation | Industry | Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David Fleet (Google DeepMind)
Pseudocode | No | The paper contains architectural diagrams (Fig. 2) and mathematical formulations (e.g., Equation 1), but no explicitly labeled pseudocode or algorithm blocks describing a method or procedure in a structured, code-like format.
Open Source Code | No | The paper mentions 'See https://4d-diffusion.github.io for video samples.' multiple times (abstract, Figure 3, Figure 6, Section J). This URL points to video samples, not to source code for the described method, and there is no explicit statement or link indicating that the code is publicly available.
Open Datasets | Yes | Figure 1: Zero-shot samples from 4DiM on LLFF (Mildenhall et al., 2019) and DAVIS (Pont-Tuset et al., 2017)... and The 3D datasets used to train 4DiM include ScanNet++ (Yeshwanth et al., 2023) and Matterport3D (Chang et al., 2017)... and One particularly rich dataset for training 3D models is RealEstate10K (Zhou et al., 2018) (RE10K).
Dataset Splits | Yes | We use 1% of the dataset as our validation split and compute metrics on all baselines ourselves for this split, noting they might instead be advantaged as our test data may exist in their training data (for PNVS, all our test data is in fact part of their training dataset). For the OOD case, we use the LLFF dataset (Mildenhall et al., 2019). We present our main results on 3D novel view synthesis conditioned on a single image in Tab. 1 and Fig. 3... Quantitative metrics are computed on 128 scenes from our RealEstate10K test split, and on all 44 scenes in LLFF.
Hardware Specification | Yes | We train 8-frame 4DiM models for 1M steps. Using 64 TPU v5e chips, we achieve a throughput of approximately 1 step per second with a batch size of 128.
Software Dependencies | No | The paper mentions the use of the Adam optimizer, bfloat16 activations, and techniques like FSDP and Ring Attention. However, it does not provide specific version numbers for any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA) that would be necessary to replicate the experimental environment.
Experiment Setup | Yes | We train 8-frame 4DiM models for 1M steps. Using 64 TPU v5e chips, we achieve a throughput of approximately 1 step per second with a batch size of 128. All 8-frame models we train (i.e. base and super-resolution models) use this batch size, and we train all models with an Adam (Kingma & Ba, 2014) learning rate of 0.0001, linearly warmed up for the first 10,000 training steps, which we found was the best peak value in early sweeps. No learning rate decay is used. We follow Ho et al. (2020) and keep an exponential moving average of the parameters with a decay rate of 0.9999 to use at inference time for improved sample quality. The model is finetuned with the same number of chips, albeit only for 50,000 steps and at a batch size of 32.
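The quoted experiment setup pins down a concrete training schedule: linear warmup to a peak Adam learning rate of 0.0001 over 10,000 steps with no decay, and a parameter EMA with decay 0.9999 following Ho et al. (2020). Since no code is released, the sketch below is only an illustration of those quoted numbers; all function and variable names are our own, not from the paper.

```python
# Illustrative sketch of the quoted 4DiM training schedule. Constants are
# taken verbatim from the paper's text; everything else is assumed.

PEAK_LR = 1e-4         # "learning rate of 0.0001"
WARMUP_STEPS = 10_000  # "linearly warmed up for the first 10,000 training steps"
EMA_DECAY = 0.9999     # "exponential moving average ... decay rate of 0.9999"

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then constant (the paper uses no decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR

def ema_update(ema_params, params, decay=EMA_DECAY):
    """One EMA step over a flat list of scalar parameters (hypothetical helper)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Rough wall-clock estimate from the quoted throughput: 1M steps at roughly
# 1 step/s on 64 TPU v5e chips is on the order of 11-12 days of training.
train_days = 1_000_000 / 1.0 / 86_400
```

The warmup slope implied by these numbers is small (the first step runs at 1e-8), which is consistent with the paper's report that 0.0001 was the best peak value found in early sweeps rather than a value reached immediately.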