VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Authors: Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David Lindell, Sergey Tulyakov

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We thoroughly evaluate this approach, including comparisons to previous camera control methods, which we adapt to the video transformer architecture. We show state-of-the-art results in camera-controllable video synthesis by applying the proposed conditioning method and fine-tuning scheme to the Snap Video-based model (Menapace et al., 2024). From Section 4 (Experiments): We provide a qualitative and quantitative assessment of our approach compared to the baselines in Fig. 4 and in Tab. 1.
Researcher Affiliation Collaboration Sherwin Bahmani1,2,3 Ivan Skorokhodov3 Aliaksandr Siarohin3 Willi Menapace3 Guocheng Qian3 Michael Vasilkovsky3 Hsin-Ying Lee3 Chaoyang Wang3 Jiaxu Zou3 Andrea Tagliasacchi1,4 David B. Lindell1,2 Sergey Tulyakov3 1University of Toronto 2Vector Institute 3Snap Inc. 4SFU
Pseudocode No The paper describes the method using mathematical formulations and architectural diagrams (e.g., Figure 3 and equation 3), but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code Yes Video results: https://snap-research.github.io/vd3d. To further facilitate reproducibility, we include the source code of our camera-controlled FIT block as supplementary material.
Open Datasets Yes To develop the camera control methods proposed in this paper, we used Real Estate10K (Zhou et al., 2018). Real Estate10K is released and open-sourced by Google LLC under a Creative Commons Attribution 4.0 International License, and sourced from content using a CC-BY license. The dataset can be found under the following URL: https://google.github.io/realestate10k. We use generations for text prompts from Real Estate10K (Zhou et al., 2018) and MSR-VTT (Xu et al., 2016).
Dataset Splits Yes The training split for fine-tuning consists of roughly 65K video clips, and is the same as is used in concurrent work (MotionCtrl (Wang et al., 2023e) and CameraCtrl (He et al., 2024a)). We evaluate our method using 20 camera trajectories sampled from the Real Estate10K test split that were not seen during training for the user study. We use the full test split combined with unseen text prompts for the automated camera evaluations, i.e., 6928 unseen camera trajectories combined with out-of-distribution text prompts.
Hardware Specification Yes A single training run for the smaller 700M parameter generator takes approximately 1 day on a node equipped with 8 NVIDIA A100 40GB GPUs, connected via NVIDIA NVLink, along with 960 GB of RAM and 92 Intel Xeon CPUs. The larger 4B parameter model was trained on 8 such nodes for 1.5 days, totaling 64 NVIDIA A100 40GB GPUs.
Software Dependencies No The paper mentions optimizers (LAMB, AdamW) and models (T5-11B) used, but does not provide specific version numbers for core software components or libraries like Python, PyTorch, TensorFlow, or CUDA, which are essential for full reproducibility.
Experiment Setup Yes Both models were trained with a batch size of 256 over 50,000 optimization steps with the LAMB optimizer (You et al., 2019). The learning rate was warmed up for the first 10,000 iterations from 0 to 0.005 and then linearly decreased to 0.0015 over subsequent iterations. The base DiT model is optimized using AdamW, with a learning rate of 0.0001 and weight decay of 0.01. It is trained for 750,000 iterations with a cosine learning rate scheduler in bfloat16.
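The warmup-then-decay schedule described in this row can be sketched as a simple step-to-learning-rate function. This is a minimal illustration, not the authors' code; in particular, it assumes the linear decay reaches 0.0015 exactly at the final step 50,000, which the paper does not state explicitly ("over subsequent iterations").

```python
def lr_at_step(step: int,
               warmup_steps: int = 10_000,   # from the paper
               total_steps: int = 50_000,    # from the paper
               peak_lr: float = 0.005,       # from the paper
               final_lr: float = 0.0015) -> float:  # from the paper
    """Linear warmup from 0 to peak_lr, then linear decay to final_lr.

    Assumes decay ends at total_steps (an interpretation, not stated
    in the paper).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + frac * (final_lr - peak_lr)
```

Such a function can be plugged into a framework scheduler (e.g., PyTorch's `LambdaLR` with the ratio `lr_at_step(step) / peak_lr`) to reproduce the stated schedule.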