VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Authors: Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David Lindell, Sergey Tulyakov

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We thoroughly evaluate this approach, including comparisons to previous camera control methods, which we adapt to the video transformer architecture. We show state-of-the-art results in camera-controllable video synthesis by applying the proposed conditioning method and fine-tuning scheme to the Snap Video-based model (Menapace et al., 2024). From Section 4 (Experiments): We provide a qualitative and quantitative assessment of our approach compared to the baselines in Fig. 4 and in Tab. 1.
Researcher Affiliation Collaboration Sherwin Bahmani1,2,3 Ivan Skorokhodov3 Aliaksandr Siarohin3 Willi Menapace3 Guocheng Qian3 Michael Vasilkovsky3 Hsin-Ying Lee3 Chaoyang Wang3 Jiaxu Zou3 Andrea Tagliasacchi1,4 David B. Lindell1,2 Sergey Tulyakov3 1University of Toronto 2Vector Institute 3Snap Inc. 4SFU
Pseudocode No The paper describes the method using mathematical formulations and architectural diagrams (e.g., Figure 3 and equation 3), but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code Yes Video results: https://snap-research.github.io/vd3d. To further facilitate reproducibility, we include the source code of our camera-controlled FIT block as supplementary material.
Open Datasets Yes To develop the camera control methods proposed in this paper, we used Real Estate10K (Zhou et al., 2018). Real Estate10K is released and open-sourced by Google LLC under a Creative Commons Attribution 4.0 International License, and sourced from content using a CC-BY license. The dataset can be found under the following URL: https://google.github.io/realestate10k. We use generations for text prompts from Real Estate10K (Zhou et al., 2018) and MSR-VTT (Xu et al., 2016).
Dataset Splits Yes The training split for fine-tuning consists of roughly 65K video clips, and is the same as is used in concurrent work (MotionCtrl (Wang et al., 2023e) and CameraCtrl (He et al., 2024a)). We evaluate our method using 20 camera trajectories sampled from the Real Estate10K test split that were not seen during training for the user study. We use the full test split combined with unseen text prompts for the automated camera evaluations, i.e., 6928 unseen camera trajectories combined with out-of-distribution text prompts.
Hardware Specification Yes A single training run for the smaller 700M parameter generator takes approximately 1 day on a node equipped with 8 NVIDIA A100 40GB GPUs, connected via NVIDIA NVLink, along with 960 GB of RAM and 92 Intel Xeon CPUs. The larger 4B parameter model was trained on 8 such nodes for 1.5 days, totaling 64 NVIDIA A100 40GB GPUs.
Software Dependencies No The paper mentions optimizers (LAMB, AdamW) and models (T5-11B) used, but does not provide specific version numbers for core software components or libraries like Python, PyTorch, TensorFlow, or CUDA, which are essential for full reproducibility.
Experiment Setup Yes Both models were trained with a batch size of 256 over 50,000 optimization steps with the LAMB optimizer (You et al., 2019). The learning rate was warmed up for the first 10,000 iterations from 0 to 0.005 and then linearly decreased to 0.0015 over subsequent iterations. The base DiT model is optimized using AdamW, with a learning rate of 0.0001 and weight decay of 0.01. It is trained for 750,000 iterations with a cosine learning rate scheduler in bfloat16.
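The warmup-then-decay schedule described in this row can be sketched as a simple step-to-learning-rate function. This is a minimal illustration, not the authors' code; in particular, it assumes the linear decay reaches 0.0015 exactly at the final step 50,000, which the paper does not state explicitly ("over subsequent iterations").

```python
def lr_at_step(step: int,
               warmup_steps: int = 10_000,   # from the paper
               total_steps: int = 50_000,    # from the paper
               peak_lr: float = 0.005,       # from the paper
               final_lr: float = 0.0015) -> float:  # from the paper
    """Linear warmup from 0 to peak_lr, then linear decay to final_lr.

    Assumes decay ends at total_steps (an interpretation, not stated
    in the paper).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + frac * (final_lr - peak_lr)
```

Such a function can be plugged into a framework scheduler (e.g., PyTorch's `LambdaLR` with the ratio `lr_at_step(step) / peak_lr`) to reproduce the stated schedule.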