VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
Authors: Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David Lindell, Sergey Tulyakov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thoroughly evaluate this approach, including comparisons to previous camera control methods, which we adapt to the video transformer architecture. We show state-of-the-art results in camera-controllable video synthesis by applying the proposed conditioning method and fine-tuning scheme to the Snap Video-based model (Menapace et al., 2024). From Sec. 4 (Experiments): We provide a qualitative and quantitative assessment of our approach compared to the baselines in Fig. 4 and in Tab. 1. |
| Researcher Affiliation | Collaboration | Sherwin Bahmani1,2,3 Ivan Skorokhodov3 Aliaksandr Siarohin3 Willi Menapace3 Guocheng Qian3 Michael Vasilkovsky3 Hsin-Ying Lee3 Chaoyang Wang3 Jiaxu Zou3 Andrea Tagliasacchi1,4 David B. Lindell1,2 Sergey Tulyakov3 1University of Toronto 2Vector Institute 3Snap Inc. 4SFU |
| Pseudocode | No | The paper describes the method using mathematical formulations and architectural diagrams (e.g., Figure 3 and equation 3), but it does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Video results: https://snap-research.github.io/vd3d. To further facilitate reproducibility, we include the source code of our camera-controlled FIT block as supplementary material. |
| Open Datasets | Yes | To develop the camera control methods proposed in this paper, we used Real Estate10K (Zhou et al., 2018). Real Estate10K is released and open-sourced by Google LLC under a Creative Commons Attribution 4.0 International License, and sourced from content using a CC-BY license. The dataset can be found under the following URL: https://google.github.io/realestate10k. We use generations for text prompts from Real Estate10K (Zhou et al., 2018) and MSR-VTT (Xu et al., 2016). |
| Dataset Splits | Yes | The training split for fine-tuning consists of roughly 65K video clips, and is the same as is used in concurrent work (Motion Ctrl (Wang et al., 2023e) and Camera Ctrl (He et al., 2024a)). We evaluate our method using 20 camera trajectories sampled from the Real Estate10K test split that were not seen during training for the user study. We use the full test split combined with unseen text prompts for the automated camera evaluations, i.e., 6928 unseen camera trajectories combined with out-of-distribution text prompts. |
| Hardware Specification | Yes | A single training run for the smaller 700M parameter generator takes approximately 1 day on a node equipped with 8 NVIDIA A100 40GB GPUs, connected via NVIDIA NVLink, along with 960 GB of RAM and 92 Intel Xeon CPUs. The larger 4B parameter model was trained on 8 such nodes for 1.5 days, totaling 64 NVIDIA A100 40GB GPUs. |
| Software Dependencies | No | The paper mentions optimizers (LAMB, AdamW) and models (T5-11B) used, but does not provide specific version numbers for core software components or libraries like Python, PyTorch, TensorFlow, or CUDA, which are essential for full reproducibility. |
| Experiment Setup | Yes | Both models were trained with a batch size of 256 over 50,000 optimization steps with the LAMB optimizer (You et al., 2019). The learning rate was warmed up for the first 10,000 iterations from 0 to 0.005 and then linearly decreased to 0.0015 over subsequent iterations. The base DiT model is optimized using AdamW, with a learning rate of 0.0001 and weight decay of 0.01. It is trained for 750,000 iterations with a cosine learning rate scheduler in bfloat16. |
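The fine-tuning schedule described above (linear warm-up from 0 to 0.005 over the first 10,000 steps, then a linear decay toward 0.0015) can be sketched as a simple step-to-learning-rate function. This is a minimal sketch, not the authors' code: the function name, the assumption that the decay is piecewise-linear, and the assumption that it reaches 0.0015 exactly at step 50,000 are all inferred from the quoted description.

```python
def lamb_lr(step: int,
            warmup: int = 10_000,
            total: int = 50_000,
            peak: float = 0.005,
            final: float = 0.0015) -> float:
    """Piecewise-linear LR: 0 -> peak over `warmup` steps,
    then peak -> final over the remaining steps (assumed endpoint)."""
    if step < warmup:
        return peak * step / warmup
    frac = min(1.0, (step - warmup) / (total - warmup))
    return peak + (final - peak) * frac

# Spot-check the schedule at a few milestones.
print(lamb_lr(0))        # start of warm-up
print(lamb_lr(10_000))   # peak learning rate
print(lamb_lr(50_000))   # final learning rate
```

Such a function would typically be plugged into a framework's LR scheduler (e.g., a per-step lambda scheduler) around the LAMB optimizer the paper specifies.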