Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention
Authors: Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, Hao Tang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality. Project page: https://ir1d.github.io/Cavia/ 5. Experiments In this section, we present experimental results and analysis. Video comparisons are provided on the project webpage for optimal visual evaluation. 5.1. Quantitative Comparisons 5.2. Qualitative Comparison 5.3. Ablation Studies and Applications |
| Researcher Affiliation | Collaboration | Dejia Xu 1*, Yifan Jiang 2, Chen Huang 2, Liangchen Song 2, Thorsten Gernoth 2, Liangliang Cao 2, Zhangyang Wang 1, Hao Tang 2. *This work was performed while Dejia Xu interned at Apple. 1 The University of Texas at Austin, 2 Apple. Correspondence to: Hao Tang <hao EMAIL>. |
| Pseudocode | No | The paper describes the methodology in detail across sections 3 and 4, including '3.2. Camera Controllable Video Diffusion Model' and '3.3. Consistent Multi-view Video Diffusion Model'. It provides explanations of the architecture, attention mechanisms, and training strategy, but does not present a distinct pseudocode block or algorithm. |
| Open Source Code | No | Project page: https://ir1d.github.io/Cavia/ While a project page is provided, the paper does not contain an explicit statement like "We release our code" or a direct link to a code repository. The project page itself is not a code repository. |
| Open Datasets | Yes | WildRGB-D (Xia et al., 2024) includes nearly 20,000 RGB-D videos across 46 common object categories. MVImgNet (Yu et al., 2023) comprises 219,188 videos featuring objects from 238 classes. DL3DV-10K (Ling et al., 2023b) provides 7,000 long-duration videos captured in both indoor and outdoor environments. CO3Dv2 (Reizenstein et al., 2021) contains 34,000 turntable-like videos of rigid objects... Objaverse (Deitke et al., 2023b) and Objaverse-XL (Deitke et al., 2023a)... sourced from InternVid (Wang et al., 2023b) and OpenVid (Nan et al., 2024) datasets |
| Dataset Splits | Yes | We randomly sample 1,000 video sequences from the RealEstate10K (Zhou et al., 2018) test set for evaluation. We randomly sample 1,000 videos, each with 27 frames, from the RealEstate10K (Zhou et al., 2018) test set and convert each video into a two-view sequence with 14 frames per view. For RealEstate10K, we use the train/test split released by pixelSplat (Charatan et al., 2023). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | Our model builds on pre-trained Stable Video Diffusion (SVD) (Stability, 2023). SVD extends Stable Diffusion 2.1 (Rombach et al., 2022) by adding temporal convolution and attention layers, following the Video LDM architecture (Blattmann et al., 2023). We use clean-fid and common-metrics-on-video-quality for obtaining FID and FVD, respectively. Our FVD results are reported in VideoGPT (Yan et al., 2021) format. Our COLMAP is configured following DSNeRF (Deng et al., 2022) and (Xu et al., 2024). We enable --SiftMatching.max_num_matches 65536 to support robust feature matching. We use Blender's EEVEE renderer to render an 84-frame RGBA orbit at 512 x 512 resolution. Optical flow is first obtained using cv2.calcOpticalFlowFarneback for each consecutive frame pair. Then, the magnitudes and directions are calculated via cv2.cartToPolar. We utilize lpips (Zhang et al., 2018) to measure the similarity of nearby frames. |
| Experiment Setup | Yes | Our training is divided into a static stage and a dynamic stage. Our static stage is trained for around 500k iterations and our dynamic stage is trained for roughly 300k iterations. The effective batch size is 128 and the learning rate is 1e-4. Our video length is 14 frames for each view with the first frame shared across views. Our model is fine-tuned at 256 x 256 spatial resolution from the SVD 1.0 checkpoint. The training data are prepared by first center-cropping the original videos and then resizing each frame to the shape of 256 x 256. In the dynamic stage, 30% of iterations are used to train on monocular videos. During static training, the strides of frames are randomly sampled in the range of [1, 8]. For monocular videos, the strides are sampled in the range of [1, 2]. For dynamic multi-view object renderings, the strides are fixed to 1 to use all rendered frames since we already introduced randomness in the frame rate during rendering. At inference time, the decoding chunk is set to 14 so all frames are decoded altogether. We sample 25 steps to obtain all our results. |
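The evaluation protocol described under Dataset Splits converts a 27-frame clip into a two-view sequence with 14 frames per view, where the first frame is shared across views. The paper does not spell out the exact index layout, so the following is a minimal sketch under one plausible interpretation (the function name and the A/B split are assumptions):

```python
def split_into_two_views(frames):
    """Split a 27-frame clip into two 14-frame views sharing frame 0.

    Assumed layout: view A takes frames 0-13, view B reuses frame 0
    followed by frames 14-26, so each view has 14 frames and the
    first frame is shared across views.
    """
    assert len(frames) == 27, "expected a 27-frame sequence"
    view_a = frames[:14]                 # frames 0..13
    view_b = [frames[0]] + frames[14:]   # shared frame 0, then 14..26
    assert len(view_a) == len(view_b) == 14
    return view_a, view_b

view_a, view_b = split_into_two_views(list(range(27)))
```

This layout keeps every source frame exactly once apart from the shared first frame, matching the 27 = 1 + 13 + 13 frame budget.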
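The Experiment Setup row states that frame strides are sampled per data source: [1, 8] during static training, [1, 2] for monocular videos, and fixed to 1 for dynamic multi-view object renderings. A hedged sketch of that sampling policy, assuming 14-frame clips; the helper name, the clamping of oversized strides, and the uniform start-index choice are illustrative assumptions, not details from the paper:

```python
import random

# Stride ranges quoted in the paper, keyed by an assumed source label.
STRIDE_RANGES = {
    "static": (1, 8),          # static-stage multi-view videos
    "monocular": (1, 2),       # monocular videos in the dynamic stage
    "dynamic_object": (1, 1),  # rendered objects: stride fixed to 1
}

def sample_clip_indices(num_frames, source, clip_len=14, rng=random):
    """Pick `clip_len` frame indices with a randomly sampled stride."""
    lo, hi = STRIDE_RANGES[source]
    stride = rng.randint(lo, hi)
    # Clamp so the strided clip still fits inside the video (assumption).
    max_stride = max(1, (num_frames - 1) // (clip_len - 1))
    stride = min(stride, max_stride)
    start = rng.randint(0, num_frames - (clip_len - 1) * stride - 1)
    return [start + i * stride for i in range(clip_len)]
```

For a 100-frame video, `sample_clip_indices(100, "dynamic_object")` always yields 14 consecutive indices, while `"static"` can skip up to 7 frames between samples after clamping.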