SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency
Authors: Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, Varun Jampani
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on multiple datasets and user studies demonstrate SV4D's state-of-the-art performance on novel-view video synthesis as well as 4D generation compared to prior works. We perform extensive comparisons of both novel view video synthesis and 4D generation results with respective state-of-the-art methods on datasets with synthetic (Objaverse Dy, Consistent4D (Jiang et al., 2024c)) and real-world (DAVIS (Perazzi et al., 2016; Pont-Tuset et al., 2017; Caelles et al., 2019)) data. |
| Researcher Affiliation | Collaboration | Yiming Xie1,2 Chun-Han Yao1 Vikram Voleti1 Huaizu Jiang2 Varun Jampani1 1 Stability AI 2 Northeastern University |
| Pseudocode | No | The paper describes the model architecture and sampling strategies using diagrams and text, but it does not contain a dedicated section or figure explicitly labeled as 'Pseudocode' or 'Algorithm' with structured steps. |
| Open Source Code | Yes | Project page: https://sv4d.github.io. |
| Open Datasets | Yes | To further train SV4D, we carefully curated a subset of the Objaverse (Deitke et al., 2023b;a) dataset with dynamic 3D objects, resulting in the Objaverse Dy dataset. We perform extensive comparisons of both novel view video synthesis and 4D generation results with respective state-of-the-art methods on datasets with synthetic (Objaverse Dy, Consistent4D (Jiang et al., 2024c)) and real-world (DAVIS (Perazzi et al., 2016; Pont-Tuset et al., 2017; Caelles et al., 2019)) data. |
| Dataset Splits | No | We randomly sample 8 views and 5 frames from our 21 rendered views and 21 frames for training. For evaluation purposes, we excluded objects from the Consistent4D dataset from our training set to make a fair comparison. For user studies, we randomly select 10 real-world videos from the DAVIS dataset and 10 synthetic videos from the Objaverse or Consistent4D datasets. However, the paper does not specify concrete train/test/validation splits for the main datasets used in the quantitative evaluation (Objaverse Dy, Consistent4D). |
| Hardware Specification | Yes | We use an effective batch size of 16 during training on 2 nodes of 8 80GB H100 GPUs. |
| Software Dependencies | No | The paper mentions several models and frameworks like SV3D (Voleti et al., 2024), SVD-xt (Blattmann et al., 2023a), SD2.1 (Rombach et al., 2022) VAE, CLIP (Radford et al., 2021), Blender's CYCLES renderer, and the Adam optimizer (Kingma & Ba, 2014). However, it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We train SV4D on our Objaverse Dy dataset. We choose to finetune from the SV3Dp (Voleti et al., 2024) model to output 40 frames (F = 5 frames along each of the V = 8 views) at a spatial resolution of 576×576, where the parameters in the frame attention layers are initialized from SVD-xt (Blattmann et al., 2023a). Similar to SV3D, we train SV4D progressively by first training on the static camera orbits for 40K iterations, then fine-tuning it for 20K iterations on the dynamic orbits. We use an effective batch size of 16 during training on 2 nodes of 8 80GB H100 GPUs. We render the dynamic NeRF at 512×512 resolution and use an Adam (Kingma & Ba, 2014) optimizer to train all model parameters. The overall optimization takes roughly 15-20 minutes per object. We use an Adam optimizer with a learning rate of 0.01 for both stages. For training efficiency and stability, we follow a coarse-to-fine, static-to-dynamic strategy to optimize a 4D representation. That is, we first freeze the deformation field MLPt and only optimize the canonical NeRF on the multi-view images of the first frame, while gradually increasing the rendering resolution from 128×128 to 512×512. Then, we unfreeze MLPt and randomly sample 4 frames × 4 views for training. |
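The coarse-to-fine, static-to-dynamic schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names are hypothetical, and only the sampling counts (4 frames × 4 views in the dynamic stage, 21 frames / 8 views available) and the 128→512 resolution ramp come from the paper.

```python
import random

def sample_training_batch(stage, num_frames=21, num_views=8, rng=None):
    """Static-to-dynamic batch sampling (hypothetical helper).

    Stage 1: deformation MLP frozen; optimize the canonical NeRF on the
             multi-view images of the first frame only.
    Stage 2: deformation MLP unfrozen; randomly sample 4 frames x 4 views.
    """
    rng = rng or random.Random(0)
    if stage == 1:
        frames = [0]                      # first frame only
        views = list(range(num_views))    # all available views
    else:
        frames = rng.sample(range(num_frames), 4)
        views = rng.sample(range(num_views), 4)
    return frames, views

def render_resolution(step, total_steps, lo=128, hi=512):
    """Coarse-to-fine: linearly ramp rendering resolution from lo to hi.

    The paper states the ramp endpoints (128 -> 512); the linear
    interpolation over training steps is an assumption.
    """
    t = min(step / max(total_steps - 1, 1), 1.0)
    return int(lo + t * (hi - lo))
```

A training loop would call `render_resolution` each iteration during stage 1 and switch to `sample_training_batch(2)` once the canonical NeRF has converged.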