SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

Authors: Yiming Xie, Chun-Han Yao, Vikram Voleti, Huaizu Jiang, Varun Jampani

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on multiple datasets and user studies demonstrate SV4D's state-of-the-art performance on novel-view video synthesis as well as 4D generation compared to prior works. We perform extensive comparisons of both novel-view video synthesis and 4D generation results with respective state-of-the-art methods on datasets with synthetic (Objaverse Dy, Consistent4D (Jiang et al., 2024c)) and real-world (DAVIS (Perazzi et al., 2016; Pont-Tuset et al., 2017; Caelles et al., 2019)) data.
Researcher Affiliation | Collaboration | Yiming Xie 1,2; Chun-Han Yao 1; Vikram Voleti 1; Huaizu Jiang 2; Varun Jampani 1. Affiliations: 1 Stability AI, 2 Northeastern University.
Pseudocode | No | The paper describes the model architecture and sampling strategies using diagrams and text, but it does not contain a dedicated section or figure explicitly labeled as 'Pseudocode' or 'Algorithm' with structured steps.
Open Source Code | Yes | Project page: https://sv4d.github.io.
Open Datasets | Yes | To further train SV4D, we carefully curated a subset of the Objaverse (Deitke et al., 2023b;a) dataset with dynamic 3D objects, resulting in the Objaverse Dy dataset. We perform extensive comparisons of both novel-view video synthesis and 4D generation results with respective state-of-the-art methods on datasets with synthetic (Objaverse Dy, Consistent4D (Jiang et al., 2024c)) and real-world (DAVIS (Perazzi et al., 2016; Pont-Tuset et al., 2017; Caelles et al., 2019)) data.
Dataset Splits | No | We randomly sample 8 views and 5 frames from our 21 rendered views and 21 frames for training. For evaluation purposes, we excluded objects from the Consistent4D dataset from our training set to make a fair comparison. For user studies, we randomly select 10 real-world videos from the DAVIS dataset and 10 synthetic videos from the Objaverse or Consistent4D datasets. However, the paper does not specify concrete train/test/validation splits for the main datasets used in the quantitative evaluation (Objaverse Dy, Consistent4D).
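The per-object sampling described in the quote above (8 of 21 rendered views, 5 of 21 frames, yielding a 40-image training grid) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name and the use of sorted index sets are assumptions.

```python
import random

# The paper renders 21 views x 21 frames per object and randomly samples
# 8 views and 5 frames per training example (illustrative sketch).
NUM_VIEWS, NUM_FRAMES = 21, 21

def sample_training_grid(num_views=8, num_frames=5, seed=None):
    rng = random.Random(seed)
    views = sorted(rng.sample(range(NUM_VIEWS), num_views))
    frames = sorted(rng.sample(range(NUM_FRAMES), num_frames))
    # The model consumes the full views x frames image grid (8 x 5 = 40 images).
    return [(v, f) for v in views for f in frames]

grid = sample_training_grid(seed=0)
```

This matches the 40-frame output shape quoted in the Experiment Setup row (F = 5 frames along each of V = 8 views).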
Hardware Specification | Yes | We use an effective batch size of 16 during training on 2 nodes of 8×80GB H100 GPUs.
Software Dependencies | No | The paper mentions several models and frameworks, including SV3D (Voleti et al., 2024), SVD-xt (Blattmann et al., 2023a), the SD2.1 (Rombach et al., 2022) VAE, CLIP (Radford et al., 2021), Blender's CYCLES renderer, and the Adam optimizer (Kingma & Ba, 2014). However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We train SV4D on our Objaverse Dy dataset. We choose to finetune from the SV3Dp (Voleti et al., 2024) model to output 40 frames (F = 5 frames along each of the V = 8 views) with the spatial resolution 576×576, where the parameters in the frame attention layers are initialized from SVD-xt (Blattmann et al., 2023a). Similar to SV3D, we train SV4D progressively by first training on the static camera orbits for 40K iterations, then fine-tuning it for 20K iterations on the dynamic orbits. We use an effective batch size of 16 during training on 2 nodes of 8×80GB H100 GPUs. We render the dynamic NeRF at 512×512 resolution and use an Adam (Kingma & Ba, 2014) optimizer to train all model parameters. The overall optimization takes roughly 15-20 minutes per object. We use an Adam optimizer with a learning rate of 0.01 for both stages. For training efficiency and stability, we follow a coarse-to-fine, static-to-dynamic strategy to optimize a 4D representation. That is, we first freeze the deformation field MLPt and only optimize the canonical NeRF on the multi-view images of the first frame, while gradually increasing the rendering resolution from 128×128 to 512×512. Then, we unfreeze MLPt and randomly sample 4 frames × 4 views for training.
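The coarse-to-fine, static-to-dynamic 4D optimization quoted above can be sketched as a per-step schedule function. This is a minimal illustration, not the authors' implementation; the stage-1 step count and the intermediate resolution ladder are assumptions (the paper specifies only the 128×128 to 512×512 range, the 4-frame × 4-view sampling in stage 2, and Adam with learning rate 0.01 for both stages).

```python
# Illustrative sketch of the two-stage schedule: stage 1 freezes the
# deformation MLP (MLPt) and fits the canonical NeRF on first-frame views
# while the rendering resolution grows; stage 2 unfreezes MLPt and samples
# 4 frames x 4 views per iteration. Step counts/ladder are assumptions.
def optimization_schedule(step, stage1_steps=1000, resolutions=(128, 256, 512)):
    if step < stage1_steps:
        # Coarse-to-fine: advance through the resolution ladder within stage 1.
        idx = min(step * len(resolutions) // stage1_steps, len(resolutions) - 1)
        return {"train_deformation": False, "resolution": resolutions[idx],
                "frames_per_iter": 1, "views_per_iter": 8}
    # Static-to-dynamic: stage 2 optimizes the deformation field as well.
    return {"train_deformation": True, "resolution": 512,
            "frames_per_iter": 4, "views_per_iter": 4}
```

In a training loop, both stages would use the same Adam optimizer (lr = 0.01) over all model parameters, with the `train_deformation` flag controlling whether gradients reach MLPt.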