MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow

Authors: Hanzhuo Huang, Yuan Liu, Ge Zheng, Jiepeng Wang, Zhiyang Dou, Sibei Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on both the Consistent4D (Jiang et al., 2023) dataset and a self-collected dataset to validate the effectiveness of our methods. The results demonstrate that our method generates videos with high-fidelity and high-quality motion on unseen views. (Evidence drawn from Section 4 Experiment: 4.1 Experimental Settings, 4.2 Comparisons, 4.3 Ablation Study.)
Researcher Affiliation | Academia | 1ShanghaiTech University, 2Sun Yat-sen University, 3The Hong Kong University of Science and Technology, 4The University of Hong Kong
Pseudocode | No | The paper describes the methodology using prose and includes an equation (Eq. 1) but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Project page: https://soolab.github.io/MVTokenFlow. While a project page is provided, it does not explicitly state that source code for the methodology is available there, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We conduct experiments on both the Consistent4D (Jiang et al., 2023) dataset and a self-collected dataset to validate the effectiveness of our methods. The dataset comprises 12 synthetic videos and 12 in-the-wild videos.
Dataset Splits | No | The paper mentions the dataset structure (12 synthetic videos, 12 in-the-wild videos, 32 frames) and a sampling strategy for training, but does not provide specific training/test/validation dataset splits (e.g., percentages or exact counts) for reproducibility.
Hardware Specification | Yes | All experiments are conducted on an NVIDIA A40 GPU.
Software Dependencies | No | The paper mentions using 'Era3D (Li et al., 2024b)' and 'RAFT (Teed & Deng, 2020)' but does not provide specific version numbers for these or any other software components.
Experiment Setup | Yes | For multi-view video generation, we utilize Era3D (Li et al., 2024b) to generate K = 6 viewpoints at a resolution of 512×512 for one frame, using 40 denoising steps. We set τ = 20, executing token propagation during the denoising process when t < τ. For keyframe selection, we employ a keyframe interval of 8 frames. In the initialization phase of the dynamic 3D Gaussian representation, we initialize 512 control points with Farthest Point Sampling (FPS). Each Gaussian point is influenced by its 3 nearest control points. During the training of the dynamic Gaussian field, we use an initial learning rate of 3×10⁻⁴ for the MLP, followed by exponential decay. In the refinement phase of the dynamic 3D Gaussian field with regenerated multiview videos, we reset the learning rate to its initial value and apply the same decay strategy. Our training consists of a total of 30K iterations: we first use 5K iterations to learn a static 3D Gaussian from multiview images of a keyframe, which serves as the initialization for the dynamic 3D Gaussian representation; next, we utilize 10K iterations to learn a coarse dynamic 3D Gaussian field from multiview videos; after regenerating the multiview videos with improved quality, we perform 15K iterations to refine and obtain the final dynamic 3D Gaussian field. For general cases, we set λr to 0.8, λDSSIM to 0.2, λm to 2, and the remaining hyperparameters to 1.
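The control-point initialization the paper describes (512 control points chosen by Farthest Point Sampling, each Gaussian bound to its 3 nearest control points) can be sketched as below. This is a minimal illustration of the standard FPS algorithm, not the authors' code; the function name, point-cloud size, and brute-force nearest-neighbor search are assumptions for the sketch.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedily select k well-spread indices from an (N, 3) point cloud."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    selected[0] = rng.integers(n)
    # Distance from every point to its nearest already-selected point.
    dists = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, k):
        # Pick the point farthest from the current selection.
        selected[i] = np.argmax(dists)
        new_d = np.linalg.norm(points - points[selected[i]], axis=1)
        dists = np.minimum(dists, new_d)
    return selected

# Hypothetical Gaussian centers; 2000 points stand in for the real cloud.
cloud = np.random.default_rng(1).standard_normal((2000, 3))
ctrl_idx = farthest_point_sampling(cloud, 512)
ctrl = cloud[ctrl_idx]

# Bind each Gaussian to its 3 nearest control points (brute force for clarity).
d = np.linalg.norm(cloud[:, None, :] - ctrl[None, :, :], axis=2)
nearest3 = np.argsort(d, axis=1)[:, :3]
```

In a real pipeline the weights of the 3 nearest control points would then drive each Gaussian's deformation; a KD-tree would replace the brute-force distance matrix at scale.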