SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

Authors: Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that SynCamMaster can generate consistent content from different viewpoints of the same scene, and achieves excellent inter-view synchronization. Ablation studies highlight the advantages of our key design choices. Furthermore, our method can be easily extended for novel view synthesis in videos by introducing a reference video to our multi-camera video generation model. Our contribution can be summarized as follows: ... Extensive experiments show the proposed SynCamMaster outperforms baselines by a large margin.
Researcher Affiliation | Collaboration | ¹Zhejiang University, ²Kuaishou Technology, ³Tsinghua University
Pseudocode | No | The paper describes the methodology using prose, mathematical equations (e.g., Eq. 1-7), and diagrams (e.g., Figure 2 for the model overview), but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/KwaiVGI/SynCamMaster.
Open Datasets | Yes | We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Our code is available at https://github.com/KwaiVGI/SynCamMaster. ... DL3DV-10K (Ling et al., 2024) ... RealEstate-10K (Zhou et al., 2018) ... Human3.6M (Ionescu et al., 2013) ... Panoptic Studio (Joo et al., 2015) ... Objaverse (Deitke et al., 2023) ... Co3D (Reizenstein et al., 2021) and MVImgNet (Yu et al., 2023)
Dataset Splits | No | The paper mentions data usage probabilities during training (e.g., "We joint train our model on multi-view video data, multi-view image data, and single-view video data with the probability of 0.6, 0.2, and 0.2 respectively") and describes the construction of an evaluation set ("We construct the evaluation set with 100 manually collected text prompts, and inference with 4 viewpoints each, resulting in 400 videos in total."). However, it does not provide specific training/validation/test splits (e.g., percentages or exact counts) for any of the datasets needed to reproduce the experiments.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions tools and frameworks like "UNet-based explorations", "Transformer-based scaling laws", "3D Variational Auto-Encoder (VAE)", the "Rectified Flow framework", "SAM (Kirillov et al., 2023)", and "CoTracker (Karaev et al., 2023)". However, it does not specify version numbers for any of these software components or libraries.
Experiment Setup | Yes | We joint train our model on multi-view video data, multi-view image data, and single-view video data with the probability of 0.6, 0.2, and 0.2 respectively. We train the model for 50K steps at a resolution of 384x672 with a learning rate of 0.0001 and a batch size of 32. The view-attention module is initialized with the weights of the temporal-attention module, and the camera encoder and the projector are zero-initialized.
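The joint-training mix quoted above (probabilities 0.6 / 0.2 / 0.2 over the three data sources) can be sketched as a per-step source sampler. This is a hypothetical illustration, not the authors' code; the source names and the `sample_source` helper are assumptions introduced here for clarity.

```python
import random

# Hypothetical sketch of the joint-training data mix described in the paper:
# each training step draws its batch from one of three data sources with
# fixed probabilities (0.6 multi-view video, 0.2 multi-view image,
# 0.2 single-view video).
SOURCES = ["multi_view_video", "multi_view_image", "single_view_video"]
PROBS = [0.6, 0.2, 0.2]

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training step."""
    return rng.choices(SOURCES, weights=PROBS, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {s: 0 for s in SOURCES}
    for _ in range(10_000):
        counts[sample_source(rng)] += 1
    # Empirical frequencies should land close to 0.6 / 0.2 / 0.2.
    print({s: round(c / 10_000, 2) for s, c in counts.items()})
```

Over many steps the empirical source frequencies converge to the stated probabilities, matching the "probability of 0.6, 0.2, and 0.2" wording in the quoted setup.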