SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints
Authors: Jianhong Bai, Menghan Xia, Xintao Wang, Ziyang Yuan, Zuozhu Liu, Haoji Hu, Pengfei Wan, Di Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that SynCamMaster can generate consistent content from different viewpoints of the same scene, and achieves excellent inter-view synchronization. Ablation studies highlight the advantages of our key design choices. Furthermore, our method can be easily extended for novel view synthesis in videos by introducing a reference video to our multi-camera video generation model. Our contribution can be summarized as follows: ... Extensive experiments show the proposed SynCamMaster outperforms baselines by a large margin. |
| Researcher Affiliation | Collaboration | 1Zhejiang University, 2Kuaishou Technology, 3Tsinghua University |
| Pseudocode | No | The paper describes the methodology using prose, mathematical equations (e.g., Eq. 1-7), and diagrams (e.g., Figure 2 for model overview), but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/KwaiVGI/SynCamMaster. |
| Open Datasets | Yes | We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Our code is available at https://github.com/KwaiVGI/SynCamMaster. ... DL3DV-10K (Ling et al., 2024) ... RealEstate-10K (Zhou et al., 2018) ... Human3.6M (Ionescu et al., 2013) ... Panoptic Studio (Joo et al., 2015) ... Objaverse (Deitke et al., 2023) ... Co3D (Reizenstein et al., 2021) and MVImgNet (Yu et al., 2023) |
| Dataset Splits | No | The paper mentions data usage probabilities during training (e.g., "We joint train our model on multi-view video data, multi-view image data, and single-view video data with the probability of 0.6, 0.2, and 0.2 respectively") and describes the construction of an evaluation set ("We construct the evaluation set with 100 manually collected text prompts, and inference with 4 viewpoints each, resulting in 400 videos in total."). However, it does not provide specific training/validation/test splits (e.g., percentages or exact counts) for any of the datasets used to reproduce experiments. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions tools and frameworks like "UNet-based explorations", "Transformer-based scaling laws", "3D Variational Auto-Encoder (VAE)", "Rectified Flow framework", "SAM (Kirillov et al., 2023)", and "CoTracker (Karaev et al., 2023)". However, it does not specify version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We jointly train our model on multi-view video data, multi-view image data, and single-view video data with probabilities of 0.6, 0.2, and 0.2 respectively. We train the model for 50K steps at a resolution of 384x672 with a learning rate of 0.0001 and a batch size of 32. The view-attention module is initialized with the weights of the temporal-attention module, and the camera encoder and the projector are zero-initialized. |
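The reported mixed-data training schedule (sampling a data source per step with probabilities 0.6/0.2/0.2) can be sketched as follows. This is a minimal illustration of the sampling logic only; the source names, the `sample_data_source` helper, and the constants layout are hypothetical, while the probability values and hyperparameters are taken from the paper's quoted setup.

```python
import random

# Data-source mixing probabilities reported in the paper: multi-view video 0.6,
# multi-view image 0.2, single-view video 0.2. Source names are illustrative.
SOURCES = ["multi_view_video", "multi_view_image", "single_view_video"]
PROBS = [0.6, 0.2, 0.2]

# Training hyperparameters as stated in the paper.
TRAIN_STEPS = 50_000
RESOLUTION = (384, 672)   # height x width
LEARNING_RATE = 1e-4
BATCH_SIZE = 32


def sample_data_source(rng: random.Random) -> str:
    """Pick which data source feeds the next training step (hypothetical helper)."""
    return rng.choices(SOURCES, weights=PROBS, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Empirically the mixture should match the configured probabilities.
    draws = [sample_data_source(rng) for _ in range(10_000)]
    for name, p in zip(SOURCES, PROBS):
        print(name, draws.count(name) / len(draws), "target:", p)
```

A per-step categorical draw like this (rather than a fixed epoch-level split) is the simplest way to realize the stated 0.6/0.2/0.2 mixture without precomputing a merged dataset.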