IV-mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis
Authors: Shitong Shao, Zikai Zhou, Lichen Bai, Haoyi Xiong, Zeke Xie
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments have demonstrated that IV-mixed Sampler achieves state-of-the-art performance on 4 benchmarks including UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150/1649, and VBench. For example, the open-source AnimateDiff with IV-mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, approaching the 223.1 achieved by the closed-source Pika-2.0. Our code is released at https://github.com/xie-lab-ml/IV-mixed-Sampler. |
| Researcher Affiliation | Collaboration | 1Hong Kong University of Science and Technology (Guangzhou), 2Baidu Inc. |
| Pseudocode | Yes | 1) We construct IV-mixed Sampler under a rigorous mathematical framework and demonstrate, through theoretical analysis, that it can be elegantly transformed into a standard inverse ordinary differential equation (ODE) process. For the sake of intuition, we present IV-mixed Sampler (i.e., IV-IV) in Fig. 2 and its pseudocode in Appendix B. |
| Open Source Code | Yes | Our code is released at https://github.com/xie-lab-ml/IV-mixed-Sampler. Our project page can be found at https://klayand.github.io/IV-mixed-Sampler. |
| Open Datasets | Yes | Our experiments have demonstrated that IV-mixed Sampler achieves state-of-the-art performance on 4 benchmarks including UCF-101-FVD, MSR-VTT-FVD, Chronomagic-Bench-150/1649, and VBench. For example, the open-source AnimateDiff with IV-mixed Sampler reduces the UMT-FVD score from 275.2 to 228.6, approaching the 223.1 achieved by the closed-source Pika-2.0. Our code is released at https://github.com/xie-lab-ml/IV-mixed-Sampler. |
| Dataset Splits | Yes | For our evaluation, we utilize all 497 validation videos. To ensure evaluation stability, we synthesize a total of 1,491 videos based on prompts from these validation videos, with each prompt producing 3 different videos. Specifically, we synthesize 5 videos for each of the 101 prompts provided by Ge et al. (2023), resulting in a total of 505 synthesized videos. We then compute the FVD between these 505 synthesized videos and 505 randomly sampled videos from the UCF-101 dataset (5 per class), using the built-in FVD evaluation code from Open-Sora-Plan. |
| Hardware Specification | Yes | In the practical implementation, the computational overhead increased from 21 s to 92 s on a single RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch-like style' for pseudocode, but does not provide specific version numbers for PyTorch or other software libraries. |
| Experiment Setup | Yes | For all comparison experiments, we use the IV-IV form and perform IV-mixed Sampler at all time steps of the standard DDIM sampling. In addition, γ^{go}_{t=0}, γ^{back}_{t=0}, γ^{go}_{t=1}, and γ^{back}_{t=1} are all set to 4. For both AnimateDiff and ModelScope-T2V, we use Stable Diffusion (SD) V1.5 as the IDM. Note that we experimented with using MiniSD as the IDM for ModelScope-T2V to maintain a consistent resolution of 256×256. However, as illustrated in Table 6, we found that its performance was inferior to using SD V1.5 with upsampling and downsampling. For VideoCrafter V2, we use Realistic Vision V6.0 B1 (Mage.Space, 2023) as the IDM to accommodate a resolution of 512×320. For the remaining configurations, we follow the sampling form recommended by the corresponding VDMs. Furthermore, we find that applying IV-IV at every step on VideoCrafter V2 destroys temporal coherence. Therefore, we replace IV-IV with VV-VV for z% of the steps. The results of the ablation experiments are shown in Table 7. We finally chose z%=66.7% as the final solution. |
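
The Experiment Setup row describes a per-step schedule: IV-IV is applied at DDIM steps, with VV-VV substituted for z% of the steps on VideoCrafter V2, and all four guidance scales γ set to 4. A minimal sketch of how such a schedule could be built is below; the function name `build_schedule`, the choice of *which* steps receive VV-VV (here, the final z% of the trajectory), and the `GAMMA` dictionary keys are assumptions for illustration, not the authors' implementation.

```python
def build_schedule(num_steps: int, z_percent: float) -> list:
    """Label each DDIM step with the sub-sampler to apply.

    The paper replaces IV-IV with VV-VV for z% of the steps on
    VideoCrafter V2; which steps are replaced is an assumption here
    (we place the VV-VV steps at the end of the trajectory).
    """
    n_vv = round(num_steps * z_percent / 100.0)
    return ["IV-IV"] * (num_steps - n_vv) + ["VV-VV"] * n_vv


# All four guidance scales are set to 4 in the paper; the key names
# mirror the gamma_{t=0/1}^{go/back} notation and are hypothetical.
GAMMA = {"t0_go": 4, "t0_back": 4, "t1_go": 4, "t1_back": 4}

# With z% = 66.7% (the paper's final choice) and 30 DDIM steps,
# 20 of 30 steps use VV-VV.
schedule = build_schedule(num_steps=30, z_percent=66.7)
```

For the comparison experiments (z% = 0), every step would simply be labeled IV-IV.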
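
The Dataset Splits row specifies the UCF-101 FVD protocol: 505 synthesized videos (5 per prompt over 101 class prompts) are compared against 505 real videos drawn 5 per class. A sketch of the real-video sampling step is below, assuming UCF-101's 101 classes; the actual FVD computation uses Open-Sora-Plan's built-in evaluation code, which is not reproduced here, and `sample_real_videos` is a hypothetical helper.

```python
import random


def sample_real_videos(videos_by_class: dict, per_class: int = 5) -> list:
    """Randomly draw `per_class` videos from each class, matching the
    paper's 5-per-class sampling over UCF-101's 101 classes."""
    picked = []
    for cls in sorted(videos_by_class):
        picked.extend(random.sample(videos_by_class[cls], per_class))
    return picked


# Toy stand-in for UCF-101: 101 classes with 20 placeholder clips each.
dataset = {f"class_{i}": [f"class_{i}_vid_{j}" for j in range(20)]
           for i in range(101)}

# 101 classes x 5 videos -> 505 real videos, matching the 505
# synthesized videos (101 prompts x 5 generations).
real_videos = sample_real_videos(dataset)
```
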