ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
Authors: Serin Yang, Taesung Kwon, Jong Chul Ye
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted a comparative study with four different keyframe interpolation baselines, including FILM (Reda et al., 2022), a conventional flow-based frame interpolation method, and three frame interpolation methods based on video diffusion models: TRF (Feng et al., 2024), DynamiCrafter (Xing et al., 2023), and Generative Inbetweening (Wang et al., 2024). We conducted these studies using the official implementations with default values, except for TRF, which has not been open-sourced yet. Qualitative evaluation. As illustrated in Fig. 4, our model clearly outperforms the other methods in terms of motion consistency and identity preservation. Quantitative evaluation. For quantitative evaluation, we used LPIPS (Zhang et al., 2018) and FID (Heusel et al., 2017) to assess the quality of the generated frames, and FVD (Unterthiner et al., 2019) to evaluate the overall quality of the generated videos. As shown in Table 1, our method surpasses the other baselines in terms of fidelity. |
| Researcher Affiliation | Academia | Serin Yang1 , Taesung Kwon2 , Jong Chul Ye1 1Kim Jaechul Graduate School of AI, KAIST 2Dept. of Bio & Brain Engineering, KAIST EMAIL |
| Pseudocode | Yes | The detailed algorithm is provided in Algorithm 1. The vanilla bidirectional sampling can be implemented by removing DDS guidance (orange) and replacing the CFG++ update (blue) with a traditional CFG update. The detailed algorithm of the vanilla bidirectional sampling is provided in Appendix A. Algorithm 1 ViBiDSampler |
| Open Source Code | No | Project page: https://vibidsampler.github.io/ (Unofficial implementation: https://github.com/YingHuan-Chen/Time-Reversal - for TRF, not for our method) |
| Open Datasets | Yes | Dataset. The high-resolution (1080p) video datasets used for evaluation are sourced from the DAVIS dataset (Pont-Tuset et al., 2017) and the Pexels dataset1. For the DAVIS dataset, we preprocessed 100 videos into 100 video-keyframe pairs, with each video consisting of 25 frames. This dataset includes a wide range of large and varied motions, such as surfing, dancing, driving, and airplane flying. For the Pexels dataset, we collected 45 videos, primarily featuring scene motions, natural movements, directional animal movements, and sports actions. We used the first and last frames from each video as keyframes for our evaluation. 1https://www.pexels.com/ |
| Dataset Splits | No | The paper mentions preprocessing 100 videos from DAVIS and collecting 45 videos from Pexels, but does not specify training/test/validation splits for these datasets for the experimental evaluation. It states 'We used the first and last frames from each video as keyframes for our evaluation' but not how the datasets themselves were partitioned for evaluation. |
| Hardware Specification | Yes | On a single 3090 GPU, our method can interpolate 25 frames at 1024×576 resolution in just 195 seconds, establishing it as a leading solution for keyframe interpolation. All evaluations were performed on a single NVIDIA RTX 3090. |
| Software Dependencies | No | The paper mentions using specific schedulers and diffusion models like "Euler scheduler" and "Stable Video Diffusion (SVD)" within the "EDM-framework", but it does not provide specific version numbers for these software components or any programming languages or libraries (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | For the sampling process, we used the Euler scheduler with 25 timesteps for both forward and backward sampling. The motion bucket ID was fixed at 127, and the decoding frame number was set to 4 due to memory limitations on an NVIDIA RTX 3090 GPU. All other parameters followed the default settings from SVD. Since the micro-conditioning fps is sensitive to the data, we applied a lower fps for cases with large motion and a higher fps for cases with smaller motion. Figure 6: Effect of CFG++ guidance scale. The rows, from top to bottom, correspond to the CFG++ scales of 0.6, 0.8, and 1.0. |
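The vanilla bidirectional sampling referenced above (Algorithm 1 without the DDS guidance and CFG++ terms) can be sketched as a toy loop: at each noise level, take one forward Euler step, re-noise the result back to the current level, then take a backward step along the time-reversed frame axis. This is a minimal illustrative sketch only; `dummy_denoise`, the scalar "frames", the noise schedule, and the re-noising scale are placeholder assumptions, not the paper's SVD-based implementation.

```python
import random

def dummy_denoise(x, sigma):
    # Placeholder for the video diffusion denoiser D_theta(x; sigma);
    # the real method uses Stable Video Diffusion conditioned on a keyframe.
    return [v / (1.0 + sigma) for v in x]

def euler_step(x, sigma, sigma_next):
    # One Euler step of the probability-flow ODE (EDM parameterization):
    # x_next = x + (sigma_next - sigma) * (x - denoised) / sigma
    denoised = dummy_denoise(x, sigma)
    return [xi + (sigma_next - sigma) * (xi - di) / sigma
            for xi, di in zip(x, denoised)]

def bidirectional_sample(x_init, sigmas):
    # Vanilla bidirectional sampling sketch: forward step (start-frame
    # direction), re-noise back to the current level, then a backward step
    # on the temporally reversed frames (end-frame direction).
    x = list(x_init)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x_fwd = euler_step(x, sigma, sigma_next)
        # Re-noise from sigma_next back up to sigma before the backward pass.
        scale = (sigma**2 - sigma_next**2) ** 0.5
        x_renoised = [v + scale * random.gauss(0.0, 1.0) for v in x_fwd]
        x = euler_step(x_renoised[::-1], sigma, sigma_next)[::-1]
    return x
```

With the paper's settings one would use 25 timesteps (here the schedule length is arbitrary); the actual method additionally applies DDS guidance and a CFG++ update at each step, which this sketch omits.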