Disentangled Motion Modeling for Video Frame Interpolation

Authors: Jaihyun Lew, Jooyoung Choi, Chaehun Shin, Dahuin Jung, Sungroh Yoon

AAAI 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental. "Our experiments validate the effectiveness and efficiency of our proposed training scheme and architecture, demonstrating superior performance across various benchmarks in terms of perceptual metrics... Quantitative Results: Tables 1 and 2 present our quantitative results across four benchmark datasets. MoMo achieves state-of-the-art on all four subsets of SNU-FILM... We conduct ablation studies to verify the effects of our design choices."
Researcher Affiliation — Academia. "(1) Interdisciplinary Program in AI, Seoul National University; (2) Department of Electrical and Computer Engineering, Seoul National University; (3) School of Computer Science and Engineering, Soongsil University; (4) AIIS, ASRI and INMC, Seoul National University. EMAIL, EMAIL, EMAIL"
Pseudocode — No. "The paper describes the proposed MoMo framework, its two-stage training process, and architectural details in Sections 3.1, 3.2, and 3.3, but it does not present any structured pseudocode or algorithm blocks."
Open Source Code — No. "The paper does not provide an explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository."
Open Datasets — Yes. "We train our model on the Vimeo90k dataset (Xue et al. 2019), using random 256×256 crops with augmentations like 90° rotation, flipping, and frame order reversing. We evaluate on well-known VFI benchmarks: Vimeo90k (Xue et al. 2019), SNU-FILM (Choi et al. 2020), Middlebury (others-set) (Baker et al. 2011), and Xiph (Montgomery and Lars 1994; Niklaus and Liu 2020), chosen for their broad motion diversity and magnitudes."
Dataset Splits — Yes. "We train our model on the Vimeo90k dataset (Xue et al. 2019), using random 256×256 crops with augmentations like 90° rotation, flipping, and frame order reversing. We evaluate on well-known VFI benchmarks: Vimeo90k (Xue et al. 2019), SNU-FILM (Choi et al. 2020), Middlebury (others-set) (Baker et al. 2011), and Xiph (Montgomery and Lars 1994; Niklaus and Liu 2020), chosen for their broad motion diversity and magnitudes."
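The training augmentations quoted above (random 256×256 crops, 90° rotation, flipping, frame-order reversal) can be sketched in plain Python over list-based frames. This is a minimal illustration of the described pipeline, not the authors' implementation; the function names, crop logic, and 0.5 application probabilities are assumptions.

```python
import random

def random_crop(frames, size, rng):
    # Crop the same randomly placed size x size window from every frame,
    # so the clip stays spatially aligned across time.
    h, w = len(frames[0]), len(frames[0][0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [[row[left:left + size] for row in f[top:top + size]] for f in frames]

def rotate90(frame):
    # Rotate a single frame 90 degrees clockwise.
    return [list(col) for col in zip(*frame[::-1])]

def augment(frames, size, seed=0):
    # Apply the augmentations described for training: random crop,
    # then (each with assumed probability 0.5) 90-degree rotation,
    # horizontal flip, and temporal frame-order reversal.
    rng = random.Random(seed)
    frames = random_crop(frames, size, rng)
    if rng.random() < 0.5:
        frames = [rotate90(f) for f in frames]
    if rng.random() < 0.5:
        frames = [[row[::-1] for row in f] for f in frames]  # horizontal flip
    if rng.random() < 0.5:
        frames = frames[::-1]  # reverse temporal order
    return frames
```

In a real pipeline these operations would run on tensors (e.g. with torchvision transforms) rather than nested lists, but the clip-level coupling shown here, one random decision shared by all frames, is the important property.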
Hardware Specification — Yes. "Runtime tests on an NVIDIA 32 GB V100 GPU for 256×448-resolution frames, averaged over 100 iterations, reveal that our Convex-Up U-Net processes frames in approximately 145.49 ms each, achieving a 4.15× speedup over the standard U-Net and a 70% faster inference speed than the LDMVFI baseline."
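The latency methodology quoted above (per-frame time averaged over 100 iterations) can be reproduced with a small harness like the one below. This is a generic sketch, not the authors' benchmark code; the warm-up count and the `dummy_forward` stand-in are assumptions.

```python
import time

def benchmark(fn, *args, iters=100, warmup=10):
    # Average wall-clock latency of fn over `iters` timed runs, after
    # `warmup` untimed runs. For a GPU model you would also synchronize
    # the device (e.g. torch.cuda.synchronize()) before reading the clock,
    # since CUDA kernel launches are asynchronous.
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters * 1000.0  # ms per call

# Stand-in for a model forward pass on one 256x448 frame pair.
def dummy_forward(x):
    return sum(x)

ms_per_frame = benchmark(dummy_forward, list(range(1000)))
```

Averaging over many iterations after a warm-up matters because the first calls typically pay one-time costs (allocation, kernel compilation, cache warming) that would otherwise skew the reported per-frame time.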
Software Dependencies — No. "We adopt pre-trained RAFT (Teed and Deng 2020) for optical flow model F... We use the standard timestep-conditioned U-Net architecture (UNet2DModel) from the diffusers library (von Platen et al. 2022)..." The paper names specific software components such as RAFT and the diffusers library, but it does not provide explicit version numbers for them, which are required for reproducible software details.
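The gap identified here, named dependencies without versions, is the kind of thing a pinned requirements file would close. The fragment below is purely illustrative: the paper does not state which versions were used, so these pins are assumptions, not the authors' environment.

```text
# requirements.txt (illustrative pins only; the paper reports no versions)
torch==2.1.0
diffusers==0.25.0   # provides UNet2DModel (von Platen et al. 2022)
```

A reproducible release would typically ship such a file, or a full environment lockfile, alongside the code.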
Experiment Setup — No. "Implementation Details: We train our model on the Vimeo90k dataset (Xue et al. 2019), using random 256×256 crops with augmentations like 90° rotation, flipping, and frame order reversing. We recommend the reader refer to the Appendix for further details." The paper describes the dataset, cropping, and augmentations, along with the composition of the loss function L_s = λ1·L1 + λp·Lp + λG·LG, but the main text does not provide specific hyperparameter values such as learning rates, batch sizes, optimizer settings, or the loss weights (the λ values).
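The composite loss L_s = λ1·L1 + λp·Lp + λG·LG is a weighted sum of a reconstruction term, a perceptual term, and an adversarial term. The sketch below shows only that combination; the λ values are placeholders (the paper's main text does not report them), and the perceptual and adversarial terms are passed in as stand-in callables rather than real LPIPS or discriminator losses.

```python
def l1_loss(pred, target):
    # Mean absolute error over flattened predictions and targets.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def combined_loss(pred, target, perceptual_fn, gan_fn,
                  lam1=1.0, lam_p=0.1, lam_g=0.01):
    # L_s = lam1 * L1 + lam_p * Lp + lam_g * LG.
    # The lambda defaults here are illustrative assumptions, not the
    # paper's values, which are deferred to its appendix.
    return (lam1 * l1_loss(pred, target)
            + lam_p * perceptual_fn(pred, target)
            + lam_g * gan_fn(pred))

# Usage with trivial stand-ins for the perceptual and adversarial terms:
loss = combined_loss([0.5, 0.2], [0.4, 0.0],
                     perceptual_fn=lambda p, t: 0.0,
                     gan_fn=lambda p: 0.0)
```

With the perceptual and adversarial terms zeroed out, the result reduces to the weighted L1 term alone, which makes the weighting easy to sanity-check in isolation.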