Framer: Interactive Frame Interpolation

Authors: Wen Wang, Qiuyu Wang, Kecheng Zheng, Hao Ouyang, Zhekai Chen, Biao Gong, Hao Chen, Yujun Shen, Chunhua Shen

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental "Extensive experimental results demonstrate the appealing performance of Framer on various applications, such as image morphing, time-lapse video generation, cartoon interpolation, etc. The code, model, and interface are publicly accessible at https://github.com/aim-uofa/Framer." Section 4 (Experiments) covers implementation details, qualitative and quantitative comparisons, a user study, and ablation studies, demonstrating that Framer produces smooth and visually appealing transitions and outperforms existing methods, particularly in cases involving complex motions and significant appearance changes. By combining the strengths of generative models with user-guided interactions, Framer improves both the quality and controllability of the interpolated frames.
Researcher Affiliation Collaboration Wen Wang^1,2, Qiuyu Wang^2, Kecheng Zheng^2, Hao Ouyang^2, Zhekai Chen^1, Biao Gong^2, Hao Chen^1, Yujun Shen^2, Chunhua Shen^3,1,2 (1: Zhejiang University; 2: Ant Group; 3: Zhejiang University of Technology)
Pseudocode Yes We provide pseudocode in Alg. 1 to illustrate the point trajectory estimation method.
Open Source Code Yes The code, model, and interface are publicly accessible at https://github.com/aim-uofa/Framer.
Open Datasets Yes Our method is built on SVD and trained on the high-quality OpenVidHD-0.4M dataset (Nan et al., 2024). We conduct quantitative and qualitative analyses, as well as user studies, on two publicly available datasets: DAVIS (Pont-Tuset et al., 2017) and UCF101 (Soomro et al., 2012). Following the practice of Zhong et al. (2024), we conduct experiments on the Vimeo90K septuplet dataset (Xue et al., 2019), X4K1000FPS (Sim et al., 2021), and Adobe240 (Su et al., 2017).
Dataset Splits No Our method is built on SVD and trained on the high-quality OpenVidHD-0.4M dataset (Nan et al., 2024). Both DAVIS-7 and UCF101-7 are obtained by sampling 7 consecutive video frames from the corresponding datasets. We use all videos in the DAVIS dataset and a subset of 400 videos in the UCF101 dataset.
Hardware Specification Yes All training is performed on 16 NVIDIA A100 GPUs, and the total batch size is 16. Training takes about 2 days. During sampling, it takes about 4.64 seconds to generate 7 interpolated frames on the DAVIS-7 dataset, or 0.67 seconds on average per interpolated frame. The latency for generating 7 intermediate frames is measured in seconds on an NVIDIA A6000 GPU.
Software Dependencies No Our approach builds on a video diffusion model to exploit this prior. Considering that Image-to-Video (I2V) diffusion models naturally support first-frame conditioning, we choose the representative I2V diffusion model, Stable Video Diffusion (SVD) (Blattmann et al., 2023a), as our base model, as shown in Fig. 2d. The model is trained for 100K iterations using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1e-5. We obtain the point trajectories by preprocessing the video with CoTracker (Karaev et al., 2023). We follow the conditioning mechanism in ControlNet (Zhang et al., 2023b) to incorporate the trajectory control.
Experiment Setup Yes The model is trained for 100K iterations using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1e-5. When training the control module, we fix the U-Net and optimize the control module for 10K steps using the AdamW optimizer with a learning rate of 1e-5. All training is performed on 16 NVIDIA A100 GPUs, and the total batch size is 16. During autopilot-mode sampling, we keep the m = 5 best-matching keypoints for trajectory guidance, and the distance thresholds for point tracking are set to r1 = 5 and r2 = 3. We sample 14 consecutive frames from videos at a spatial resolution of 512×320. Specifically, we center-crop the video to an aspect ratio of 512/320, then resize the frames to 512×320. Random horizontal flipping is used for data augmentation. We sample the video in the temporal dimension with a frame interval of 2. For training the point-trajectory-based ControlNet, we sample 1 to 10 trajectories with larger motions. During autopilot-mode sampling, we use the Euler sampler with 30 diffusion steps in total.
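The ControlNet-style conditioning cited in the Software Dependencies row boils down to a trainable control branch whose output enters the frozen base network through zero-initialized layers, so training starts exactly from the base model's behavior. A minimal NumPy sketch of that zero-initialization property (all names here are illustrative, not from the Framer codebase):

```python
import numpy as np

class ZeroLinear:
    """Analogue of ControlNet's zero-initialized conv: starts as all zeros,
    so the control branch contributes nothing at the first training step."""
    def __init__(self, dim):
        self.w = np.zeros((dim, dim))

    def __call__(self, x):
        return x @ self.w

def controlled_forward(base, branch, zero, x, cond):
    # frozen base features plus a residual from the trainable control branch
    return base(x) + zero(branch(cond))

# At initialization, the controlled model matches the frozen base exactly.
base = branch = lambda v: v          # stand-ins for the U-Net / control encoder
zero = ZeroLinear(4)
x = np.arange(4.0)
cond = np.ones(4)
assert np.allclose(controlled_forward(base, branch, zero, x, cond), base(x))
```

Because the residual path is exactly zero at the start, gradients can flow into the control branch without perturbing the pretrained SVD behavior at initialization.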
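The data pipeline described in the Experiment Setup row (center-crop to a 512/320 aspect ratio, resize to 512×320, and stride-2 temporal sampling of 14 frames) can be sketched as follows; the function names are illustrative, not taken from the released code:

```python
def center_crop_box(w, h, aspect_w=512, aspect_h=320):
    """(left, top, right, bottom) of a center crop with the target aspect ratio."""
    target = aspect_w / aspect_h
    if w / h > target:                      # frame too wide: trim left and right
        new_w = round(h * target)
        left = (w - new_w) // 2
        return (left, 0, left + new_w, h)
    new_h = round(w / target)               # frame too tall: trim top and bottom
    top = (h - new_h) // 2
    return (0, top, w, top + new_h)

def sample_clip_indices(num_frames, clip_len=14, interval=2, start=0):
    """Indices of `clip_len` consecutive frames sampled with a fixed temporal stride."""
    idx = [start + i * interval for i in range(clip_len)]
    if idx[-1] >= num_frames:
        raise ValueError("video too short for this clip length and stride")
    return idx

# A 1920x1080 frame is cropped to a centered 1728x1080 window (1728/1080 = 1.6,
# matching 512/320), then resized to 512x320 by the image library of your choice.
```

Random horizontal flipping would then be applied to the cropped, resized clip as the sole augmentation, per the quoted setup.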