Looking Backward: Streaming Video-to-Video Translation with Feature Banks

Authors: Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method with quantitative metrics, such as CLIP score (Radford et al., 2021) and warp error (Lai et al., 2018), and a user study. Our findings indicate that users significantly favor our StreamV2V over StreamDiffusion (Kodaira et al., 2023) (with over 70% win rates) and CoDeF (Ouyang et al., 2023) (with over 80% win rates).
Researcher Affiliation | Academia | 1 UT Austin, 2 UC Berkeley
Pseudocode | Yes | A.5.2 PSEUDO CODE OF DYNAMIC MERGING: import torch; import torch.nn.functional as F; def dynamic_merge(current_frame, feature_bank): ...
Open Source Code | Yes | Demo, code, and models are available on the project page: https://jeff-liangf.github.io/projects/streamv2v
Open Datasets | Yes | Following TokenFlow (Geyer et al., 2023) and FlowVid (Liang et al., 2023), we build our user study by selecting 19 object-centric videos from the DAVIS trainval 2017 dataset (Pont-Tuset et al., 2017), covering diverse subjects such as humans and animals.
Dataset Splits | No | Following TokenFlow (Geyer et al., 2023) and FlowVid (Liang et al., 2023), we build our user study by selecting 19 object-centric videos from the DAVIS trainval 2017 dataset (Pont-Tuset et al., 2017)...
Hardware Specification | Yes | StreamV2V can run 20 FPS on one A100 GPU, being 15×, 46×, 108×, and 158× faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively.
Software Dependencies | No | We built our method on StreamDiffusion (Kodaira et al., 2023) with Latent Consistency Model (Luo et al., 2023b). By default, we use a 4-step LCM without the classifier-free guidance (Ho & Salimans, 2022). We continue to use xFormers (Lefaudeux et al., 2022) for fair comparison with existing methods.
Experiment Setup | Yes | By default, we use a 4-step LCM without the classifier-free guidance (Ho & Salimans, 2022). We update the feature bank every 4 frames. The underlying image-to-image method is SDEdit (Meng et al., 2021), with an initial noise strength of 0.4.
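The pseudocode row above quotes only the header of the paper's `dynamic_merge` function. As context for what similarity-based feature merging can look like, here is a hypothetical, dependency-free sketch: it merges each incoming feature into its most cosine-similar bank entry, or appends it otherwise. The `threshold` value and the uniform-averaging rule are illustrative assumptions, not the paper's algorithm.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dynamic_merge(current_features, feature_bank, threshold=0.9):
    """Merge each current-frame feature into its most similar bank entry;
    if nothing in the bank is similar enough, append it as a new entry.
    The threshold and averaging scheme are illustrative only."""
    merged = [list(f) for f in feature_bank]
    for feat in current_features:
        sims = [cosine_sim(feat, entry) for entry in merged]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if sims and sims[best] >= threshold:
            # Fold the new feature into the matched entry (uniform weights).
            merged[best] = [(x + y) / 2 for x, y in zip(merged[best], feat)]
        else:
            merged.append(list(feat))
    return merged
```

For example, a feature nearly parallel to an existing bank entry is averaged into it, while an orthogonal feature becomes a new entry.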
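The experiment-setup row states that the feature bank is refreshed every 4 frames. A minimal pure-Python sketch of that update cadence, assuming a fixed-capacity bank (the `BANK_SIZE` capacity and all function names here are illustrative, not from the paper):

```python
from collections import deque

BANK_UPDATE_INTERVAL = 4  # per the paper: the bank is updated every 4 frames
BANK_SIZE = 3             # illustrative capacity; not specified in this excerpt

def stream_frames(frames, bank_update_interval=BANK_UPDATE_INTERVAL):
    """Yield (frame, bank_snapshot) pairs for a streaming pipeline,
    caching a frame's features into the bank every N-th frame."""
    bank = deque(maxlen=BANK_SIZE)  # oldest entries are evicted automatically
    for i, frame in enumerate(frames):
        if i % bank_update_interval == 0:
            bank.append(frame)  # stand-in for caching this frame's features
        yield frame, list(bank)
```

On a 12-frame stream, frames 0, 4, and 8 trigger bank updates, so later frames can look backward at features cached from those earlier frames.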