Looking Backward: Streaming Video-to-Video Translation with Feature Banks
Authors: Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method with quantitative metrics, such as CLIP score (Radford et al., 2021) and warp error (Lai et al., 2018), and a user study. Our findings indicate that users significantly favor our StreamV2V over StreamDiffusion (Kodaira et al., 2023) (with over 70% win rates) and CoDeF (Ouyang et al., 2023) (with over 80% win rates). |
| Researcher Affiliation | Academia | 1 UT Austin 2 UC Berkeley EMAIL, EMAIL |
| Pseudocode | Yes | A.5.2 PSEUDO CODE OF DYNAMIC MERGING: `import torch`, `import torch.nn.functional as F`, `def dynamic_merge(current_frame, feature_bank):` |
| Open Source Code | Yes | Demo, code, and models are available on the project page. https://jeff-liangf.github.io/projects/streamv2v |
| Open Datasets | Yes | Following TokenFlow (Geyer et al., 2023) and FlowVid (Liang et al., 2023), we build our user study by selecting 19 object-centric videos from the DAVIS trainval 2017 dataset (Pont-Tuset et al., 2017), covering diverse subjects such as humans and animals. |
| Dataset Splits | No | Following TokenFlow (Geyer et al., 2023) and FlowVid (Liang et al., 2023), we build our user study by selecting 19 object-centric videos from the DAVIS trainval 2017 dataset (Pont-Tuset et al., 2017)... |
| Hardware Specification | Yes | StreamV2V can run 20 FPS on one A100 GPU, being 15×, 46×, 108×, and 158× faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. |
| Software Dependencies | No | We built our method on StreamDiffusion (Kodaira et al., 2023) with Latent Consistency Model (Luo et al., 2023b). By default, we use a 4-step LCM without the classifier-free guidance (Ho & Salimans, 2022). We continue to use xFormers (Lefaudeux et al., 2022) for fair comparison with existing methods. |
| Experiment Setup | Yes | By default, we use a 4-step LCM without the classifier-free guidance (Ho & Salimans, 2022). We update the feature bank every 4 frames. The underlying image-to-image method is SDEdit (Meng et al., 2021), with an initial noise strength of 0.4. |
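The pseudocode excerpt quoted above is truncated after the `dynamic_merge` signature. As a rough illustration of what merging a frame's features into a feature bank can look like, here is a minimal sketch: tokens whose cosine similarity to an existing bank entry exceeds a threshold are averaged into that entry, and novel tokens are appended. The function body, the `sim_threshold` value, and the use of numpy instead of torch are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def dynamic_merge(current_feats, bank_feats, sim_threshold=0.9):
    """Illustrative sketch of merging frame features into a feature bank.

    current_feats: (n_cur, d) features from the current frame.
    bank_feats:    (n_bank, d) features accumulated from past frames.
    Tokens similar to an existing bank entry (cosine similarity above the
    threshold) are averaged into it; dissimilar tokens are appended.
    """
    if bank_feats.size == 0:
        return current_feats.copy()
    # Row-normalize so a dot product gives cosine similarity.
    cur = current_feats / np.linalg.norm(current_feats, axis=1, keepdims=True)
    bank = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sim = cur @ bank.T                       # (n_cur, n_bank) similarities
    best = sim.argmax(axis=1)                # most similar bank entry per token
    merged = bank_feats.copy()
    new_rows = []
    for i, j in enumerate(best):
        if sim[i, j] >= sim_threshold:
            # Near-duplicate: fold into the existing bank entry.
            merged[j] = 0.5 * (merged[j] + current_feats[i])
        else:
            # Novel content: keep it as a new bank entry.
            new_rows.append(current_feats[i])
    if new_rows:
        merged = np.vstack([merged] + [np.asarray(new_rows)])
    return merged
```

Per the Experiment Setup row, such a bank would only be updated every 4 frames, which bounds both the merge cost and the bank's drift from the current scene.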