VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing

Authors: Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available on the project page." (...) "4 EXPERIMENTS"
Researcher Affiliation | Academia | "1 ReLER Lab, AAII, University of Technology Sydney; 2 ReLER Lab, CCAI, Zhejiang University"
Pseudocode | No | The paper describes the methodology using textual descriptions and diagrams (Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code, data, and demos are available on the project page." Project page: https://knightyxp.github.io/VideoGrain_project_page
Open Datasets | Yes | "We evaluate our VideoGrain using a dataset of 76 video-text pairs, including videos from DAVIS (Perazzi et al., 2016), TGVE [1], and the Internet [2], with 16-32 frames per video." [1] https://sites.google.com/view/loveucvpr23/track4 [2] https://www.istockphoto.com/ and https://www.pexels.com/
Dataset Splits | No | "We evaluate our VideoGrain using a dataset of 76 video-text pairs, including videos from DAVIS (Perazzi et al., 2016), TGVE, and the Internet, with 16-32 frames per video." The paper does not provide specific details on how this dataset is split into training, validation, or test sets.
Hardware Specification | Yes | "All the experiments are conducted on an NVIDIA A40 GPU."
Software Dependencies | Yes | "In the experiment, we adopt the pretrained Stable Diffusion v1.5 as the base model, using 50 steps of DDIM inversion and denoising. Our VideoGrain operates in a zero-shot manner, requiring no additional parameter tuning."
Experiment Setup | Yes | "In the experiment, we adopt the pretrained Stable Diffusion v1.5 as the base model, using 50 steps of DDIM inversion and denoising. Our VideoGrain operates in a zero-shot manner, requiring no additional parameter tuning. To enhance memory efficiency, we re-engineer slice attention within our ST-Layout Attn. ST-Layout Attn is applied during the first 15 denoising steps. We set ξ(t) = 0.3·t^5 for self-attention and ξ(t) = t^5 for cross-attention, where the timestep t ∈ [0, 1] is normalized. All the experiments are conducted on an NVIDIA A40 GPU."
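The inference schedule quoted above (50 DDIM steps, ST-Layout Attn active for the first 15, timestep-dependent modulation thresholds) can be sketched as a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function and variable names are hypothetical, and the exponent 5 in ξ(t) is an assumption based on the (superscript-stripped) quote "0.3 t5".

```python
# Illustrative sketch of the VideoGrain inference schedule as quoted above.
# All names are hypothetical; the actual implementation may differ.

NUM_STEPS = 50          # DDIM inversion / denoising steps (from the paper)
ST_LAYOUT_STEPS = 15    # ST-Layout Attn applied during the first 15 denoising steps

def xi_self(t: float) -> float:
    """Self-attention modulation threshold, assuming xi(t) = 0.3 * t**5."""
    return 0.3 * t ** 5

def xi_cross(t: float) -> float:
    """Cross-attention modulation threshold, assuming xi(t) = t**5."""
    return t ** 5

def schedule():
    """Yield (step, normalized t, ST-Layout Attn active?, xi_self, xi_cross).

    t runs from 1 (high noise) down to 0 over the denoising trajectory,
    matching the paper's statement that t in [0, 1] is normalized.
    """
    for step in range(NUM_STEPS):
        t = 1.0 - step / (NUM_STEPS - 1)
        active = step < ST_LAYOUT_STEPS
        yield step, t, active, xi_self(t), xi_cross(t)

if __name__ == "__main__":
    for step, t, active, s, c in schedule():
        if active:
            print(f"step {step:2d}: t={t:.2f}  xi_self={s:.4f}  xi_cross={c:.4f}")
```

Because both thresholds decay with t, the attention modulation is strongest early in denoising (where layout is decided) and vanishes by the time ST-Layout Attn is switched off at step 15.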