VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing
Authors: Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available on the project page. (...) 4 EXPERIMENTS |
| Researcher Affiliation | Academia | 1 ReLER Lab, AAII, University of Technology Sydney 2 ReLER Lab, CCAI, Zhejiang University |
| Pseudocode | No | The paper describes the methodology using textual descriptions and diagrams (Figure 4) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code, data, and demos are available on the project page. Project Page: https://knightyxp.github.io/VideoGrain_project_page |
| Open Datasets | Yes | We evaluate our VideoGrain using a dataset of 76 video-text pairs, including videos from DAVIS (Perazzi et al., 2016), TGVE (https://sites.google.com/view/loveucvpr23/track4), and the Internet (https://www.istockphoto.com/ and https://www.pexels.com/), with 16-32 frames per video. |
| Dataset Splits | No | We evaluate our VideoGrain using a dataset of 76 video-text pairs, including videos from DAVIS (Perazzi et al., 2016), TGVE, and the Internet, with 16-32 frames per video. The paper does not provide specific details on how this dataset is split into training, validation, or test sets. |
| Hardware Specification | Yes | All the experiments are conducted on an NVIDIA A40 GPU. |
| Software Dependencies | Yes | In the experiment, we adopt the pretrained Stable Diffusion v1.5 as the base model, using 50 steps of DDIM inversion and denoising. Our VideoGrain operates in a zero-shot manner, requiring no additional parameter tuning. |
| Experiment Setup | Yes | In the experiment, we adopt the pretrained Stable Diffusion v1.5 as the base model, using 50 steps of DDIM inversion and denoising. Our VideoGrain operates in a zero-shot manner, requiring no additional parameter tuning. To enhance memory efficiency, we re-engineer slice attention within our ST-Layout Attn. ST-Layout Attn is applied during the first 15 denoising steps. We set ξ(t) = 0.3·t⁵ for self-attention and ξ(t) = t⁵ for cross-attention, where the timestep t ∈ [0, 1] is normalized. All the experiments are conducted on an NVIDIA A40 GPU. |
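The threshold schedule quoted in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration of the stated formulas only, assuming ξ(t) = 0.3·t⁵ for self-attention and ξ(t) = t⁵ for cross-attention with t normalized to [0, 1]; the function names `xi` and `st_layout_attn_active`, and the `attn_type` parameter, are hypothetical and not taken from the paper's code.

```python
def xi(t_norm: float, attn_type: str = "self") -> float:
    """Modulation threshold xi(t) from the quoted setup.

    t_norm: denoising timestep normalized to [0, 1].
    attn_type: "self" -> xi(t) = 0.3 * t**5, "cross" -> xi(t) = t**5.
    """
    if not 0.0 <= t_norm <= 1.0:
        raise ValueError("t_norm must lie in [0, 1]")
    base = t_norm ** 5
    return 0.3 * base if attn_type == "self" else base


def st_layout_attn_active(step: int, total_steps: int = 50) -> bool:
    """Per the setup quote, ST-Layout Attn runs only during the
    first 15 of the 50 denoising steps (steps indexed from 0)."""
    return step < 15 and step < total_steps
```

For example, at the final normalized timestep t = 1.0 the self-attention threshold is 0.3 and the cross-attention threshold is 1.0, and `st_layout_attn_active(14)` is the last step where the layout attention applies.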