VidEvo: Evolving Video Editing through Exhaustive Temporal Modeling
Authors: Sizhe Dang, Huan Liu, Mengmeng Wang, Xin Lai, Guang Dai, Jingdong Wang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations show that VidEvo enhances frame-to-frame temporal consistency. Ablation studies confirm NVE and WFA's effectiveness and their plug-and-play capability with other methods. In this section, we present quantitative and qualitative analyses, ablation studies, and orthogonality analyses. Our method is primarily evaluated on the DAVIS [Pont-Tuset et al., 2017] dataset for comparison with existing works. |
| Researcher Affiliation | Collaboration | Xi'an Jiaotong University; Zhejiang University of Technology; SGIT AI Lab, State Grid Corporation of China; Baidu Inc. |
| Pseudocode | Yes | Algorithm 1 VidEvo video editing |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to a code repository. There is no mention of code being included in supplementary materials. |
| Open Datasets | Yes | Our method is primarily evaluated on the DAVIS [Pont-Tuset et al., 2017] dataset for comparison with existing works. |
| Dataset Splits | No | The paper mentions using the DAVIS dataset but does not specify any particular training, validation, or test splits. It implies standard usage for comparison but provides no details on how the data was partitioned. |
| Hardware Specification | No | The paper discusses memory usage and runtime in Table 1 and Section 4.4, providing values like "10.4GB for pipeline memory and 19GB for tuning" or "17.2GB for pipeline memory alongside 6.6GB for NVE tuning." However, it does not specify any concrete hardware components such as specific GPU or CPU models used for these measurements or experiments. |
| Software Dependencies | No | The paper mentions general models like "Stable Diffusion Model" and "CLIP model" and a framework "P2P-based editing methods," but it does not specify any programming languages, libraries, or solvers with their respective version numbers that would be required for reproducibility. |
| Experiment Setup | Yes | Employing DDIM inversion with a default guidance scale of w = 7.5, our objective at each time step t is to minimize the following: \(z_{t-1} \leftarrow z_{t-1}(z^I_t, \phi_t, C)\) ... To address these issues, we propose the WFA mechanism as an alternative to traditional self-attention. As shown in Fig. 4, this mechanism uses a window size of \(\lambda\) (e.g., 3) to allow each token... Our ablation studies reveal that for videos with minimal motion, a window size of 3 for our WFA effectively maintains temporal consistency and achieves robust results without significant computational overhead. |
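The Experiment Setup row quotes a DDIM inversion procedure with a classifier-free guidance scale of w = 7.5. The paper's exact update is not reproduced here; the following is a minimal generic sketch of the two standard ingredients that setup relies on — the classifier-free guidance blend of noise predictions and one deterministic DDIM step — with function names and the numpy framing being our own illustrative choices, not the paper's code.

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: blend unconditional and conditional
    noise predictions with guidance scale w (w = 7.5 in the paper)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def ddim_step(z_t, eps, alpha_t, alpha_prev):
    """One deterministic DDIM update z_t -> z_{t-1}, given the predicted
    noise eps and the cumulative alphas at steps t and t-1."""
    # Predict the clean latent implied by the current noise estimate.
    z0_pred = (z_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
    # Re-noise it to the previous timestep's noise level.
    return np.sqrt(alpha_prev) * z0_pred + np.sqrt(1.0 - alpha_prev) * eps
```

With w = 1 the guided prediction reduces to the conditional one; larger w pushes the sample toward the text condition C, which is why 7.5 is a common default for Stable Diffusion pipelines.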
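The same row describes the WFA mechanism as a self-attention replacement in which each token attends within a temporal window of size λ (e.g., 3). The paper releases no code, so the sketch below is our own reading of that description, assuming a centred window over frames and standard scaled dot-product attention; the function name and tensor layout are hypothetical.

```python
import numpy as np

def window_frame_attention(tokens, window=3):
    """Sketch of window-based frame attention (WFA).

    tokens: (F, N, D) array -- F frames, N tokens per frame, D channels.
    Each query token attends to the tokens of all frames inside a
    temporal window of `window` frames centred on its own frame.
    """
    F, N, D = tokens.shape
    half = window // 2
    out = np.empty_like(tokens)
    for i in range(F):
        lo, hi = max(0, i - half), min(F, i + half + 1)
        kv = tokens[lo:hi].reshape(-1, D)       # keys/values from the window
        q = tokens[i]                           # (N, D) queries of frame i
        scores = q @ kv.T / np.sqrt(D)          # scaled dot-product scores
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
        out[i] = attn @ kv
    return out
```

With window=1 this collapses to per-frame self-attention; window=3 matches the ablation finding that a small window already maintains temporal consistency for low-motion videos without much extra compute, since each query sees only 3N keys instead of F·N.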