EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
Authors: Yupeng Chen, Penglin Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four key dimensions, evaluating models on four categories of tasks, and introduces three new metrics to assess fidelity. This task-oriented framework facilitates objective evaluation by breaking down model performance into details, providing insights into each model's strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models. Additionally, we utilize EditBoard to evaluate five state-of-the-art video editing models, deriving valuable insights from the results. This section presents the evaluation experiment conducted using EditBoard, along with the human alignment test, which is specifically designed to validate the correlation between automatic metrics and human perception. We rigorously evaluate five state-of-the-art video editing models: FateZero (Qi et al. 2023), Control-A-Video (Chen et al. 2023b), Ground-A-Video (Jeong and Ye 2023), TokenFlow (Geyer et al. 2023), and Video-P2P (Liu et al. 2024). For each model, EditBoard generates a detailed transcript that reports performance across each dimension and task type. |
| Researcher Affiliation | Academia | 1. The Chinese University of Hong Kong, Shenzhen; 2. Nanjing University; 3. University of Leeds. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the mathematical formulation of the problem and various metrics, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/Samchen2003/EditBoard |
| Open Datasets | Yes | We utilize samples from the DAVIS dataset (Pont-Tuset et al. 2017) to obtain the masks required for conducting Semantic Score testing. We also select samples from the LOVEU-TGVE-2023 dataset (Wu et al. 2023b). |
| Dataset Splits | No | For each task, we select 10 videos containing a variety of objects, such as cars, animals, and humans. Each original video is paired with at least two target prompts according to the task category. For generating source prompts, we employ BLIP-2 (Li et al. 2023) for the automated generation of video captions. The original frames are resized to a uniform resolution of 512 × 512 to match the configuration of the testing models. We also ensure that sufficient original videos meet the requirements for applying FF-α, allowing for the full adoption of both FF-α and FF-β. Additionally, we adjust the target prompts for Single Object Single Attribute (SOSA) so that more than half of the edits focus on the foreground object to facilitate Semantic Score evaluation. The paper describes the selection of videos for testing and human alignment but does not specify formal training/test/validation splits with percentages or counts for reproducing a model training process or the benchmark's internal data partitioning. |
| Hardware Specification | Yes | The experiments are conducted on a single NVIDIA GeForce RTX 4090. |
| Software Dependencies | Yes | To ensure a fair comparison, we use Stable Diffusion v1-5 as the base for all video editing models. |
| Experiment Setup | Yes | To ensure a fair comparison, we use Stable Diffusion v1-5 as the base for all video editing models. The original frames are resized to a uniform resolution of 512 × 512 to match the configuration of the testing models. |
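The experiment-setup rows above note that all source frames are resized to a uniform 512 × 512 resolution before being fed to the Stable Diffusion v1-5-based editing models. A minimal sketch of that preprocessing step, using Pillow, is shown below; the function name and the stand-in frame are illustrative and not taken from the EditBoard codebase.

```python
# Hypothetical sketch of the frame preprocessing described in the setup:
# resize each source video frame to 512x512 to match the configuration
# of the Stable Diffusion v1-5-based testing models.
from PIL import Image

TARGET_SIZE = (512, 512)  # uniform resolution used for all models


def preprocess_frame(frame: Image.Image) -> Image.Image:
    """Resize a single video frame to the benchmark's uniform resolution."""
    # Convert to RGB first so grayscale/RGBA frames are handled uniformly.
    return frame.convert("RGB").resize(TARGET_SIZE, Image.LANCZOS)


if __name__ == "__main__":
    frame = Image.new("RGB", (1280, 720))  # stand-in for a real video frame
    print(preprocess_frame(frame).size)  # (512, 512)
```

In practice the same transform would be applied to every frame of each source clip before captioning and editing, so that all five models receive identically sized inputs.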