EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models

Authors: Yupeng Chen, Penglin Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four key dimensions, evaluating models on four categories of tasks, and introduces three new metrics to assess fidelity. This task-oriented framework facilitates objective evaluation by breaking down model performance into detailed, per-dimension scores, providing insights into each model's strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models. Additionally, we utilize EditBoard to evaluate five state-of-the-art video editing models, deriving valuable insights from the results. This section presents the evaluation experiment conducted using EditBoard, along with the human alignment test, which is specifically designed to validate the correlation between automatic metrics and human perception. We rigorously evaluate five state-of-the-art video editing models: FateZero (Qi et al. 2023), Control-A-Video (Chen et al. 2023b), Ground-A-Video (Jeong and Ye 2023), TokenFlow (Geyer et al. 2023), and Video-P2P (Liu et al. 2024). For each model, EditBoard generates a detailed transcript that reports performance across each dimension and task type.
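The per-dimension, per-task reporting described above can be sketched as a simple score aggregation. This is a minimal illustration, not EditBoard's actual implementation; the metric, dimension, and task names below are invented placeholders.

```python
from collections import defaultdict

# Hypothetical per-sample records: (model, dimension, metric, task, score).
# Names are illustrative only, not EditBoard's real metric set.
records = [
    ("FateZero", "fidelity", "FF-alpha", "SOSA", 0.81),
    ("FateZero", "fidelity", "FF-beta", "SOSA", 0.78),
    ("FateZero", "text_alignment", "CLIP-T", "style", 0.29),
    ("TokenFlow", "fidelity", "FF-alpha", "SOSA", 0.85),
    ("TokenFlow", "text_alignment", "CLIP-T", "style", 0.31),
]

def scorecard(records):
    """Average scores per (model, dimension) to mimic a per-dimension report."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model, dim, _metric, _task, score in records:
        sums[(model, dim)] += score
        counts[(model, dim)] += 1
    return {key: sums[key] / counts[key] for key in sums}

card = scorecard(records)
```

Grouping by (model, task) instead of (model, dimension) would yield the task-oriented view the benchmark also reports.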
Researcher Affiliation Academia 1 The Chinese University of Hong Kong, Shenzhen; 2 Nanjing University; 3 University of Leeds. EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the mathematical formulation of the problem and various metrics, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/Samchen2003/EditBoard
Open Datasets Yes We utilize samples from the DAVIS dataset (Pont-Tuset et al. 2017) to obtain the masks required for conducting Semantic Score testing. We also select samples from the LOVEU-TGVE-2023 dataset (Wu et al. 2023b).
Dataset Splits No For each task, we select 10 videos containing a variety of objects, such as cars, animals, and humans. Each original video is paired with at least two target prompts according to the task category. For generating source prompts, we employ BLIP-2 (Li et al. 2023) for the automated generation of video captions. The original frames are resized to a uniform resolution of 512 × 512 to match the configuration of the testing models. We also ensure that sufficient original videos meet the requirements for applying FF-α, allowing for the full adoption of both FF-α and FF-β. Additionally, we adjust the target prompts for Single Object Single Attribute (SOSA) so that more than half of the edits focus on the foreground object to facilitate Semantic Score evaluation. The paper describes the selection of videos for testing and human alignment but does not specify formal training/test/validation splits with percentages or counts for reproducing a model training process or the benchmark's internal data partitioning.
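The pairing rule described above (each of the 10 videos per task matched with at least two target prompts) can be sketched as a small enumeration helper. This is an illustrative sketch only; the video names, prompts, and function name are invented, not from the paper's code.

```python
def build_test_cases(videos, prompts_per_video):
    """Enumerate (video, target prompt) pairs, enforcing the paper's rule that
    each original video is paired with at least two target prompts."""
    cases = []
    for vid in videos:
        targets = prompts_per_video[vid]
        if len(targets) < 2:
            raise ValueError(f"{vid} needs at least two target prompts")
        cases.extend((vid, prompt) for prompt in targets)
    return cases

# Toy example with invented video names and prompts.
prompts = {
    "car.mp4": ["a red sports car", "a car driving in snow"],
    "swan.mp4": ["a white duck", "a swan in Van Gogh style", "an origami swan"],
}
cases = build_test_cases(list(prompts), prompts)
```

In the paper's setup, source prompts for each video would additionally be generated automatically with BLIP-2 captioning rather than written by hand.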
Hardware Specification Yes The experiments are conducted on a single NVIDIA GeForce RTX 4090.
Software Dependencies Yes To ensure a fair comparison, we use Stable Diffusion v1-5 as the base for all video editing models.
Experiment Setup Yes To ensure a fair comparison, we use Stable Diffusion v1-5 as the base for all video editing models. The original frames are resized to a uniform resolution of 512 × 512 to match the configuration of the testing models.