EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models

Authors: Yupeng Chen, Penglin Chen, Xiaoyu Zhang, Yixian Huang, Qian Xie

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To address this gap, we propose EditBoard, the first comprehensive evaluation benchmark for text-based video editing models. EditBoard encompasses nine automatic metrics across four key dimensions, evaluating models on four categories of tasks, and introduces three new metrics to assess fidelity. This task-oriented framework facilitates objective evaluation by breaking down model performance into detailed, per-dimension scores, providing insights into each model's strengths and weaknesses. By open-sourcing EditBoard, we aim to standardize evaluation and advance the development of robust video editing models. Additionally, we utilize EditBoard to evaluate five state-of-the-art video editing models, deriving valuable insights from the results. This section presents the evaluation experiment conducted using EditBoard, along with the human alignment test, which is specifically designed to validate the correlation between automatic metrics and human perception. We rigorously evaluate five state-of-the-art video editing models: FateZero (Qi et al. 2023), Control-A-Video (Chen et al. 2023b), Ground-A-Video (Jeong and Ye 2023), TokenFlow (Geyer et al. 2023), and Video-P2P (Liu et al. 2024). For each model, EditBoard generates a detailed transcript that reports performance across each dimension and task type.
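The per-dimension, per-task reporting described above can be sketched as a simple score aggregation. This is a minimal illustration, not EditBoard's actual implementation; the metric, dimension, and task names below are invented placeholders.

```python
from collections import defaultdict

# Hypothetical per-sample records: (model, dimension, metric, task, score).
# Names are illustrative only, not EditBoard's real metric set.
records = [
    ("FateZero", "fidelity", "FF-alpha", "SOSA", 0.81),
    ("FateZero", "fidelity", "FF-beta", "SOSA", 0.78),
    ("FateZero", "text_alignment", "CLIP-T", "style", 0.29),
    ("TokenFlow", "fidelity", "FF-alpha", "SOSA", 0.85),
    ("TokenFlow", "text_alignment", "CLIP-T", "style", 0.31),
]

def scorecard(records):
    """Average scores per (model, dimension) to mimic a per-dimension report."""
    sums, counts = defaultdict(float), defaultdict(int)
    for model, dim, _metric, _task, score in records:
        sums[(model, dim)] += score
        counts[(model, dim)] += 1
    return {key: sums[key] / counts[key] for key in sums}

card = scorecard(records)
```

Grouping by (model, task) instead of (model, dimension) would yield the task-oriented view the benchmark also reports.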
Researcher Affiliation Academia 1 The Chinese University of Hong Kong, Shenzhen; 2 Nanjing University; 3 University of Leeds. EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the mathematical formulation of the problem and various metrics, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/Samchen2003/EditBoard
Open Datasets Yes We utilize samples from the DAVIS dataset (Pont-Tuset et al. 2017) to obtain the masks required for conducting Semantic Score testing. We also select samples from the LOVEU-TGVE-2023 dataset (Wu et al. 2023b).
Dataset Splits No For each task, we select 10 videos containing a variety of objects, such as cars, animals, and humans. Each original video is paired with at least two target prompts according to the task category. For generating source prompts, we employ BLIP-2 (Li et al. 2023) for the automated generation of video captions. The original frames are resized to a uniform resolution of 512 × 512 to match the configuration of the testing models. We also ensure that sufficient original videos meet the requirements for applying FF-α, allowing for the full adoption of both FF-α and FF-β. Additionally, we adjust the target prompts for Single Object Single Attribute (SOSA) so that more than half of the edits focus on the foreground object to facilitate Semantic Score evaluation. The paper describes the selection of videos for testing and human alignment but does not specify formal training/test/validation splits with percentages or counts for reproducing a model training process or the benchmark's internal data partitioning.
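The pairing rule described above (each of the 10 videos per task matched with at least two target prompts) can be sketched as a small enumeration helper. This is an illustrative sketch only; the video names, prompts, and function name are invented, not from the paper's code.

```python
def build_test_cases(videos, prompts_per_video):
    """Enumerate (video, target prompt) pairs, enforcing the paper's rule that
    each original video is paired with at least two target prompts."""
    cases = []
    for vid in videos:
        targets = prompts_per_video[vid]
        if len(targets) < 2:
            raise ValueError(f"{vid} needs at least two target prompts")
        cases.extend((vid, prompt) for prompt in targets)
    return cases

# Toy example with invented video names and prompts.
prompts = {
    "car.mp4": ["a red sports car", "a car driving in snow"],
    "swan.mp4": ["a white duck", "a swan in Van Gogh style", "an origami swan"],
}
cases = build_test_cases(list(prompts), prompts)
```

In the paper's setup, source prompts for each video would additionally be generated automatically with BLIP-2 captioning rather than written by hand.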
Hardware Specification Yes The experiments are conducted on a single NVIDIA GeForce RTX 4090.
Software Dependencies Yes To ensure a fair comparison, we use Stable Diffusion v1-5 as the base for all video editing models.
Experiment Setup Yes To ensure a fair comparison, we use Stable Diffusion v1-5 as the base for all video editing models. The original frames are resized to a uniform resolution of 512 × 512 to match the configuration of the testing models.