GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning
Authors: Zhun Mou, Bin Xia, Zhengchao Huang, Wenming Yang, Jiaya Jia
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. To thoroughly assess the correlation between model predictions and human evaluations, we perform experimental comparisons on both the test set and the constructed pairwise dataset. |
| Researcher Affiliation | Academia | (1) Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; (2) CSE Department, The Chinese University of Hong Kong, Hong Kong, China; (3) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. |
| Pseudocode | No | The paper describes the data construction pipeline and evaluation process in prose (Sections 3.2 and 3.3) and includes mathematical formulas, but does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | In Sec. 3.2, we introduce our data construction pipeline that collects triples (video, rationale, score) from human annotators to create video-instruction data, which teaches MLLMs to assess specific aspects of generated videos through reasoning. We then propose our GRADEO-Instruct dataset, which we plan to release, in Sec. 3.2. |
| Open Datasets | Yes | Prompt Collection To construct a comprehensive prompt dataset for text-to-video evaluation, we ensured broad coverage across the dimensions of Quality, Aesthetics, Consistency, and Alignment. This was achieved by collecting and sampling from the large-scale text-to-video pretraining dataset WebVid (Bain et al., 2021), the real user-generated prompt dataset VidProM (Wang & Yang, 2024), and existing open-source text-to-video benchmark datasets (Liu et al., 2024b; Huang et al., 2024; Sun et al., 2024; Feng et al., 2024; Liu et al., 2024c; Kou et al., 2024). Additionally, the Safety dimension prompts are curated from the specialized safety-focused dataset SafeSora (Dai et al., 2024). |
| Dataset Splits | Yes | Fine-tuning on the train split of the GRADEO-Instruct dataset we collect in Sec. 3.2, we develop an MLLM evaluator called GRADEO. We extract 340 samples from the collected dataset GRADEO-Instruct to serve as the test set. |
| Hardware Specification | Yes | The learning rate is set to 1 × 10⁻⁵, and the model is trained for 10 epochs on 4 RTX 3090 (24G) GPUs. Except for OpenSora, which runs inference on 2 RTX 3090 GPUs, and Kling, which generates videos on the web, all other models perform inference on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using Qwen2-VL-7B as the base model and applying the LoRA fine-tuning method, but does not provide specific version numbers for these or other key software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The learning rate is set to 1 × 10⁻⁵, and the model is trained for 10 epochs on 4 RTX 3090 (24G) GPUs. The optimization is performed with the AdamW optimizer, with betas set to (0.9, 0.999) and epsilon set to 1e-08. The learning rate scheduler is cosine, with a warmup ratio of 0.1. The training is conducted on 4 RTX 3090 GPUs, with a total batch size of 4 for both training and evaluation. |
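The learning-rate schedule reported in the Experiment Setup row (base lr 1 × 10⁻⁵, cosine decay, warmup ratio 0.1 over 10 epochs) can be sketched in plain Python. This is a minimal illustration, not the authors' training code; `STEPS_PER_EPOCH` is an assumed placeholder, since the paper does not state the number of optimizer steps per epoch.

```python
import math

# Reported hyperparameters (see the Experiment Setup row above).
BASE_LR = 1e-5
EPOCHS = 10
STEPS_PER_EPOCH = 100                    # assumed for illustration; not stated in the paper
TOTAL_STEPS = EPOCHS * STEPS_PER_EPOCH
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)    # warmup ratio 0.1

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under this sketch the learning rate rises linearly to 1 × 10⁻⁵ during the first 10% of training, then follows a half-cosine down to zero at the final step.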