GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning
Authors: Zhun Mou, Bin Xia, Zhengchao Huang, Wenming Yang, Jiaya Jia
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. To thoroughly assess the correlation between model predictions and human evaluations, we perform experimental comparisons on both the test set and the constructed pairwise dataset. |
| Researcher Affiliation | Academia | (1) Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; (2) CSE Department, The Chinese University of Hong Kong, Hong Kong, China; (3) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. |
| Pseudocode | No | The paper describes the data construction pipeline and evaluation process in prose (Sections 3.2 and 3.3) and includes mathematical formulas, but does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | In Sec. 3.2, we introduce our data construction pipeline that collects triples (video, rationale, score) from human annotators to create video-instruction data, which teaches MLLMs to assess specific aspects of generated videos through reasoning. We then propose our GRADEO-Instruct dataset, which we plan to release, in Sec. 3.2. |
| Open Datasets | Yes | Prompt Collection To construct a comprehensive prompt dataset for text-to-video evaluation, we ensured broad coverage across the dimensions of Quality, Aesthetics, Consistency, and Alignment. This was achieved by collecting and sampling from the large-scale text-to-video pretraining dataset WebVid (Bain et al., 2021), the real user-generated prompt dataset VidProM (Wang & Yang, 2024), and existing open-source text-to-video benchmark datasets (Liu et al., 2024b; Huang et al., 2024; Sun et al., 2024; Feng et al., 2024; Liu et al., 2024c; Kou et al., 2024). Additionally, the Safety dimension prompts are curated from the specialized safety-focused dataset SafeSora (Dai et al., 2024). |
| Dataset Splits | Yes | Fine-tuning on the train split of the GRADEO-Instruct dataset we collect in Sec. 3.2, we develop an MLLM evaluator called GRADEO. We extract 340 samples from the collected dataset GRADEO-Instruct to serve as the test set. |
| Hardware Specification | Yes | The learning rate is set to 1 × 10⁻⁵, and the model is trained for 10 epochs on 4 RTX 3090 (24G) GPUs. Except for OpenSora, which runs inference on 2 RTX 3090 GPUs, and Kling, which generates videos on the web, all other models perform inference on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using Qwen2-VL-7B as the base model and applying the LoRA fine-tuning method, but does not provide specific version numbers for these or other key software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The learning rate is set to 1 × 10⁻⁵, and the model is trained for 10 epochs on 4 RTX 3090 (24G) GPUs. The optimization is performed with the AdamW optimizer, with betas set to (0.9, 0.999) and epsilon set to 1e-08. The learning rate scheduler is cosine, with a warmup ratio of 0.1. The training is conducted on 4 RTX 3090 GPUs, with a total batch size of 4 for both training and evaluation. |
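The learning-rate schedule reported in the Experiment Setup row (base lr 1 × 10⁻⁵, cosine decay, warmup ratio 0.1 over 10 epochs) can be sketched in plain Python. This is a minimal illustration, not the authors' training code; `STEPS_PER_EPOCH` is an assumed placeholder, since the paper does not state the number of optimizer steps per epoch.

```python
import math

# Reported hyperparameters (see the Experiment Setup row above).
BASE_LR = 1e-5
EPOCHS = 10
STEPS_PER_EPOCH = 100                    # assumed for illustration; not stated in the paper
TOTAL_STEPS = EPOCHS * STEPS_PER_EPOCH
WARMUP_STEPS = int(0.1 * TOTAL_STEPS)    # warmup ratio 0.1

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under this sketch the learning rate rises linearly to 1 × 10⁻⁵ during the first 10% of training, then follows a half-cosine down to zero at the final step.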