RocketEval: Efficient automated LLM evaluation via grading checklist

Authors: Tianjun Wei, Wei Wen, Ruizhi Qiao, Xing Sun, Jianghong Ma

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments, carried out on the automated evaluation benchmarks MT-BENCH and WILDBENCH, reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold in large-scale evaluation and comparison scenarios.
Researcher Affiliation | Collaboration | 1. City University of Hong Kong; 2. Tencent Youtu Lab; 3. Harbin Institute of Technology, Shenzhen.
Pseudocode | No | The paper describes the methodology using text and mathematical formulas in Section 3, but no clearly labeled 'Pseudocode' or 'Algorithm' block is present.
Open Source Code | Yes | Our code is available at https://github.com/Joinn99/RocketEval-ICLR.
Open Datasets | Yes | Our experiments, carried out on the automated evaluation benchmarks MT-BENCH and WILDBENCH, reveal that RocketEval... We selected two benchmark datasets for our experiments: MT-BENCH (Zheng et al., 2023) and WILDBENCH (Lin et al., 2025).
Dataset Splits | No | The paper evaluates on established benchmarks (MT-BENCH and WILDBENCH) that have their own evaluation sets. For the supervised score predictor, it mentions using 'a limited number of annotations' but does not specify how these annotations are split into training, validation, or test sets, beyond the implicit use of the benchmarks' structure.
Hardware Specification | Yes | For open-source LLMs, we deploy them on NVIDIA RTX A5000 GPUs using vLLM (Kwon et al., 2023)... Table 4 (RocketEval judges): Llama-3-70B-AWQ on 4 × A5000; Llama-3-8B on 1 × A5000; Gemma-2-2B on 1 × A5000; Qwen2.5-1.5B on 1 × A5000.
Software Dependencies | No | For open-source LLMs, we deploy them on NVIDIA RTX A5000 GPUs using vLLM (Kwon et al., 2023)... The paper mentions vLLM as a tool but does not provide a specific version number for it or for any other software libraries used.
Experiment Setup | Yes | As illustrated in Figure 2, RocketEval operates through a three-stage framework to generate evaluations. Initially, an instance-level checklist is created... Subsequently, lightweight LLMs assess the quality of responses for each checklist item independently... Finally, the evaluations for each item are collected to derive the final score... The prompts used are shown in Appendix A.1.
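The third stage described above (collecting per-item evaluations into a final score) can be sketched minimally. The paper learns how to combine checklist items from a limited number of annotations; the uniform/hand-set weights below are hypothetical placeholders for illustration, not the paper's learned predictor:

```python
# Hedged sketch of the aggregation stage: per-checklist-item judgments from a
# lightweight judge are combined into one response score. The weighting scheme
# here is a placeholder assumption, not RocketEval's learned aggregation.

def aggregate(item_judgments, item_weights=None):
    """Combine checklist judgments (1.0 = item satisfied, 0.0 = not)
    into a single score in [0, 1] via a normalized weighted sum."""
    n = len(item_judgments)
    if item_weights is None:
        item_weights = [1.0] * n  # fall back to uniform weights
    total = sum(item_weights)
    return sum(w * j for w, j in zip(item_weights, item_judgments)) / total

# Hypothetical checklist for one response: 4 items, the judge passes 3.
judgments = [1.0, 1.0, 0.0, 1.0]
weights = [2.0, 1.0, 1.0, 1.0]  # e.g., the first item is deemed more important
print(aggregate(judgments, weights))  # → 0.8
```

Decomposing grading into independent yes/no checklist items is what lets a small judge like Gemma-2-2B stay reliable: each item is a narrow, verifiable question rather than an open-ended quality rating.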