RocketEval: Efficient automated LLM evaluation via grading checklist
Authors: Tianjun Wei, Wei Wen, Ruizhi Qiao, Xing Sun, Jianghong Ma
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments carried out on the automated evaluation benchmarks, MT-BENCH and WILDBENCH datasets, reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, which is comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold for large-scale evaluation and comparison scenarios. |
| Researcher Affiliation | Collaboration | (1) City University of Hong Kong, (2) Tencent Youtu Lab, (3) Harbin Institute of Technology, Shenzhen. |
| Pseudocode | No | The paper describes the methodology using text and mathematical formulas in Section 3, but no clearly labeled 'Pseudocode' or 'Algorithm' block is present. |
| Open Source Code | Yes | Our code is available at https://github.com/Joinn99/RocketEval-ICLR. |
| Open Datasets | Yes | Our experiments carried out on the automated evaluation benchmarks, MT-BENCH and WILDBENCH datasets, reveal that RocketEval... We selected two benchmark datasets for our experiments: MT-BENCH (Zheng et al., 2023) and WILDBENCH (Lin et al., 2025). |
| Dataset Splits | No | The paper evaluates on established benchmarks like MT-BENCH and WILDBENCH, which have their own evaluation sets. For the supervised prediction, it mentions using 'a limited number of annotations' but does not specify how these annotations are split into training, validation, or test sets for the predictor itself, beyond the implicit use of the benchmarks' structure. |
| Hardware Specification | Yes | For open-source LLMs, we deploy them on NVIDIA RTX A5000 GPUs using vLLM (Kwon et al., 2023)... Table 4 (RocketEval): Llama-3-70B-AWQ on 4 x A5000; Llama-3-8B on 1 x A5000; Gemma-2-2B on 1 x A5000; Qwen2.5-1.5B on 1 x A5000. |
| Software Dependencies | No | For open-source LLMs, we deploy them on NVIDIA RTX A5000 GPUs using vLLM (Kwon et al., 2023)... The paper mentions 'vLLM' as a tool but does not provide a specific version number for it or any other software libraries used. |
| Experiment Setup | Yes | As illustrated in Figure 2, RocketEval operates through a three-stage framework to generate evaluations. Initially, an instance-level checklist is created... Subsequently, lightweight LLMs assess the quality of responses for each checklist item independently... Finally, the evaluations for each item are collected to derive the final score... The prompts used are shown in Appendix A.1. |
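
The three-stage framework quoted in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `ChecklistItem`, `build_checklist`, `grade_items`, `aggregate`, and `evaluate` are hypothetical, the checklist creator and the judge are stand-in callables rather than real LLM calls, and the plain mean used in stage 3 is an assumption (the excerpt only says the item evaluations are "collected to derive the final score").

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class ChecklistItem:
    """One yes/no question about a response (hypothetical structure)."""
    question: str


def build_checklist(instruction: str,
                    make_items: Callable[[str], List[str]]) -> List[ChecklistItem]:
    """Stage 1: create an instance-level checklist for a single instruction.

    `make_items` stands in for whatever model writes the checklist.
    """
    return [ChecklistItem(q) for q in make_items(instruction)]


def grade_items(response: str,
                checklist: List[ChecklistItem],
                judge: Callable[[str, str], float]) -> List[float]:
    """Stage 2: grade each checklist item independently.

    `judge` stands in for a lightweight LLM and is assumed to return a
    value in [0, 1] for one (question, response) pair.
    """
    return [judge(item.question, response) for item in checklist]


def aggregate(item_scores: List[float]) -> float:
    """Stage 3: combine per-item grades into one score.

    Assumption: a plain mean, used here only for illustration.
    """
    if not item_scores:
        raise ValueError("checklist produced no item scores")
    return mean(item_scores)


def evaluate(instruction: str, response: str,
             make_items: Callable[[str], List[str]],
             judge: Callable[[str, str], float]) -> float:
    """Run the three stages end to end for one instruction/response pair."""
    checklist = build_checklist(instruction, make_items)
    return aggregate(grade_items(response, checklist, judge))


# Toy stand-ins so the sketch runs without any model.
def toy_make_items(instruction: str) -> List[str]:
    return ["Does the response mention Paris?", "Is the response non-empty?"]


def toy_judge(question: str, response: str) -> float:
    if "Paris" in question:
        return 1.0 if "Paris" in response else 0.0
    return 1.0 if response.strip() else 0.0


good = evaluate("What is the capital of France?", "The capital is Paris.",
                toy_make_items, toy_judge)
weak = evaluate("What is the capital of France?", "I am not sure.",
                toy_make_items, toy_judge)
```

Because each item is graded on its own, the stage 2 calls do not depend on one another, which is what lets a small judge model handle them one question at a time.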