RocketEval: Efficient automated LLM evaluation via grading checklist

Authors: Tianjun Wei, Wei Wen, Ruizhi Qiao, Xing Sun, Jianghong Ma

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments, carried out on the automated evaluation benchmarks MT-BENCH and WILDBENCH, reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold in large-scale evaluation and comparison scenarios.
Researcher Affiliation | Collaboration | 1. City University of Hong Kong; 2. Tencent Youtu Lab; 3. Harbin Institute of Technology, Shenzhen.
Pseudocode | No | The paper describes the methodology using text and mathematical formulas in Section 3, but no clearly labeled 'Pseudocode' or 'Algorithm' block is present.
Open Source Code | Yes | Our code is available at https://github.com/Joinn99/RocketEval-ICLR.
Open Datasets | Yes | Our experiments, carried out on the automated evaluation benchmarks MT-BENCH and WILDBENCH, reveal that RocketEval... We selected two benchmark datasets for our experiments: MT-BENCH (Zheng et al., 2023) and WILDBENCH (Lin et al., 2025).
Dataset Splits | No | The paper evaluates on established benchmarks (MT-BENCH and WILDBENCH) that have their own evaluation sets. For the supervised score predictor, it mentions using 'a limited number of annotations' but does not specify how these annotations are split into training, validation, or test sets, beyond the implicit use of the benchmarks' structure.
Hardware Specification | Yes | For open-source LLMs, we deploy them on NVIDIA RTX A5000 GPUs using vLLM (Kwon et al., 2023)... Table 4 (RocketEval judges): Llama-3-70B-AWQ on 4 × A5000; Llama-3-8B on 1 × A5000; Gemma-2-2B on 1 × A5000; Qwen2.5-1.5B on 1 × A5000.
Software Dependencies | No | For open-source LLMs, we deploy them on NVIDIA RTX A5000 GPUs using vLLM (Kwon et al., 2023)... The paper mentions vLLM as a tool but does not provide a specific version number for it or for any other software libraries used.
Experiment Setup | Yes | As illustrated in Figure 2, RocketEval operates through a three-stage framework to generate evaluations. Initially, an instance-level checklist is created... Subsequently, lightweight LLMs assess the quality of responses for each checklist item independently... Finally, the evaluations for each item are collected to derive the final score... The prompts used are shown in Appendix A.1.
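The third stage described above (collecting per-item evaluations into a final score) can be sketched minimally. The paper learns how to combine checklist items from a limited number of annotations; the uniform/hand-set weights below are hypothetical placeholders for illustration, not the paper's learned predictor:

```python
# Hedged sketch of the aggregation stage: per-checklist-item judgments from a
# lightweight judge are combined into one response score. The weighting scheme
# here is a placeholder assumption, not RocketEval's learned aggregation.

def aggregate(item_judgments, item_weights=None):
    """Combine checklist judgments (1.0 = item satisfied, 0.0 = not)
    into a single score in [0, 1] via a normalized weighted sum."""
    n = len(item_judgments)
    if item_weights is None:
        item_weights = [1.0] * n  # fall back to uniform weights
    total = sum(item_weights)
    return sum(w * j for w, j in zip(item_weights, item_judgments)) / total

# Hypothetical checklist for one response: 4 items, the judge passes 3.
judgments = [1.0, 1.0, 0.0, 1.0]
weights = [2.0, 1.0, 1.0, 1.0]  # e.g., the first item is deemed more important
print(aggregate(judgments, weights))  # → 0.8
```

Decomposing grading into independent yes/no checklist items is what lets a small judge like Gemma-2-2B stay reliable: each item is a narrow, verifiable question rather than an open-ended quality rating.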