Unbiased Evaluation of Large Language Models from a Causal Perspective

Authors: Meilin Chen, Jian Tian, Liang Ma, Di Xie, Weijie Chen, Jiang Zhu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results.
Researcher Affiliation | Industry | Hikvision Research Institute. Correspondence to: Meilin Chen <EMAIL>, Liang Ma <EMAIL>.
Pseudocode | No | The paper describes its methods in natural language and mathematical notation, but presents no structured pseudocode or explicitly labeled algorithm blocks.
Open Source Code | No | The paper states: "We build our method on the widely-used Open Compass (Contributors, 2023) evaluation framework." This refers to using an existing framework, not to releasing the authors' own source code for the method described in the paper.
Open Datasets | Yes | Following (Zhu et al., 2024), we evaluate on two widely used benchmarks for multiple-choice questions: ARC-Challenge (ARC-C) (Clark et al., 2018) and MMLU (Hendrycks et al., 2021). For mathematical problem-solving, we utilize the GSM8K dataset (Cobbe et al., 2021).
Dataset Splits | No | The paper notes that "All evaluations are conducted in a 5-shot setting", which describes the few-shot prompting context rather than explicit train/validation/test splits. Although an ablation study refers to the "MMLU test set", the paper provides no split percentages, sample counts, or instructions for partitioning the datasets used across the experiments.
Hardware Specification | No | The paper provides no details about the hardware used for the experiments (e.g., GPU models, CPU specifications, memory, or cloud instance configurations).
Software Dependencies | No | The paper states: "We build our method on the widely-used Open Compass (Contributors, 2023) evaluation framework." It names this framework but gives no version numbers for Open Compass or for any other libraries or programming languages, which a reproducible description of ancillary software would require.
Experiment Setup | Yes | Following the approach in (Zhu et al., 2024), we set the generation temperature to 0 for all models and cap the output length at a maximum of 1000 tokens. All evaluations are conducted in a 5-shot setting, with results averaged over 5 runs. Moreover, the interventions in BOAT are not randomly combined but follow specific constraints: we regulate the probability of each intervention to ensure balance; when applying a binary transformation to questions, modifications involving phrases such as "which" or "following" were excluded; and during the Answer Removal process, we ensured that the answers extracted from different questions were not identical. For additional details, please refer to Appendix C.
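The decoding settings quoted in the Experiment Setup row can be sketched as a minimal configuration. This is an illustrative sketch only: the names `make_gen_config` and `average_over_runs` are hypothetical and do not come from the paper or from the Open Compass framework.

```python
def make_gen_config():
    """Decoding settings quoted from the paper's setup: temperature 0
    (deterministic), output capped at 1000 tokens, 5-shot prompting.
    Key names here are illustrative, not Open Compass API fields."""
    return {"temperature": 0.0, "max_out_len": 1000, "num_fewshot": 5}


def average_over_runs(scores):
    """Average a metric over repeated evaluation runs
    (the paper averages results over 5 runs)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    cfg = make_gen_config()
    # Example: averaging five hypothetical accuracy scores from 5 runs.
    acc = average_over_runs([0.70, 0.72, 0.71, 0.69, 0.73])
    print(cfg, round(acc, 4))
```

With temperature fixed at 0, run-to-run variance would come only from the constrained random interventions (BOAT) described above, which is presumably why averaging over 5 runs is still needed.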