Unbiased Evaluation of Large Language Models from a Causal Perspective

Authors: Meilin Chen, Jian Tian, Liang Ma, Di Xie, Weijie Chen, Jiang Zhu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only offers strong evidence of benchmark contamination but also provides interpretable evaluation results.
Researcher Affiliation | Industry | Hikvision Research Institute. Correspondence to: Meilin Chen <EMAIL>, Liang Ma <EMAIL>.
Pseudocode | No | The paper describes its methods in natural language and mathematical notation, but presents no structured pseudocode or explicitly labeled algorithm blocks.
Open Source Code | No | The paper states: "We build our method on the widely-used Open Compass (Contributors, 2023) evaluation framework." This refers to using an existing framework, not to releasing the authors' own source code for the method described in the paper.
Open Datasets | Yes | Following (Zhu et al., 2024), we evaluate on two widely used benchmarks for multiple-choice questions: ARC-Challenge (ARC-C) (Clark et al., 2018) and MMLU (Hendrycks et al., 2021). For mathematical problem-solving, we utilize the GSM8K dataset (Cobbe et al., 2021).
Dataset Splits | No | The paper notes that "All evaluations are conducted in a 5-shot setting", which describes the few-shot prompting context rather than explicit train/validation/test splits. Although an ablation study refers to the "MMLU test set", the paper provides no split percentages, sample counts, or instructions for partitioning the datasets used across the experiments.
Hardware Specification | No | The paper provides no details about the hardware used for the experiments (e.g., GPU models, CPU specifications, memory, or cloud instance configurations).
Software Dependencies | No | The paper states: "We build our method on the widely-used Open Compass (Contributors, 2023) evaluation framework." It names this framework but gives no version numbers for Open Compass or for any other libraries or programming languages, which a reproducible description of ancillary software would require.
Experiment Setup | Yes | Following the approach in (Zhu et al., 2024), we set the generation temperature to 0 for all models and cap the output length at a maximum of 1000 tokens. All evaluations are conducted in a 5-shot setting, with results averaged over 5 runs. Moreover, the interventions in BOAT are not randomly combined but follow specific constraints: we regulate the probability of each intervention to ensure balance; when applying a binary transformation to questions, modifications involving phrases such as "which" or "following" were excluded; and during the Answer Removal process, we ensured that the answers extracted from different questions were not identical. For additional details, please refer to Appendix C.
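The decoding settings quoted in the Experiment Setup row can be sketched as a minimal configuration. This is an illustrative sketch only: the names `make_gen_config` and `average_over_runs` are hypothetical and do not come from the paper or from the Open Compass framework.

```python
def make_gen_config():
    """Decoding settings quoted from the paper's setup: temperature 0
    (deterministic), output capped at 1000 tokens, 5-shot prompting.
    Key names here are illustrative, not Open Compass API fields."""
    return {"temperature": 0.0, "max_out_len": 1000, "num_fewshot": 5}


def average_over_runs(scores):
    """Average a metric over repeated evaluation runs
    (the paper averages results over 5 runs)."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    cfg = make_gen_config()
    # Example: averaging five hypothetical accuracy scores from 5 runs.
    acc = average_over_runs([0.70, 0.72, 0.71, 0.69, 0.73])
    print(cfg, round(acc, 4))
```

With temperature fixed at 0, run-to-run variance would come only from the constrained random interventions (BOAT) described above, which is presumably why averaging over 5 runs is still needed.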