Position: AI Evaluation Should Learn from How We Test Humans
Authors: Yan Zhuang, Qi Liu, Zachary Pardos, Patrick C. Kyllonen, Jiyun Zu, Zhenya Huang, Shijin Wang, Enhong Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. The authors argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations. In Appendix B, the paper describes a "Simulation Experiment for Ability Estimation" and a "Comparison of Rankings with Full Dataset", indicating empirical studies. |
| Researcher Affiliation | Collaboration | 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China; 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China; 3University of California, Berkeley, USA; 4Educational Testing Service, USA; 5iFLYTEK Co., Ltd, China. Correspondence to: Qi Liu <EMAIL>. The affiliations include academic institutions (University of Science and Technology of China; University of California, Berkeley), a research organization (Educational Testing Service), and an industry company (iFLYTEK Co., Ltd), indicating a collaboration. |
| Pseudocode | No | The paper describes the theoretical framework and implementation steps for adaptive testing with equations and descriptive text, particularly in Section 3 and Appendix B, but does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "Intermediate data for these experiments are also included in https://github.com/54zy/CAT4AI." (Appendix A.3) and "the complete data set is available at https://github.com/54zy/CAT4AI." (Appendix C). This refers to data availability, not the source code for the described methodology. |
| Open Datasets | Yes | The paper uses well-known, publicly available datasets such as "MATH (Hendrycks et al., 2021)", "NarrativeQA (Kočiský et al., 2018)", "RAFT (Alex et al., 2021)", "MedQA (Jin et al., 2021)", "MMLU (Hendrycks et al., 2020)", "OpenBookQA (Mihaylov et al., 2018)", and "GSM8K" (Appendix A.3, B, C). |
| Dataset Splits | Yes | In Appendix B, under "Comparison of Rankings with Full Dataset", the paper states: "We collect responses from 20 LLMs on the MATH dataset and select a subset from it for evaluation... Next, we compare the rank correlation results obtained from different evaluation methods using the same percentages of the dataset." Figure 10(b) shows results for "10% of the full benchmark", "20% of the full benchmark", etc., indicating the specific subset percentages used for evaluation. |
| Hardware Specification | No | The paper mentions "4,000 GPU hours (or $10,000 for APIs)" in the introduction, but this refers to the cost of evaluating the HELM benchmark by others, not the specific hardware used by the authors for their own experiments. No other specific hardware details (e.g., GPU models, CPU types) are provided for their experimental setup. |
| Software Dependencies | No | The paper does not explicitly state any software names with specific version numbers (e.g., programming languages, libraries, or frameworks) used for the experiments. |
| Experiment Setup | Yes | In Appendix A.4, "Illustrating Uncertainty in AI Evaluation", it states: "These 5 responses are generated using the same prompt across different sessions, with the default temperature setting of 1." This provides a specific hyperparameter setting (temperature=1) for an experimental illustration. |