Position: AI Evaluation Should Learn from How We Test Humans
Authors: Yan Zhuang, Qi Liu, Zachary Pardos, Patrick C. Kyllonen, Jiyun Zu, Zhenya Huang, Shijin Wang, Enhong Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. The authors argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations. In Appendix B, the paper describes a "Simulation Experiment for Ability Estimation" and a "Comparison of Rankings with Full Dataset", indicating empirical studies. |
| Researcher Affiliation | Collaboration | 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, China; 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China; 3University of California, Berkeley, USA; 4Educational Testing Service, USA; 5iFLYTEK Co., Ltd, China. Correspondence to: Qi Liu <EMAIL>. The affiliations include academic institutions (University of Science and Technology of China; University of California, Berkeley), a research organization (Educational Testing Service), and an industry company (iFLYTEK Co., Ltd), indicating a collaboration. |
| Pseudocode | No | The paper describes the theoretical framework and implementation steps for adaptive testing with equations and descriptive text, particularly in Section 3 and Appendix B, but does not present any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "Intermediate data for these experiments are also included in https://github.com/54zy/CAT4AI." (Appendix A.3) and "the complete data set is available at https://github.com/54zy/CAT4AI." (Appendix C). This refers to data availability, not the source code for the described methodology. |
| Open Datasets | Yes | The paper uses well-known, publicly available datasets such as "MATH (Hendrycks et al., 2021)", "NarrativeQA (Kočiský et al., 2018)", "RAFT (Alex et al., 2021)", "MedQA (Jin et al., 2021)", "MMLU (Hendrycks et al., 2020)", "OpenBookQA (Mihaylov et al., 2018)", and "GSM8K" (Appendix A.3, B, C). |
| Dataset Splits | Yes | In Appendix B, under "Comparison of Rankings with Full Dataset", the paper states: "We collect responses from 20 LLMs on the MATH dataset and select a subset from it for evaluation... Next, we compare the rank correlation results obtained from different evaluation methods using the same percentages of the dataset." Figure 10(b) shows results for "10% of the full benchmark", "20% of the full benchmark", etc., indicating the specific subset percentages used for evaluation. |
| Hardware Specification | No | The paper mentions "4,000 GPU hours (or $10,000 for APIs)" in the introduction, but this refers to the cost of evaluating the HELM benchmark by others, not the specific hardware used by the authors for their own experiments. No other specific hardware details (e.g., GPU models, CPU types) are provided for their experimental setup. |
| Software Dependencies | No | The paper does not explicitly state any software names with specific version numbers (e.g., programming languages, libraries, or frameworks) used for the experiments. |
| Experiment Setup | Yes | In Appendix A.4, "Illustrating Uncertainty in AI Evaluation", it states: "These 5 responses are generated using the same prompt across different sessions, with the default temperature setting of 1." This provides a specific hyperparameter setting (temperature=1) for an experimental illustration. |