Reliable and Efficient Amortized Model-based Evaluation

Authors: Sang T. Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo

ICML 2025

Reproducibility (Variable: Result, followed by the LLM response)
Research Type: Experimental
"Experiments on 22 common natural language benchmarks and 183 LMs show that this approach is more reliable and efficient compared to the current common practice."
Researcher Affiliation: Collaboration
"Stanford University; University of California, Berkeley; Virtue AI; University of Illinois Urbana-Champaign. Correspondence to: Sang Truong <EMAIL>."
Pseudocode: No
"The paper describes methods using mathematical equations and prose (e.g., in Sections 3 and 4) but does not include any distinct, structured pseudocode blocks or algorithms labeled as such."
Open Source Code: Yes
"Code: github.com/sangttruong/reeval. Adaptive testing on 22 datasets has been integrated into HELM: crfm-helm.readthedocs.io/en/latest/reeval/"
Open Datasets: Yes
"We use 22 datasets from 5 HELM repositories: Classic, Lite, AIR-Bench, Thai Exam, and MMLU, covering both capability and safety measurements, with 183 test takers and 78,712 questions. ... Table 1: Number of test takers and questions in each benchmark.
Dataset Name | Number of Test Takers | Number of Questions | Citation
air bench 2024 | 41 | 4985 | (Zeng et al., 2024)
babi qa | 70 | 3461 | (Weston et al., 2015)
..."
Dataset Splits: Yes
"We randomly mask out 20% of the non-missing elements in the response matrix as the test set, such that the resulting response matrix has no row or column with identical responses, to ensure numerical stability. The unmasked data is used for model fitting. When appropriate, we also partition the train and test sets by questions or test takers (e.g., when we need to assess the difficulty prediction model's generalizability to new questions). ... Among FLOPS-based models, we allocate 80% for training and 20% for validation."
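The masking procedure quoted above can be sketched as follows. This is a hypothetical helper (not the paper's code), assuming a binary response matrix with NaN marking missing entries; the paper's extra constraint that no resulting row or column be constant would require re-sampling and is noted only in a comment here.

```python
import numpy as np

def mask_test_set(R, frac=0.2, seed=0):
    """Randomly mask a fraction of the non-missing entries of a
    response matrix R (NaN marks missing) to form a held-out test set.
    Returns the masked training matrix and a boolean test mask.
    Note: the paper additionally re-checks that no row or column of the
    resulting matrix has identical responses; omitted here for brevity."""
    rng = np.random.default_rng(seed)
    rows, cols = np.where(~np.isnan(R))        # indices of observed entries
    n_test = int(frac * len(rows))
    pick = rng.choice(len(rows), size=n_test, replace=False)
    train = R.copy()
    train[rows[pick], cols[pick]] = np.nan     # hide test entries from fitting
    test_mask = np.zeros_like(R, dtype=bool)
    test_mask[rows[pick], cols[pick]] = True
    return train, test_mask
```

Keeping the mask as a boolean matrix makes it easy to score predictions later with `pred[test_mask]` against the original `R[test_mask]`.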
Hardware Specification: No
"The paper mentions 'many high-performance computers' in a general sense regarding LM evaluation, but does not specify any particular GPU models, CPU models, or other hardware used for the authors' experiments."
Software Dependencies: No
"The paper mentions specific models such as Llama3.1-8B-Instruct, techniques such as SFT and PPO with their parameters, and the L-BFGS optimizer, but does not provide version numbers for any underlying software packages, libraries, or programming languages used to implement the experiments."
Experiment Setup: Yes
"Performance is averaged over 10-fold cross-validation, and the L-BFGS optimizer is used to fit IRT models. ... We fine-tune Llama3.1-8B-Instruct with SFT on all dataset questions for one epoch using lr = 0.0001, a cosine scheduler (warmup ratio = 0.1), and LoRA (α = 16, rank = 8, dropout = 0.1). We fine-tune the model using PPO with LoRA (α = 128, rank = 64, dropout = 0.1), maintaining the SFT input format. Training spans 4 epochs on 25,000 inputs (1,000 per dataset) with batch size 2 and lr = 1.0e-5. During inference, we use a temperature of 0.6, top-p of 0.9, and a maximum of 256 tokens."
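To illustrate the kind of IRT fitting the setup describes, here is a minimal sketch of maximum-likelihood estimation for a one-parameter (Rasch) model using SciPy's L-BFGS-B optimizer. `fit_rasch` is a hypothetical helper, not the authors' implementation; the paper fits richer IRT models with cross-validation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_rasch(R, seed=0):
    """Fit a Rasch (1PL) IRT model by maximum likelihood with L-BFGS.
    R: binary response matrix (test takers x questions), NaN = missing.
    Model: P(correct) = sigmoid(theta_i - b_j)."""
    n, m = R.shape
    obs = ~np.isnan(R)
    y = R[obs]
    i_idx, j_idx = np.where(obs)

    def nll_and_grad(params):
        theta, b = params[:n], params[n:]
        p = expit(theta[i_idx] - b[j_idx])
        eps = 1e-9  # guard against log(0)
        loss = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        resid = p - y                     # d(loss)/d(theta_i - b_j) per entry
        g_theta = np.bincount(i_idx, weights=resid, minlength=n)
        g_b = np.bincount(j_idx, weights=-resid, minlength=m)
        return loss, np.concatenate([g_theta, g_b])

    x0 = np.random.default_rng(seed).normal(scale=0.1, size=n + m)
    res = minimize(nll_and_grad, x0, jac=True, method="L-BFGS-B")
    return res.x[:n], res.x[n:]  # abilities, difficulties
```

Note the Rasch model is identified only up to a shift (adding a constant to all abilities and difficulties leaves the likelihood unchanged), so recovered parameters should be compared to ground truth up to translation.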