metabench - A Sparse Benchmark of Reasoning and Knowledge in Large Language Models

Authors: Alex Kipnis, Konstantinos Voudouris, Luca Schulze Buschoff, Eric Schulz

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We use data from n > 5000 LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items in total). From them we distill a sparse benchmark, metabench, that has less than 3% of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by yielding estimators of the underlying benchmark-specific abilities. We show that these estimators (1) can be used to reconstruct each original individual benchmark score with, on average, 1.24% root mean square error (RMSE), (2) reconstruct the original total score with 0.58% RMSE, and (3) have a single underlying common factor whose Spearman correlation with the total score is r = 0.94.
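The reconstruction errors quoted above are root mean square errors on the percent score scale. A minimal sketch of that metric (a Python illustration with invented toy scores, not the paper's R code):

```python
import numpy as np

def rmse(estimate, target):
    """Root mean square error between estimated and true scores (percent scale)."""
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((estimate - target) ** 2)))

# Toy example: reconstructed vs. original benchmark scores for five LLMs.
original = [62.0, 71.5, 48.3, 85.2, 90.1]
reconstructed = [63.1, 70.4, 49.0, 84.8, 91.0]
error = rmse(reconstructed, original)  # sub-percent error on this toy data
```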
Researcher Affiliation | Academia | Alex Kipnis¹, Konstantinos Voudouris¹,², Luca M. Schulze Buschoff¹, Eric Schulz¹. ¹Human-Centered AI, Helmholtz Munich; ²University of Cambridge. Correspondence to EMAIL
Pseudocode | No | The paper describes procedures in numbered lists within the text, such as in Section 2.2 for cross-validated subsampling (e.g., '1. Uniformly sample k items and calculate the subtest scores s_j for each subject j.'), but these are not formatted as distinct pseudocode or algorithm blocks.
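The quoted subsampling step can be sketched as follows. This is an illustrative Python version (the paper's analyses are in R); the accuracy matrix and function name are invented for the example:

```python
import numpy as np

def subsample_scores(accuracy, k, rng):
    """Uniformly sample k items (columns) without replacement and return each
    subject's subtest score s_j on a percent scale.

    `accuracy` is a (subjects x items) 0/1 matrix of item-wise correctness.
    """
    accuracy = np.asarray(accuracy, dtype=float)
    items = rng.choice(accuracy.shape[1], size=k, replace=False)
    return accuracy[:, items].mean(axis=1) * 100.0

# Toy data: 4 subjects answering 10 items.
rng = np.random.default_rng(0)
acc = rng.integers(0, 2, size=(4, 10))
scores = subsample_scores(acc, k=5, rng=rng)
```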
Open Source Code | Yes | All reported analyses and results can be viewed on www.github.com/adkipnis/metabench and our benchmark itself is available on www.huggingface.co/datasets/HCAI/metabench.
Open Datasets | Yes | We collected openly available item-wise accuracies from Hugging Face Datasets for the six benchmarks that are part of the Open LLM Leaderboard (Beeching et al., 2023)... our benchmark itself is available on www.huggingface.co/datasets/HCAI/metabench.
Dataset Splits | Yes | For comparability across benchmarks, we conducted a train-test-validation split of the LLMs using the caret package (Kuhn & Max, 2008) in the following manner: For stratification, we calculated the grand average of the original benchmark scores for LLMs that ran on all six benchmarks, and used it to create a stratified 10% split as the global test set. For cross-validation per benchmark, we split off a further 10% subset of the LLMs as a local validation set, this time stratifying by the specific benchmark score.
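A caret-style stratified split on a continuous outcome is typically done by binning the score into quantiles and holding out a fixed fraction within each bin. A minimal Python sketch under that assumption, with toy grand-average scores (the paper itself used caret in R):

```python
import numpy as np

def stratified_split(scores, test_frac, rng):
    """Hold out ~test_frac of subjects from each score quartile, so the test
    set preserves the distribution of the continuous stratification score."""
    scores = np.asarray(scores, dtype=float)
    bins = np.digitize(scores, np.quantile(scores, [0.25, 0.5, 0.75]))
    test = []
    for b in np.unique(bins):
        members = np.flatnonzero(bins == b)
        n_test = max(1, round(test_frac * len(members)))
        test.extend(rng.choice(members, size=n_test, replace=False))
    test = np.sort(np.array(test))
    train = np.setdiff1d(np.arange(len(scores)), test)
    return train, test

# Toy grand-average scores for 1000 LLMs; 10% stratified global test set.
rng = np.random.default_rng(0)
grand_avg = rng.uniform(20, 95, size=1000)
train_idx, test_idx = stratified_split(grand_avg, test_frac=0.10, rng=rng)
```

A further 10% validation split per benchmark would be taken from the remaining training subjects in the same way, stratified by that benchmark's score.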
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or processor types) used for running the analyses or experiments.
Software Dependencies | Yes | All analyses were performed in R 4.4.0 (R Core Team, 2024)... we conducted a train-test-validation split of the LLMs using the caret package (Kuhn & Max, 2008)... GAMs were fitted using the mgcv package (Wood, 2017)... We used the mirt package (Chalmers, 2012)... We used the rBayesianOptimization package (Yan, 2024)... We applied factor analysis to the estimated latent abilities using the psych package (Revelle, 2023)... integrated in EleutherAI's lm-evaluation-harness (Gao et al., 2024)... We used the catR package (Magis & Raîche, 2012) to simulate Computerized Adaptive Tests (CATs)...
Experiment Setup | Yes | For comparability across benchmarks, we normalized the benchmark scores on a percent scale. Per benchmark, we discarded subjects with the lowest 0.1% of scores... Items with standard deviations below 1% (too little variability) and mean accuracies above 95% (too easy) were removed. We removed items with r_pbis ≤ 0. We subsampled all six benchmarks to 350 items. We used one-dimensional ϑ_j... We used the mirt package (Chalmers, 2012) with default options to fit the following IRT models to preselected subsets of 350 items each: a 2PL model, a 3PL model with an additional lower asymptote parameter, and a 4PL model with an additional upper asymptote parameter. We separately used the maximum a posteriori (MAP) method as well as the expected a posteriori sum (EAPsum) method to estimate each benchmark-specific one-dimensional latent ability... We used the rBayesianOptimization package (Yan, 2024) with default settings to fine-tune the hyperparameters (µ, ν, τ) on the validation set to minimize the term RMSE(ϑ, s) + λk... We tested λ ∈ {0.01, 0.005, 0.001} and found 0.005 to yield a good trade-off between size and accuracy. Fine-tuning was done using unsloth on 4-bit quantized instruct models.
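The item-filtering rules quoted above (too little variability, too easy, non-positive point-biserial correlation) can be sketched in Python. This is an illustrative assumption, not the paper's R implementation; in particular, the rest-score variant of the point-biserial correlation and all names are choices made for this example:

```python
import numpy as np

def filter_items(acc):
    """Return indices of items kept from a (subjects x items) 0/1 accuracy matrix.

    Drops items with standard deviation < 1% (too little variability), mean
    accuracy > 95% (too easy), or a non-positive point-biserial correlation
    between the item and the subjects' rest-score (total minus the item).
    """
    acc = np.asarray(acc, dtype=float)
    total = acc.sum(axis=1)
    keep = []
    for i in range(acc.shape[1]):
        item = acc[:, i]
        if item.std() < 0.01 or item.mean() > 0.95:
            continue
        rest = total - item  # rest-score excludes the item itself
        r_pbis = np.corrcoef(item, rest)[0, 1]
        if not np.isfinite(r_pbis) or r_pbis <= 0:
            continue
        keep.append(i)
    return np.array(keep)

# Toy check: item 1 is answered correctly by everyone and gets dropped,
# while items 0 and 2 vary and correlate positively with the rest-score.
demo = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 1],
                 [1, 1, 1], [1, 1, 1], [1, 1, 1]])
kept = filter_items(demo)
```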