metabench - A Sparse Benchmark of Reasoning and Knowledge in Large Language Models

Authors: Alex Kipnis, Konstantinos Voudouris, Luca Schulze Buschoff, Eric Schulz

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We use data from n > 5000 LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items in total). From them we distill a sparse benchmark, metabench, that has less than 3% of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by yielding estimators of the underlying benchmark-specific abilities. We show that these estimators (1) can be used to reconstruct each original individual benchmark score with, on average, 1.24% root mean square error (RMSE), (2) reconstruct the original total score with 0.58% RMSE, and (3) have a single underlying common factor whose Spearman correlation with the total score is r = 0.94.
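The reconstruction errors quoted above are root mean square errors on the percent score scale. A minimal sketch of that metric (a Python illustration with invented toy scores, not the paper's R code):

```python
import numpy as np

def rmse(estimate, target):
    """Root mean square error between estimated and true scores (percent scale)."""
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((estimate - target) ** 2)))

# Toy example: reconstructed vs. original benchmark scores for five LLMs.
original = [62.0, 71.5, 48.3, 85.2, 90.1]
reconstructed = [63.1, 70.4, 49.0, 84.8, 91.0]
error = rmse(reconstructed, original)  # sub-percent error on this toy data
```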
Researcher Affiliation | Academia | Alex Kipnis¹, Konstantinos Voudouris¹,², Luca M. Schulze Buschoff¹, Eric Schulz¹. ¹Human-Centered AI, Helmholtz Munich; ²University of Cambridge. Correspondence to EMAIL
Pseudocode | No | The paper describes procedures in numbered lists within the text, such as in Section 2.2 for cross-validated subsampling (e.g., '1. Uniformly sample k items and calculate the subtest scores s_j for each subject j.'), but these are not formatted as distinct pseudocode or algorithm blocks.
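The quoted subsampling step can be sketched as follows. This is an illustrative Python version (the paper's analyses are in R); the accuracy matrix and function name are invented for the example:

```python
import numpy as np

def subsample_scores(accuracy, k, rng):
    """Uniformly sample k items (columns) without replacement and return each
    subject's subtest score s_j on a percent scale.

    `accuracy` is a (subjects x items) 0/1 matrix of item-wise correctness.
    """
    accuracy = np.asarray(accuracy, dtype=float)
    items = rng.choice(accuracy.shape[1], size=k, replace=False)
    return accuracy[:, items].mean(axis=1) * 100.0

# Toy data: 4 subjects answering 10 items.
rng = np.random.default_rng(0)
acc = rng.integers(0, 2, size=(4, 10))
scores = subsample_scores(acc, k=5, rng=rng)
```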
Open Source Code | Yes | All reported analyses and results can be viewed on www.github.com/adkipnis/metabench and our benchmark itself is available on www.huggingface.co/datasets/HCAI/metabench.
Open Datasets | Yes | We collected openly available item-wise accuracies from Hugging Face Datasets for the six benchmarks that are part of the Open LLM Leaderboard (Beeching et al., 2023)... our benchmark itself is available on www.huggingface.co/datasets/HCAI/metabench.
Dataset Splits | Yes | For comparability across benchmarks, we conducted a train-test-validation split of the LLMs using the caret package (Kuhn & Max, 2008) in the following manner: For stratification, we calculated the grand average of the original benchmark scores for LLMs that ran on all six benchmarks, and used it to create a stratified 10% split as the global test set. For cross-validation per benchmark, we split off a further 10% subset of the LLMs as a local validation set, this time stratifying by the specific benchmark score.
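A caret-style stratified split on a continuous outcome is typically done by binning the score into quantiles and holding out a fixed fraction within each bin. A minimal Python sketch under that assumption, with toy grand-average scores (the paper itself used caret in R):

```python
import numpy as np

def stratified_split(scores, test_frac, rng):
    """Hold out ~test_frac of subjects from each score quartile, so the test
    set preserves the distribution of the continuous stratification score."""
    scores = np.asarray(scores, dtype=float)
    bins = np.digitize(scores, np.quantile(scores, [0.25, 0.5, 0.75]))
    test = []
    for b in np.unique(bins):
        members = np.flatnonzero(bins == b)
        n_test = max(1, round(test_frac * len(members)))
        test.extend(rng.choice(members, size=n_test, replace=False))
    test = np.sort(np.array(test))
    train = np.setdiff1d(np.arange(len(scores)), test)
    return train, test

# Toy grand-average scores for 1000 LLMs; 10% stratified global test set.
rng = np.random.default_rng(0)
grand_avg = rng.uniform(20, 95, size=1000)
train_idx, test_idx = stratified_split(grand_avg, test_frac=0.10, rng=rng)
```

A further 10% validation split per benchmark would be taken from the remaining training subjects in the same way, stratified by that benchmark's score.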
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or processor types) used for running the analyses or experiments.
Software Dependencies | Yes | All analyses were performed in R 4.4.0 (R Core Team, 2024)... we conducted a train-test-validation split of the LLMs using the caret package (Kuhn & Max, 2008)... GAMs were fitted using the mgcv package (Wood, 2017)... We used the mirt package (Chalmers, 2012)... We used the rBayesianOptimization package (Yan, 2024)... We applied factor analysis to the estimated latent abilities using the psych package (Revelle, 2023)... integrated in EleutherAI's lm-evaluation-harness (Gao et al., 2024)... We used the catR package (Magis & Raîche, 2012) to simulate Computerized Adaptive Tests (CATs)...
Experiment Setup | Yes | For comparability across benchmarks, we normalized the benchmark scores on a percent scale. Per benchmark, we discarded subjects with the lowest 0.1% of scores... Items with standard deviations below 1% (too little variability) and mean accuracies above 95% (too easy) were removed. We removed items with r_pbis ≤ 0. We subsampled all six benchmarks to 350 items. We used one-dimensional ϑ_j... We used the mirt package (Chalmers, 2012) with default options to fit the following IRT models to preselected subsets of 350 items each: a 2PL model, a 3PL model with an additional lower asymptote parameter, and a 4PL model with an additional upper asymptote parameter. We separately used the maximum a posteriori (MAP) method as well as the expected a posteriori sum (EAPsum) method to estimate each benchmark-specific one-dimensional latent ability... We used the rBayesianOptimization package (Yan, 2024) with default settings to fine-tune the hyperparameters (µ, ν, τ) on the validation set to minimize the term RMSE(ϑ, s) + λk... We tested λ ∈ {0.01, 0.005, 0.001} and found 0.005 to yield a good trade-off between size and accuracy. Fine-tuning was done using unsloth on 4-bit quantized instruct models.
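The item-filtering rules quoted above (too little variability, too easy, non-positive point-biserial correlation) can be sketched in Python. This is an illustrative assumption, not the paper's R implementation; in particular, the rest-score variant of the point-biserial correlation and all names are choices made for this example:

```python
import numpy as np

def filter_items(acc):
    """Return indices of items kept from a (subjects x items) 0/1 accuracy matrix.

    Drops items with standard deviation < 1% (too little variability), mean
    accuracy > 95% (too easy), or a non-positive point-biserial correlation
    between the item and the subjects' rest-score (total minus the item).
    """
    acc = np.asarray(acc, dtype=float)
    total = acc.sum(axis=1)
    keep = []
    for i in range(acc.shape[1]):
        item = acc[:, i]
        if item.std() < 0.01 or item.mean() > 0.95:
            continue
        rest = total - item  # rest-score excludes the item itself
        r_pbis = np.corrcoef(item, rest)[0, 1]
        if not np.isfinite(r_pbis) or r_pbis <= 0:
            continue
        keep.append(i)
    return np.array(keep)

# Toy check: item 1 is answered correctly by everyone and gets dropped,
# while items 0 and 2 vary and correlate positively with the rest-score.
demo = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 1],
                 [1, 1, 1], [1, 1, 1], [1, 1, 1]])
kept = filter_items(demo)
```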