metabench - A Sparse Benchmark of Reasoning and Knowledge in Large Language Models
Authors: Alex Kipnis, Konstantinos Voudouris, Luca Schulze Buschoff, Eric Schulz
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use data from n > 5000 LLMs to identify the most informative items of six benchmarks, ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande (with d = 28,632 items in total). From them we distill a sparse benchmark, metabench, that has less than 3% of the original size of all six benchmarks combined. This new sparse benchmark goes beyond point scores by yielding estimators of the underlying benchmark-specific abilities. We show that these estimators (1) can be used to reconstruct each original individual benchmark score with, on average, 1.24% root mean square error (RMSE), (2) reconstruct the original total score with 0.58% RMSE, and (3) have a single underlying common factor whose Spearman correlation with the total score is r = 0.94. |
| Researcher Affiliation | Academia | Alex Kipnis 1 Konstantinos Voudouris 1,2 Luca M. Schulze Buschoff 1 Eric Schulz 1 1 Human-Centered AI, Helmholtz Munich 2 University of Cambridge Correspondence to EMAIL |
| Pseudocode | No | The paper describes procedures in numbered lists within the text, such as in Section 2.2 for cross-validated subsampling (e.g., '1. Uniformly sample k items and calculate the subtest scores sj for each subject j.'), but these are not formatted as distinct pseudocode or algorithm blocks. |
| Open Source Code | Yes | All reported analyses and results can be viewed on www.github.com/adkipnis/metabench and our benchmark itself is available on www.huggingface.co/datasets/HCAI/metabench. |
| Open Datasets | Yes | We collected openly available item-wise accuracies from Hugging Face Datasets for the six benchmarks that are part of the Open LLM Leaderboard (Beeching et al., 2023)... our benchmark itself is available on www.huggingface.co/datasets/HCAI/metabench. |
| Dataset Splits | Yes | For comparability across benchmarks, we conducted a train-test-validation split of the LLMs using the caret package (Kuhn & Max, 2008) in the following manner: For stratification, we calculated the grand average of the original benchmark scores for LLMs that ran on all six benchmarks, and used it to create a stratified 10% split as the global test set. For cross-validation per benchmark, we split off a further 10% subset of the LLMs as a local validation set, this time stratifying by the specific benchmark score. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or processor types) used for running the analyses or experiments. |
| Software Dependencies | Yes | All analyses were performed in R 4.4.0 (R Core Team, 2024)... we conducted a train-test-validation split of the LLMs using the caret package (Kuhn & Max, 2008)... GAMs were fitted using the mgcv package (Wood, 2017)... We used the mirt package (Chalmers, 2012)... We used the rBayesianOptimization package (Yan, 2024)... We applied factor analysis to the estimated latent abilities using the psych package (William Revelle, 2023)... integrated in EleutherAI's lm-evaluation-harness (Gao et al., 2024)... We used the catR package (Magis & Raiche, 2012) to simulate Computerized Adaptive Tests (CATs)... |
| Experiment Setup | Yes | For comparability across benchmarks, we normalized the benchmark scores on a percent scale. Per benchmark, we discarded subjects with the lowest 0.1% of scores... Items with standard deviations below 1% (too little variability) and mean accuracies above 95% (too easy) were removed. We removed items with point-biserial correlations r_pbis ≤ 0. We subsampled all six benchmarks to 350 items. We used one-dimensional ϑj... We used the mirt package (Chalmers, 2012) with default options to fit the following IRT models to preselected subsets of 350 items each: a 2PL model, a 3PL model with an additional lower asymptote parameter, and a 4PL model with an additional upper asymptote parameter. We separately used the maximum a posteriori (MAP) method as well as the expected a posteriori sum (EAPsum) method to estimate each benchmark-specific one-dimensional latent ability... We used the rBayesianOptimization package (Yan, 2024) with default settings to fine-tune the hyperparameters (µ, ν, τ) on the validation set to minimize the term RMSE(ϑ, s) + λk... We tested λ ∈ {0.01, 0.005, 0.001} and found 0.005 to yield a good trade-off between size and accuracy. Fine-tuning was done using unsloth on 4-bit quantized instruct models. |
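The cross-validated subsampling procedure quoted in the Pseudocode row (Section 2.2 of the paper: sample k items, compute subtest scores s_j per subject j, and check how well they recover the full score) can be sketched as follows. This is a Python illustration on synthetic data, not the authors' R pipeline; the degree-1 polynomial fit stands in for their mgcv GAMs.

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_rmse(acc, k, n_reps=100):
    """Mean RMSE (percentage points) when reconstructing full benchmark
    scores from k randomly sampled items, over n_reps random subsamples.
    acc is an (n_subjects, n_items) binary accuracy matrix.
    A degree-1 polynomial fit stands in for the paper's GAMs (mgcv)."""
    n_subjects, n_items = acc.shape
    full = acc.mean(axis=1) * 100               # full score, percent scale
    rmses = []
    for _ in range(n_reps):
        items = rng.choice(n_items, size=k, replace=False)
        sub = acc[:, items].mean(axis=1) * 100  # subtest score s_j per subject j
        coef = np.polyfit(sub, full, deg=1)     # fit subtest -> full score
        pred = np.polyval(coef, sub)
        rmses.append(np.sqrt(np.mean((pred - full) ** 2)))
    return float(np.mean(rmses))

# Synthetic data: 200 "subjects" whose latent ability drives item accuracy.
ability = rng.normal(0, 1, size=(200, 1))
difficulty = rng.normal(0, 1, size=(1, 1000))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
acc = (rng.random((200, 1000)) < p_correct).astype(float)

print(subsample_rmse(acc, k=350))
```

Repeating this per candidate k is what lets the paper pick a subtest size (350 items per benchmark) with an acceptable reconstruction error.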
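The Experiment Setup row reports fitting 2PL/3PL/4PL IRT models with mirt and estimating each one-dimensional latent ability ϑj via MAP. As a minimal Python illustration of what that involves (not mirt's implementation; item parameters are treated as known, and the grid search is a simplification):

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function: P(correct | ability theta),
    with discrimination a and difficulty b per item."""
    return 1 / (1 + np.exp(-a * (theta - b)))

def map_ability(responses, a, b, grid=np.linspace(-6, 6, 2001)):
    """MAP estimate of a one-dimensional ability under a standard-normal
    prior, via grid search; item parameters a, b are taken as known."""
    p = p_2pl(grid[:, None], a, b)              # shape (grid, n_items)
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    log_post = log_lik - 0.5 * grid ** 2        # add N(0, 1) log-prior
    return float(grid[np.argmax(log_post)])

# Ten items of increasing difficulty, equal discrimination.
a = np.ones(10)
b = np.linspace(-2, 2, 10)
easy_only = (b < 0).astype(float)               # subject solving only easy items
print(map_ability(easy_only, a, b))
```

The 3PL and 4PL variants add a lower (guessing) and upper asymptote to `p_2pl`; the ability estimate is what metabench returns in place of a raw point score.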