AutoBencher: Towards Declarative Benchmark Construction
Authors: Xiang Lisa Li, Farzaan Kaiyom, Evan Zheran Liu, Yifan Mai, Percy Liang, Tatsunori Hashimoto
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These descriptions are optimized to improve the declared desiderata. We use AutoBencher (powered by GPT-4) to create datasets for math, multilinguality, knowledge, and safety. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that elicit 22% more model errors (i.e., difficulty) than existing benchmarks. On the novelty end, AutoBencher also helps identify specific gaps not captured by existing benchmarks: e.g., Gemini-Pro has knowledge gaps on Permian Extinction and Fordism while GPT-4o fails to decline harmful requests about cryptocurrency scams. |
| Researcher Affiliation | Academia | Xiang Lisa Li, Farzaan Kaiyom, Evan Zheran Liu, Yifan Mai, Percy Liang, Tatsunori Hashimoto, Stanford University |
| Pseudocode | Yes | We present the full AutoBencher algorithm in Algorithm 1. Adaptive search refers to lines 1 to 7 in Algorithm 1. Algorithm 1: AutoBencher |
| Open Source Code | Yes | Code is available at https://github.com/XiangLi1999/AutoBencher.git |
| Open Datasets | Yes | For the capability settings, HUMANBENCH contains datasets in MMLU (Hendrycks et al., 2021), including 4 history subjects (e.g., high school world history), 4 economy subjects (e.g., econometrics), and 7 science subjects (e.g., college physics). See the complete list in Appendix C. For mathematics, HUMANBENCH contains 7 datasets from the Mathematics Dataset (Saxton et al., 2019), which covers basic math capabilities: algebra, arithmetic, calculus, probability, comparison, measurement, numbers. For multilinguality, we compare with XOR QA (Asai et al., 2021), a multilingual question-answering dataset covering 7 diverse languages. We compare with the test set, split by language into 7 datasets. For the safety setting, we compare with XSTest (Röttger et al., 2024) and HarmBench (Mazeika et al., 2024), which are popular safety datasets that evaluate whether a model can accurately reject harmful requests. |
| Dataset Splits | No | The paper mentions generating a |
| Hardware Specification | Yes | This is not computationally expensive given that we evaluated on 17 models. For the cost of generating datasets: each run of the AutoBencher agent uses around 750K tokens, which costs around $15. Among them, 43K tokens are used for proposing topics, 576K tokens are used for constructing datasets, and 147K for evaluating the candidate LM. This dataset construction cost is not expensive compared with expert-curated datasets, which often cost thousands of dollars. For the cost of evaluating all the candidate LMs on the new dataset, the computational cost is also moderate. There are two places where we evaluate the candidate models on our AutoBencher-generated datasets: dataset selection and final evaluation of the selected dataset. In dataset selection, we generate a small dataset (|D| = 50) for each description to reduce the cost (see line 333 in the paper, lines 6 and 12 in Algorithm 1), and there are roughly 20 dataset descriptions for each AutoBencher run. The final evaluation on the selected dataset roughly involves |D| ≈ 500 queries and 17 models. We use vLLM for model inference, and API calls for LLM-as-judge. We observe that LLM-as-judge is the actual compute time bottleneck, but this part can be parallelized significantly across models and across queries. As a result, our implementation is very time-efficient: dataset selection takes around 1 h on 1 A100 GPU and $30 in API calls, and the final evaluation takes 30 min on 1 A100 GPU and $15 in API calls. |
| Software Dependencies | Yes | AutoBencher uses gpt-4-0125-preview (OpenAI, 2023) as LMevaluator (at temperature 0) to propose topics and generate the datasets. ... We use vLLM for model inference, and API calls for LLM-as-judge. ... We augment LMevaluator with Python math libraries (e.g., sympy, scipy, numpy). |
| Experiment Setup | Yes | AutoBencher uses gpt-4-0125-preview (OpenAI, 2023) as LMevaluator (at temperature 0) to propose topics and generate the datasets. To construct a capability dataset, we perform N = 8 iterations of adaptive search, each proposing K = 5 descriptions, and we generate |Dc| = 50 examples per description. In the optimization objective, β1 = 1 and β2 = 10 are chosen so that the three terms have similar scales. To construct a safety dataset, we perform N = 10 iterations of adaptive search, each proposing K = 10 descriptions, and we generate 10 examples for each description. |
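
The adaptive search described in the Pseudocode and Experiment Setup rows (N iterations, K proposed descriptions per iteration, a small dataset generated and scored per description) can be sketched as the loop below. This is a minimal illustration, not the paper's implementation: the functions `propose_descriptions`, `generate_dataset`, and `score_dataset` are hypothetical placeholders for the GPT-4 evaluator-LM calls, and the defaults mirror the capability setting quoted above (N = 8, K = 5, |Dc| = 50).

```python
def adaptive_search(propose_descriptions, generate_dataset, score_dataset,
                    n_iters=8, k=5, examples_per_desc=50):
    """Hedged sketch of AutoBencher-style adaptive search (cf. Algorithm 1)."""
    history = []  # (description, score) pairs fed back to the proposer LM
    best_desc, best_score = None, float("-inf")
    for _ in range(n_iters):
        # The evaluator LM proposes K new dataset descriptions, conditioned
        # on previously tried descriptions and their scores.
        for desc in propose_descriptions(history, k=k):
            # A small dataset (|Dc| examples) is generated per description
            # to keep the per-iteration evaluation cost low.
            dataset = generate_dataset(desc, n=examples_per_desc)
            score = score_dataset(dataset)  # operationalized desiderata
            history.append((desc, score))
            if score > best_score:
                best_desc, best_score = desc, score
    return best_desc, best_score
```

A toy run with stub functions (e.g., a proposer that numbers its descriptions and a scorer that prefers later ones) returns the highest-scoring description, showing the selection logic independent of any LM.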
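
The token budget quoted in the Hardware Specification row can be sanity-checked with simple arithmetic; the three component figures sum to 766K, consistent with the "around 750K tokens" total per run.

```python
# Per-run token budget, taken directly from the Hardware Specification row.
tokens = {
    "propose_topics": 43_000,       # topic proposal
    "construct_datasets": 576_000,  # dataset construction
    "evaluate_candidate": 147_000,  # evaluating the candidate LM
}
total = sum(tokens.values())
print(total)  # 766000, close to the quoted "around 750K tokens"
```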