From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Authors: Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and ability to separate models. We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts curated by BenchBuilder. Arena-Hard-Auto provides 3x higher separation of model performances compared to MT-Bench and achieves 98.6% correlation with human preference rankings.
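The "separation of model performances" mentioned above can be illustrated with a short sketch. Assuming separability is measured as the fraction of model pairs whose score confidence intervals do not overlap (the function name and interface here are illustrative, not taken from the released code):

```python
def separability(lower_ci, upper_ci):
    """Fraction of model pairs whose confidence intervals are disjoint.

    lower_ci/upper_ci: parallel lists, one (lower, upper) bound per model.
    A higher value means the benchmark distinguishes more model pairs.
    """
    n = len(lower_ci)
    separated, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            # Two intervals are disjoint iff one ends before the other begins.
            if upper_ci[i] < lower_ci[j] or upper_ci[j] < lower_ci[i]:
                separated += 1
    return separated / total
```

With three models whose intervals are [0, 1], [2, 3], and [4, 5], every pair is separated and the metric is 1.0; overlapping intervals reduce it toward 0.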
Researcher Affiliation | Academia | University of California, Berkeley. Correspondence to: Tianle Li <EMAIL>.
Pseudocode | No | The paper describes methods and processes (e.g., the BenchBuilder pipeline in Figure 2, the LLM-judge system instruction in Appendix G) but does not include any explicitly labeled pseudocode or algorithm blocks, or other structured code-like procedures.
Open Source Code | Yes | Our code is available at https://github.com/lmarena/arena-hard-auto. We open-source both the BenchBuilder pipeline and the Arena-Hard-Auto benchmark.
Open Datasets | Yes | We apply BenchBuilder to crowd-sourced datasets, both Chatbot Arena (Chiang et al., 2024) and WildChat-1M (Zhao et al., 2024), demonstrating that it can robustly generate high-quality benchmarks that differentiate models.
Dataset Splits | No | The paper describes the curation of benchmarks such as Arena-Hard-Auto (500 prompts) and Wild-Hard-Auto (250 prompts) as evaluation sets, but does not specify any further training/validation/test splits of these benchmarks or other datasets for the experiments conducted.
Hardware Specification | No | The paper mentions costs associated with using LLM APIs (GPT-4-Turbo, Llama-3-70B-Instruct) for annotation but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments or the pipeline.
Software Dependencies | Yes | To validate qualities assigned by GPT-4-Turbo, we construct ground-truth labels for 200 sampled queries by collecting majority votes from GPT-4o (OpenAI, 2024b), Claude-3-Opus, and Gemini-1.5-Pro (Reid et al., 2024). In Table 4, the models are GPT-4-Turbo (gpt-4-1106-preview), Claude-3-Opus, Gemini-1.5-Pro (gemini-1.5-pro-0514), and Llama-3-70B (llama-3-70b-instruct).
Experiment Setup | Yes | We use GPT-4-Turbo (OpenAI, 2023b) as a judge to assign a quality score to each prompt and remove low-scoring ones: prompts with a score less than 6 and topic clusters with a mean score less than 5 are discarded. To construct a 500-prompt benchmark, we sample 2 prompts each from 250 randomly selected clusters. We evaluate a model on a given prompt using a pairwise comparison against a strong baseline model (e.g., GPT-4-0314); a judge model (e.g., GPT-4-Turbo or Gemini-1.5-Pro) then scores each output by rating its preference between the pair on a 5-point Likert scale. To ensure consistency, we utilize chain-of-thought (Wei et al., 2023) prompting. We adopt the Bradley & Terry (1952) model to produce the final model scores. We use seed 42 for all experiments in this paper unless stated otherwise.
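The final aggregation step above fits a Bradley-Terry model over the pairwise judge verdicts. As a minimal sketch (not the authors' implementation), the classic minorization-maximization (Zermelo) updates recover model strengths from a matrix of pairwise win counts:

```python
def bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] = number of times model i beat model j.
    Returns a list of positive strengths; higher means stronger.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iters):
        new_p = []
        for i in range(n):
            num = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            new_p.append(num / den if den > 0 else p[i])
        total = sum(new_p)
        # Normalize to fix the scale (strengths are only identifiable
        # up to a multiplicative constant).
        p = [x * n / total for x in new_p]
    return p
```

For example, `bradley_terry([[0, 8], [2, 0]])` converges to a 4:1 strength ratio, matching the 8-to-2 head-to-head record. In practice the paper also reports bootstrap confidence intervals around these scores; a Likert verdict would first be mapped to (possibly fractional) win counts.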