WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
Authors: Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WILDBENCH consists of 1,024 examples carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WILDBENCH, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WILDBENCH results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both Arena-Hard's 0.91 and AlpacaEval 2.0's 0.89 for length-controlled win rates, as well as its 0.87 for regular win rates. |
| Researcher Affiliation | Collaboration | Allen Institute for AI; University of Washington |
| Pseudocode | No | The paper includes prompt templates for evaluation in Appendix D and E, which show structured instructions for LLM judges, but these are not pseudocode or algorithm blocks describing a computational procedure in a code-like format. They are instructions for an LLM. |
| Open Source Code | Yes | Our evaluation results on the public subset of WILDBENCH can be reproduced using evaluation scripts available at https://github.com/allenai/WildBench/. We have included generation scripts for each model under the folder https://github.com/allenai/WildBench/tree/main/scripts, and the scripts for evaluating generations can be found at https://github.com/allenai/WildBench/tree/main/evaluation. |
| Open Datasets | Yes | We sourced tasks from the WildChat dataset (Zhao et al., 2024), which comprises one million human-chatbot conversations from real users. [...] The dataset documentation, metadata, and the public subset of WILDBENCH can be found at https://huggingface.co/datasets/allenai/WildBench/viewer/v2. |
| Dataset Splits | No | WILDBENCH consists of 1,024 examples carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WILDBENCH, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. [...] WB-Reward (Mix) is the average of the rewards from these three baselines on 1024 examples, providing a more robust performance evaluation on WILDBENCH. The paper mentions selecting 1,024 examples for the benchmark but does not specify any training/validation/test splits, as the entire dataset is used for evaluation. |
| Hardware Specification | No | The paper notes that GPT-4-Turbo, one of the chatbots behind WildChat, supports up to 128K context tokens and 4K output tokens, but it does not specify any hardware used for the experiments or evaluations described in the paper. |
| Software Dependencies | No | The paper mentions using GPT-4-Turbo (OpenAI, 2023), Claude-3-Sonnet, and Opus (Anthropic, 2024) as LLM judges and Sentence-BERT (Reimers & Gurevych, 2019) for sentence embeddings. However, specific version numbers for these software dependencies are not provided in the main text. |
| Experiment Setup | Yes | We employ two primary metrics: WB-Reward for pairwise comparisons and WB-Score for individual scoring. WB-Reward is based on pairwise comparisons between LLMs, with five possible outcomes: A is much/slightly better/worse than B, or Tie. [...] To mitigate the bias towards longer outputs, a common issue in LLM-as-a-judge evaluations (Dubois et al., 2024), we introduced a simple length-penalty method, converting slight wins/losses to ties when the winner's output is significantly longer than the loser's. [...] We experimented with different K (100, 200, 500, 1000, inf) in the length penalty method. We found that K = 500 is the best choice, as it achieves the highest correlation with human judgments. |
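The length-penalty rule quoted in the Experiment Setup row can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the numeric reward mapping (+1/+0.5/0/-0.5/-1 for the five outcomes) and measuring length in characters are assumptions made here for the sketch; the source only states that slight wins/losses become ties when the winner's output exceeds the loser's by more than K (with K = 500 chosen).

```python
# Sketch of the WB-Reward length-penalty rule described above.
# Assumptions (not from the paper text quoted here): outcomes map to
# rewards +1, +0.5, 0, -0.5, -1, and length is counted in characters.

REWARD = {
    "A_much_better": 1.0,
    "A_slightly_better": 0.5,
    "tie": 0.0,
    "B_slightly_better": -0.5,
    "B_much_better": -1.0,
}

def length_penalized_reward(outcome: str, len_a: int, len_b: int,
                            k: int = 500) -> float:
    """Convert a slight win/loss to a tie when the winner's output is
    longer than the loser's by more than k characters."""
    if outcome == "A_slightly_better" and len_a - len_b > k:
        return 0.0
    if outcome == "B_slightly_better" and len_b - len_a > k:
        return 0.0
    return REWARD[outcome]
```

Averaging these per-example rewards over the 1,024 examples, and then over the three baseline models, would yield the WB-Reward (Mix) aggregate mentioned in the Dataset Splits row.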