Re-evaluating Open-ended Evaluation of Large Language Models
Authors: Si-Qi Liu, Ian Gemp, Luke Marris, Georgios Piliouras, Nicolas Heess, Marc Lanctot
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we provide an empirical simulation-based investigation of the former and lean on game theory for a solution to the latter. We show our method scales to a real-world LLM evaluation dataset (Section 4.2) and provide ratings that are invariant to redundancy and correspond to our intuition in the sense of risk-dominance (Harsanyi & Selten, 1988), with empirical evidence (Appendix F.4). Algorithm 1 describes our simulated model and prompt improvement procedure. |
| Researcher Affiliation | Industry | Google DeepMind, London, UK, EMAIL |
| Pseudocode | Yes | Algorithm 1 Evolutionary model and prompt selection procedure |
| Open Source Code | No | No explicit statement or link for the source code of the methodology described in this paper is provided. The paper mentions the LMSYS data repository for evaluation data, but not for their implementation code. |
| Open Datasets | Yes | We evaluate our method on the arena-hard-v0.1 dataset (Li et al., 2024b) with 500 prompts and 17 competing models. The set of prompts as well as model responses are downloaded from the LMSYS data repository (https://huggingface.co/spaces/lmsys/arena-hard-browser). |
| Dataset Splits | No | No specific train/test/validation dataset splits are provided. The paper describes the arena-hard-v0.1 dataset with 500 prompts and how pairwise preference ratings were sampled, but not how this data is partitioned for training, validation, or testing of their proposed method. |
| Hardware Specification | No | No specific hardware details are provided. The conclusion mentions 'on commodity hardware', but this is too vague and does not provide specific model numbers or types. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions using an 'Adam optimizer Kingma (2014)', but this refers to the optimization algorithm rather than a specific software library and its version. |
| Experiment Setup | Yes | We use the same set of hyper-parameters for all our experiments. For affinity-entropy H_p^a(x), we use p = 1 and set the kernel variance to 1e-6. To solve for a max affinity-entropy distribution we use gradient descent. The max affinity-entropy distribution is then used in NE and CCE solving. For NE solving using LLE approximation, we initialize temperature τ = 1.0, which is annealed exponentially with a decay rate of 0.95 every 250 gradient updates if and only if the exploitability in the annealed game Lτ(x) (Equation (4)) is at most 1e-5. We set the terminal temperature to τ = 1e-2. We terminate equilibrium solving early if we have found an ϵ-NE with ϵ = 1e-3. For CCE solving, the optimization problem is convex and we minimize Equation 8 directly. For gradient descent, we use an Adam optimizer (Kingma, 2014) with a fixed learning rate of 1e-2 for all steps (maximizing affinity-entropy and equilibrium solving). |
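The temperature-annealing schedule described in the Experiment Setup row can be sketched as follows. This is a minimal illustration of the stated rule (decay τ by 0.95 every 250 gradient updates only when the annealed-game exploitability is at most 1e-5, with a terminal temperature of 1e-2); the function and variable names are hypothetical, not taken from the authors' implementation.

```python
# Hypothetical sketch of the paper's annealing rule; constants are the
# hyper-parameters quoted in the Experiment Setup row.
DECAY_RATE = 0.95        # exponential decay factor for τ
DECAY_EVERY = 250        # gradient updates between decay checks
ANNEAL_THRESHOLD = 1e-5  # required exploitability in the annealed game Lτ(x)
TERMINAL_TAU = 1e-2      # temperature floor

def anneal_temperature(tau: float, step: int, exploitability: float) -> float:
    """Decay tau every DECAY_EVERY steps iff the annealed-game exploitability
    is at most ANNEAL_THRESHOLD; never decay below TERMINAL_TAU."""
    if step % DECAY_EVERY == 0 and exploitability <= ANNEAL_THRESHOLD:
        tau = max(tau * DECAY_RATE, TERMINAL_TAU)
    return tau
```

Under this sketch, equilibrium solving would additionally terminate early once an ϵ-NE with ϵ = 1e-3 is found, per the quoted setup.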