Re-evaluating Open-ended Evaluation of Large Language Models

Authors: Si-Qi Liu, Ian Gemp, Luke Marris, Georgios Piliouras, Nicolas Heess, Marc Lanctot

ICLR 2025

Reproducibility assessment — each entry lists the variable, the result, and the supporting LLM response:
Research Type: Experimental. "In this paper, we provide an empirical simulation-based investigation of the former and lean on game theory for a solution to the latter. We show our method scales to a real-world LLM evaluation dataset (Section 4.2) and provide ratings that are invariant to redundancy and correspond to our intuition in the sense of risk-dominance (Harsanyi & Selten, 1988), with empirical evidence (Appendix F.4). Algorithm 1 describes our simulated model and prompt improvement procedure."
Researcher Affiliation: Industry. Google DeepMind, London, UK.
Pseudocode: Yes. "Algorithm 1: Evolutionary model and prompt selection procedure"
Open Source Code: No. No explicit statement or link for the source code of the methodology described in this paper is provided. The paper mentions the LMSYS data repository for evaluation data, but not for their implementation code.
Open Datasets: Yes. "We evaluate our method on the arena-hard-v0.1 dataset (Li et al., 2024b) with 500 prompts and 17 competing models. The set of prompts as well as model responses are downloaded from the LMSYS data repository (https://huggingface.co/spaces/lmsys/arena-hard-browser)."
Dataset Splits: No. No specific train/test/validation dataset splits are provided. The paper describes the arena-hard-v0.1 dataset with 500 prompts and how pairwise preference ratings were sampled, but not how this data is partitioned for training, validation, or testing of the proposed method.
Hardware Specification: No. No specific hardware details are provided. The conclusion mentions "on commodity hardware", but this is too vague and does not identify specific hardware models or types.
Software Dependencies: No. No specific software dependencies with version numbers are provided. The paper mentions using an "Adam optimizer (Kingma, 2014)", but this refers to the optimization algorithm rather than a specific software library and its version.
Experiment Setup: Yes. "We use the same set of hyper-parameters for all our experiments. For affinity-entropy H_p^a(x), we use p = 1 and set the kernel variance to 1e-6. To solve for a max affinity-entropy distribution we use gradient descent. The max affinity-entropy distribution is then used in NE and CCE solving. For NE solving using the LLE approximation, we initialize the temperature τ = 1.0, which is annealed exponentially with a decay rate of 0.95 every 250 gradient updates if and only if the exploitability in the annealed game L_τ(x) (Equation (4)) is at most 1e-5. We set the terminal temperature to τ = 1e-2. We terminate equilibrium solving early if we have found an ϵ-NE with ϵ = 1e-3. For CCE solving, the optimization problem is convex and we minimize Equation (8) directly. For gradient descent, we use an Adam optimizer (Kingma, 2014) with a fixed learning rate of 1e-2 for all steps (maximizing affinity-entropy and equilibrium solving)."
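The conditional temperature-annealing rule quoted above can be made concrete. The following is a minimal sketch, not the authors' code: the function name and arguments are illustrative, and the exploitability value is assumed to be computed elsewhere for the annealed game L_τ(x). It applies the quoted schedule: decay τ by a factor of 0.95 every 250 gradient updates, but only while exploitability is at most 1e-5, and never below the terminal temperature 1e-2.

```python
def anneal_temperature(tau, step, exploitability,
                       decay=0.95, period=250,
                       exploit_tol=1e-5, tau_min=1e-2):
    """Conditional exponential annealing as described in the setup.

    Every `period` gradient updates, multiply tau by `decay` -- but only
    if the current exploitability of the annealed game is at most
    `exploit_tol`. Never let tau drop below the terminal value `tau_min`.
    """
    if step > 0 and step % period == 0 and exploitability <= exploit_tol:
        tau = max(tau * decay, tau_min)
    return tau
```

In a training loop, this would be called once per gradient update with the current step count and the most recent exploitability estimate, so that τ only shrinks once the annealed game is nearly solved at the current temperature.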