Re-evaluating Open-ended Evaluation of Large Language Models

Authors: Si-Qi Liu, Ian Gemp, Luke Marris, Georgios Piliouras, Nicolas Heess, Marc Lanctot

ICLR 2025

Reproducibility assessment — each entry lists the variable, the result, and the supporting LLM response:
Research Type: Experimental. "In this paper, we provide an empirical simulation-based investigation of the former and lean on game theory for a solution to the latter. We show our method scales to a real-world LLM evaluation dataset (Section 4.2) and provide ratings that are invariant to redundancy and correspond to our intuition in the sense of risk-dominance (Harsanyi & Selten, 1988), with empirical evidence (Appendix F.4). Algorithm 1 describes our simulated model and prompt improvement procedure."
Researcher Affiliation: Industry. Google DeepMind, London, UK.
Pseudocode: Yes. "Algorithm 1: Evolutionary model and prompt selection procedure"
Open Source Code: No. No explicit statement or link for the source code of the methodology described in this paper is provided. The paper mentions the LMSYS data repository for evaluation data, but not for their implementation code.
Open Datasets: Yes. "We evaluate our method on the arena-hard-v0.1 dataset (Li et al., 2024b) with 500 prompts and 17 competing models. The set of prompts as well as model responses are downloaded from the LMSYS data repository (https://huggingface.co/spaces/lmsys/arena-hard-browser)."
Dataset Splits: No. No specific train/test/validation dataset splits are provided. The paper describes the arena-hard-v0.1 dataset with 500 prompts and how pairwise preference ratings were sampled, but not how this data is partitioned for training, validation, or testing of the proposed method.
Hardware Specification: No. No specific hardware details are provided. The conclusion mentions "on commodity hardware", but this is too vague and does not identify specific hardware models or types.
Software Dependencies: No. No specific software dependencies with version numbers are provided. The paper mentions using an "Adam optimizer (Kingma, 2014)", but this refers to the optimization algorithm rather than a specific software library and its version.
Experiment Setup: Yes. "We use the same set of hyper-parameters for all our experiments. For affinity-entropy H_p^a(x), we use p = 1 and set the kernel variance to 1e-6. To solve for a max affinity-entropy distribution we use gradient descent. The max affinity-entropy distribution is then used in NE and CCE solving. For NE solving using the LLE approximation, we initialize the temperature τ = 1.0, which is annealed exponentially with a decay rate of 0.95 every 250 gradient updates if and only if the exploitability in the annealed game L_τ(x) (Equation (4)) is at most 1e-5. We set the terminal temperature to τ = 1e-2. We terminate equilibrium solving early if we have found an ϵ-NE with ϵ = 1e-3. For CCE solving, the optimization problem is convex and we minimize Equation (8) directly. For gradient descent, we use an Adam optimizer (Kingma, 2014) with a fixed learning rate of 1e-2 for all steps (maximizing affinity-entropy and equilibrium solving)."
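The conditional temperature-annealing rule quoted above can be made concrete. The following is a minimal sketch, not the authors' code: the function name and arguments are illustrative, and the exploitability value is assumed to be computed elsewhere for the annealed game L_τ(x). It applies the quoted schedule: decay τ by a factor of 0.95 every 250 gradient updates, but only while exploitability is at most 1e-5, and never below the terminal temperature 1e-2.

```python
def anneal_temperature(tau, step, exploitability,
                       decay=0.95, period=250,
                       exploit_tol=1e-5, tau_min=1e-2):
    """Conditional exponential annealing as described in the setup.

    Every `period` gradient updates, multiply tau by `decay` -- but only
    if the current exploitability of the annealed game is at most
    `exploit_tol`. Never let tau drop below the terminal value `tau_min`.
    """
    if step > 0 and step % period == 0 and exploitability <= exploit_tol:
        tau = max(tau * decay, tau_min)
    return tau
```

In a training loop, this would be called once per gradient update with the current step count and the most recent exploitability estimate, so that τ only shrinks once the annealed game is nearly solved at the current temperature.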