Investigating Non-Transitivity in LLM-as-a-Judge

Authors: Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% → 96.4% and 82.1% → 86.3%, respectively).
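The aggregation step described here, fitting a Bradley-Terry model to round-robin pairwise preferences, can be sketched as follows. This is a minimal illustration using the standard MM (Zermelo) fixed-point update, not the authors' implementation; the win-count matrix is a made-up example.

```python
import numpy as np

def bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i was preferred over model j.
    Uses the MM (Zermelo) update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is model i's total wins and n_ij the i-vs-j match count.
    Returns strengths normalized to sum to 1 (higher = stronger).
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        p_new = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum() - wins[i, i]
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new[i] = total_wins / denom
        p_new /= p_new.sum()
        converged = np.abs(p_new - p).max() < tol
        p = p_new
        if converged:
            break
    return p

# Toy round-robin among 3 models (10 comparisons per pair).
wins = np.array([[0, 8, 9],
                 [2, 0, 8],
                 [1, 2, 0]], dtype=float)
strengths = bradley_terry(wins)
```

Because every pair meets in a round-robin, the fitted strengths induce a single total order even when the raw pairwise preferences are non-transitive.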
Researcher Affiliation | Academia | AI Centre, UCL; UK AI Security Institute. Correspondence to: Yi Xu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Swiss-Wise Iterative Matchmaking (SWIM) tournament.
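The paper's Algorithm 1 is not reproduced here, but SWIM builds on Swiss-system tournaments, in which each round pairs similarly-scored entrants that have not yet met. A generic single-round Swiss pairing step (an assumption-laden sketch, not the paper's algorithm) might look like:

```python
def swiss_pairings(scores, played):
    """One round of Swiss-system pairing.

    Sort entrants by current score and greedily pair each unpaired
    entrant with the nearest-scored opponent it has not yet faced.

    scores: dict mapping model name -> current score
    played: set of frozensets of model pairs that have already met
    Returns a list of (model_a, model_b) pairs for this round.
    """
    order = sorted(scores, key=scores.get, reverse=True)
    pairs, used = [], set()
    for i, a in enumerate(order):
        if a in used:
            continue
        for b in order[i + 1:]:
            if b not in used and frozenset((a, b)) not in played:
                pairs.append((a, b))
                used.update((a, b))
                break
    return pairs

# Example: A and B lead but already met, so A drops to C and B to D.
scores = {"A": 2, "B": 2, "C": 1, "D": 0}
played = {frozenset(("A", "B"))}
round_pairs = swiss_pairings(scores, played)
```

Compared with a full round-robin, Swiss-style matchmaking needs far fewer judge calls per round while still concentrating comparisons among closely-ranked models.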
Open Source Code | Yes | The code and data are available at https://github.com/yix8/llm-nontransitivity.
Open Datasets | Yes | We use the AlpacaEval dataset (Li et al., 2023), which includes a wide variety of instruction types, such as information search tasks and coding problems. Participating models: we evaluate 20 models that appear on both the AlpacaEval and Chatbot Arena leaderboards (see Appendix A.1 for details).
Dataset Splits | No | The paper uses pre-generated outputs from the AlpacaEval dataset and analyzes pairwise comparisons of models. It samples prompts for evaluation but does not describe traditional training, validation, or test splits for its own methodology.
Hardware Specification | No | We also thank the OpenAI Researcher Access Program for providing the OpenAI API credits used in this project. This indicates the use of OpenAI's API services, which abstract away the underlying hardware. No specific hardware (e.g., GPU models, CPU types) is mentioned for the authors' experiments.
Software Dependencies | No | The paper references the LLM judges used (e.g., GPT-4-Turbo, GPT-3.5-Turbo) and frameworks (AlpacaEval, Chatbot Arena), but it does not specify software dependencies with version numbers (e.g., programming languages, libraries, or solvers) that would be needed for replication.
Experiment Setup | Yes | We examine non-transitivity in judgments using two models: GPT-4-Turbo and GPT-3.5-Turbo (OpenAI et al., 2023), both with the temperature set to 0. The detailed prompt is provided in Appendix G.1. To mitigate position bias, we employ position switching, where each comparison is evaluated with responses in both orders. The final preference score is calculated as the mean of these balanced evaluations. To reduce the impact of API randomness, we invoke the judge function twice for each order configuration.
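The debiasing procedure in that setup (position switching plus repeated calls per ordering) can be sketched as below. Here `judge` is a hypothetical callable, standing in for an LLM API call, that returns a score in [0, 1] for the first-listed response; it is an assumption for illustration, not the paper's interface.

```python
def debiased_preference(judge, prompt, resp_a, resp_b, n_calls=2):
    """Position-switched preference score for resp_a over resp_b.

    judge(prompt, first, second) is assumed to score the *first*
    response in [0, 1]. Each ordering is queried n_calls times and
    averaged to dampen API nondeterminism; the two orderings are then
    averaged so that any systematic first-position bias cancels.
    """
    a_first = sum(judge(prompt, resp_a, resp_b)
                  for _ in range(n_calls)) / n_calls
    b_first = sum(judge(prompt, resp_b, resp_a)
                  for _ in range(n_calls)) / n_calls
    # The judge scores whichever response is listed first, so the
    # swapped ordering is inverted before averaging.
    return (a_first + (1.0 - b_first)) / 2.0

# A toy judge with a constant +0.1 first-position bias around a true
# preference of 0.6 for response "a"; switching cancels the bias.
biased_judge = lambda prompt, first, second: 0.7 if first == "a" else 0.5
score = debiased_preference(biased_judge, "Which is better?", "a", "b")
```

With four judge calls per comparison (two per ordering), any additive position bias drops out of the averaged score by construction.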