Investigating Non-Transitivity in LLM-as-a-Judge

Authors: Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% → 96.4% and 82.1% → 86.3%, respectively).
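The aggregation step described here, fitting a Bradley-Terry model to round-robin pairwise preferences, can be sketched as follows. This is a minimal illustration using the standard MM (Zermelo) fixed-point update, not the authors' implementation; the win-count matrix is a made-up example.

```python
import numpy as np

def bradley_terry(wins, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i was preferred over model j.
    Uses the MM (Zermelo) update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is model i's total wins and n_ij the i-vs-j match count.
    Returns strengths normalized to sum to 1 (higher = stronger).
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        p_new = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum() - wins[i, i]
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new[i] = total_wins / denom
        p_new /= p_new.sum()
        converged = np.abs(p_new - p).max() < tol
        p = p_new
        if converged:
            break
    return p

# Toy round-robin among 3 models (10 comparisons per pair).
wins = np.array([[0, 8, 9],
                 [2, 0, 8],
                 [1, 2, 0]], dtype=float)
strengths = bradley_terry(wins)
```

Because every pair meets in a round-robin, the fitted strengths induce a single total order even when the raw pairwise preferences are non-transitive.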
Researcher Affiliation | Academia | AI Centre, UCL; UK AI Security Institute. Correspondence to: Yi Xu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Swiss-Wise Iterative Matchmaking (SWIM) tournament.
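The paper's Algorithm 1 is not reproduced here, but SWIM builds on Swiss-system tournaments, in which each round pairs similarly-scored entrants that have not yet met. A generic single-round Swiss pairing step (an assumption-laden sketch, not the paper's algorithm) might look like:

```python
def swiss_pairings(scores, played):
    """One round of Swiss-system pairing.

    Sort entrants by current score and greedily pair each unpaired
    entrant with the nearest-scored opponent it has not yet faced.

    scores: dict mapping model name -> current score
    played: set of frozensets of model pairs that have already met
    Returns a list of (model_a, model_b) pairs for this round.
    """
    order = sorted(scores, key=scores.get, reverse=True)
    pairs, used = [], set()
    for i, a in enumerate(order):
        if a in used:
            continue
        for b in order[i + 1:]:
            if b not in used and frozenset((a, b)) not in played:
                pairs.append((a, b))
                used.update((a, b))
                break
    return pairs

# Example: A and B lead but already met, so A drops to C and B to D.
scores = {"A": 2, "B": 2, "C": 1, "D": 0}
played = {frozenset(("A", "B"))}
round_pairs = swiss_pairings(scores, played)
```

Compared with a full round-robin, Swiss-style matchmaking needs far fewer judge calls per round while still concentrating comparisons among closely-ranked models.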
Open Source Code | Yes | The code and data are available at https://github.com/yix8/llm-nontransitivity.
Open Datasets | Yes | We use the AlpacaEval dataset (Li et al., 2023), which includes a wide variety of instruction types, such as information search tasks and coding problems. Participating models: we evaluate 20 models that appear on both the AlpacaEval and Chatbot Arena leaderboards (see Appendix A.1 for details).
Dataset Splits | No | The paper uses pre-generated outputs from the AlpacaEval dataset and analyzes pairwise comparisons of models. It samples prompts for evaluation but does not describe traditional training, validation, or test splits for its own methodology.
Hardware Specification | No | We also thank the OpenAI Researcher Access Program for providing the OpenAI API credits used in this project. This indicates the use of OpenAI's API services, which abstract away the underlying hardware. No specific hardware (e.g., GPU models, CPU types) is mentioned for the authors' experiments.
Software Dependencies | No | The paper references the LLM judges used (e.g., GPT-4-Turbo, GPT-3.5-Turbo) and frameworks (AlpacaEval, Chatbot Arena), but it does not specify software dependencies with version numbers (e.g., programming languages, libraries, or solvers) that would be needed for replication.
Experiment Setup | Yes | We examine non-transitivity in judgments using two models: GPT-4-Turbo and GPT-3.5-Turbo (OpenAI et al., 2023), both with the temperature set to 0. The detailed prompt is provided in Appendix G.1. To mitigate position bias, we employ position switching, where each comparison is evaluated with responses in both orders. The final preference score is calculated as the mean of these balanced evaluations. To reduce the impact of API randomness, we invoke the judge function twice for each order configuration.
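The debiasing procedure in that setup (position switching plus repeated calls per ordering) can be sketched as below. Here `judge` is a hypothetical callable, standing in for an LLM API call, that returns a score in [0, 1] for the first-listed response; it is an assumption for illustration, not the paper's interface.

```python
def debiased_preference(judge, prompt, resp_a, resp_b, n_calls=2):
    """Position-switched preference score for resp_a over resp_b.

    judge(prompt, first, second) is assumed to score the *first*
    response in [0, 1]. Each ordering is queried n_calls times and
    averaged to dampen API nondeterminism; the two orderings are then
    averaged so that any systematic first-position bias cancels.
    """
    a_first = sum(judge(prompt, resp_a, resp_b)
                  for _ in range(n_calls)) / n_calls
    b_first = sum(judge(prompt, resp_b, resp_a)
                  for _ in range(n_calls)) / n_calls
    # The judge scores whichever response is listed first, so the
    # swapped ordering is inverted before averaging.
    return (a_first + (1.0 - b_first)) / 2.0

# A toy judge with a constant +0.1 first-position bias around a true
# preference of 0.6 for response "a"; switching cancels the bias.
biased_judge = lambda prompt, first, second: 0.7 if first == "a" else 0.5
score = debiased_preference(biased_judge, "Which is better?", "a", "b")
```

With four judge calls per comparison (two per ordering), any additive position bias drops out of the averaged score by construction.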