Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

Authors: Yangsibo Huang, Milad Nasr, Anastasios Nikolas Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Liu, Ion Stoica, Florian Tramèr, Chiyuan Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than 95% accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. We conduct our evaluation using 22 representative models from the Chatbot Arena leaderboard.
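The two-step attack quoted above (deanonymize the reply, then vote accordingly) can be illustrated with a minimal simulation. This is a hypothetical sketch, not the paper's code: it assumes a detector stand-in with 95% accuracy, a simplified Elo-style rating update, and illustrative names (`detect_model`, `K = 4`) throughout.

```python
import random

K = 4  # Elo update step (illustrative constant, not from the paper)

def detect_model(response, target="target-model", accuracy=0.95):
    """Stand-in for the paper's reply-deanonymization classifier:
    returns the true model with the assumed 95% accuracy."""
    true_model = response["model"]
    if random.random() < accuracy:
        return true_model
    return "other-model" if true_model == target else target

def adversarial_vote(battle, target="target-model"):
    """Step 2 of the attack: vote for whichever side the detector
    identifies as the target model."""
    if detect_model(battle["a"], target) == target:
        return "a"
    return "b"

def update_elo(ratings, model_a, model_b, winner):
    """Simplified Elo update for one battle."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
    score_a = 1.0 if winner == "a" else 0.0
    ratings[model_a] += K * (score_a - expected_a)
    ratings[model_b] -= K * (score_a - expected_a)

random.seed(0)
ratings = {"target-model": 1000.0, "other-model": 1000.0}
for _ in range(1000):  # roughly the vote budget the paper reports
    a_model = random.choice(["target-model", "other-model"])
    b_model = "other-model" if a_model == "target-model" else "target-model"
    battle = {"a": {"model": a_model}, "b": {"model": b_model}}
    winner = adversarial_vote(battle)
    update_elo(ratings, a_model, b_model, winner)

print(ratings["target-model"] > ratings["other-model"])  # target rating inflated
```

Because the detector is right ~95% of the time, the adversary's votes are strongly correlated with the target side, so the target's rating drifts upward over the thousand simulated battles.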
Researcher Affiliation Collaboration ¹Google, ²UC Berkeley, ³Anthropic, ⁴Carnegie Mellon University, ⁵Stanford University, ⁶ETH Zurich.
Pseudocode No The paper describes the attack methodology and mitigations in natural language. There are no explicitly labeled sections or figures for pseudocode or algorithms.
Open Source Code No The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets Yes Table 1. Types of prompts used to build the training-based detector, their sources, and corresponding examples. Category | Source | Type | Example: Normal chat, high-resource languages | LMSYS-Chat-1M (Zheng et al., 2023a) | English | "How can identity protection services help protect me against identity thef [...]"; Specialty chat | Alpaca; Code (Hendrycks et al., 2021) | Coding | [...]; MATH (Hendrycks et al., 2021) | Math | [...]; AdvBench (Zou et al., 2023) | Safety-violating | [...]
Dataset Splits Yes We then train a logistic regression classifier for each prompt-model pair (P, M) using an 80/20 train/test split.
Hardware Specification No The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments or simulations. It mentions querying model providers' APIs but not their own computational resources.
Software Dependencies Yes We use the logistic regression model from the scikit-learn library8 with its default hyperparameters and a random state set to 42. (footnote 8: scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html)
Experiment Setup Yes We sample 200 prompts per category and gather 50 responses per model for each prompt (details on model access and decoding parameters are provided in Appendix C.1). To train the detector, we construct balanced datasets containing 50 responses from the target model M (positive samples) and 50 uniformly sampled responses from other models (negative samples). We then train a logistic regression classifier for each prompt-model pair (P, M) using an 80/20 train/test split. We use the logistic regression model from the scikit-learn library8 with its default hyperparameters and a random state set to 42.
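The detector-training setup quoted above (balanced 50/50 positives and negatives per prompt-model pair, an 80/20 train/test split, and scikit-learn's `LogisticRegression` with default hyperparameters and `random_state=42`) can be sketched as follows. The toy response texts and the TF-IDF featurization are assumptions for illustration; the paper does not specify its text features here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data for one prompt-model pair (P, M): 50 responses from the
# target model M (positives) and 50 from other models (negatives).
positives = [f"measured reply about topic {i}" for i in range(50)]
negatives = [f"different wording on topic {i}" for i in range(50)]
texts = positives + negatives
labels = [1] * 50 + [0] * 50

# 80/20 train/test split, matching the quoted setup.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Default-hyperparameter logistic regression with random_state=42, as quoted;
# the TF-IDF step in front of it is our assumption.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

In the paper's setting one such classifier is trained per prompt-model pair, so the full detector is an ensemble of many small models like this one.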