Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
Authors: Yangsibo Huang, Milad Nasr, Anastasios Nikolas Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Liu, Ion Stoica, Florian Tramèr, Chiyuan Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than 95% accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. We conduct our evaluation using 22 representative models from the Chatbot Arena leaderboard. |
| Researcher Affiliation | Collaboration | 1Google 2UC Berkeley 3Anthropic 4Carnegie Mellon University 5Stanford University 6ETH Zurich. |
| Pseudocode | No | The paper describes the attack methodology and mitigations in natural language. There are no explicitly labeled sections or figures for pseudocode or algorithms. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Table 1. Types of prompts used to build the training-based detector, their sources, and corresponding examples. Normal chat, high-resource languages — LMSYS-Chat-1M (Zheng et al., 2023a), English: "How can identity protection services help protect me against identity thef [...]"; Specialty chat — Alpaca Code (Hendrycks et al., 2021), Coding [...]; MATH (Hendrycks et al., 2021), Math [...]; Adv Bench (Zou et al., 2023), Safety-violating [...] |
| Dataset Splits | Yes | We then train a logistic regression classifier for each prompt-model pair (P, M) using an 80/20 train/test split. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments or simulations. It mentions querying model providers' APIs but not their own computational resources. |
| Software Dependencies | Yes | We use the logistic regression model from the scikit-learn library (footnote 8) with its default hyperparameters and a random state set to 42. (footnote 8: scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) |
| Experiment Setup | Yes | We sample 200 prompts per category and gather 50 responses per model for each prompt (details on model access and decoding parameters are provided in Appendix C.1). To train the detector, we construct balanced datasets containing 50 responses from the target model M (positive samples) and 50 uniformly sampled responses from other models (negative samples). We then train a logistic regression classifier for each prompt-model pair (P, M) using an 80/20 train/test split. We use the logistic regression model from the scikit-learn library with its default hyperparameters and a random state set to 42. |
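The detector-training setup quoted above (balanced 50/50 positive/negative responses per prompt-model pair, an 80/20 split, and scikit-learn's `LogisticRegression` with default hyperparameters and `random_state=42`) can be sketched as follows. This is a minimal reconstruction, not the authors' released code; the bag-of-words featurization via `CountVectorizer` is our assumption, as the paper excerpt does not specify how responses are converted to features.

```python
# Hedged sketch of training one binary detector per (prompt, model) pair,
# following the setup described in the reproducibility table above.
# Assumption: responses are featurized with a simple bag-of-words; the
# paper's actual featurization may differ.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_detector(target_responses, other_responses):
    """Train a classifier that flags responses from the target model M.

    target_responses: ~50 responses from the target model (positives).
    other_responses: ~50 responses sampled from other models (negatives).
    Returns (vectorizer, classifier, held-out accuracy).
    """
    texts = list(target_responses) + list(other_responses)
    labels = [1] * len(target_responses) + [0] * len(other_responses)

    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(texts)

    # 80/20 train/test split, as stated in the paper excerpt.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42, stratify=labels
    )

    # Default hyperparameters, random_state=42, as stated in the excerpt.
    clf = LogisticRegression(random_state=42)
    clf.fit(X_train, y_train)
    return vectorizer, clf, clf.score(X_test, y_test)
```

In the paper's pipeline, one such detector would be trained per prompt-model pair; at attack time, the detector's prediction on a fresh response tells the attacker (with reported >95% accuracy) whether the target model produced it.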
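The abstract quoted in the table claims the leaderboard can be shifted with roughly a thousand targeted votes. A toy simulation illustrates the mechanism: once the attacker can identify the target model's responses, every battle involving the target becomes a guaranteed win. The Elo update below is a standard textbook formulation with an assumed K-factor of 4 and 10 hypothetical equally-skilled models; Chatbot Arena's actual rating procedure and parameters differ.

```python
# Illustrative-only simulation (ours, not from the paper): effect of
# adversarial votes on an online Elo rating. All models are equally good,
# so honest votes are coin flips; the attacker always votes for the
# target model whenever it appears in a battle.
import random

K = 4  # assumed Elo update step; real leaderboard parameters differ


def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def simulate(n_votes, attack, seed=0):
    """Run n_votes random battles; return the target model's final rating."""
    rng = random.Random(seed)
    ratings = {m: 1000.0 for m in range(10)}  # 10 hypothetical models
    target = 0
    for _ in range(n_votes):
        a, b = rng.sample(list(ratings), 2)
        if attack and target in (a, b):
            winner = target  # attacker always votes for the target
        else:
            winner = a if rng.random() < 0.5 else b  # honest coin flip
        loser = b if winner == a else a
        e_win = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - e_win)
        ratings[loser] -= K * (1.0 - e_win)
    return ratings[target]
```

Under these assumptions, about a thousand votes (of which only the battles involving the target are adversarial) visibly inflate the target's rating relative to the honest baseline, consistent with the order of magnitude the abstract reports.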