Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards

Authors: Yangsibo Huang, Milad Nasr, Anastasios Nikolas Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Liu, Ion Stoica, Florian Tramèr, Chiyuan Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than 95% accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increases the cost of such attacks. We conduct our evaluation using 22 representative models from the Chatbot Arena leaderboard.
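The two-step attack quoted above (deanonymize the reply, then vote accordingly) can be illustrated with a minimal simulation. This is a hypothetical sketch, not the paper's code: it assumes a detector stand-in with 95% accuracy, a simplified Elo-style rating update, and illustrative names (`detect_model`, `K = 4`) throughout.

```python
import random

K = 4  # Elo update step (illustrative constant, not from the paper)

def detect_model(response, target="target-model", accuracy=0.95):
    """Stand-in for the paper's reply-deanonymization classifier:
    returns the true model with the assumed 95% accuracy."""
    true_model = response["model"]
    if random.random() < accuracy:
        return true_model
    return "other-model" if true_model == target else target

def adversarial_vote(battle, target="target-model"):
    """Step 2 of the attack: vote for whichever side the detector
    identifies as the target model."""
    if detect_model(battle["a"], target) == target:
        return "a"
    return "b"

def update_elo(ratings, model_a, model_b, winner):
    """Simplified Elo update for one battle."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
    score_a = 1.0 if winner == "a" else 0.0
    ratings[model_a] += K * (score_a - expected_a)
    ratings[model_b] -= K * (score_a - expected_a)

random.seed(0)
ratings = {"target-model": 1000.0, "other-model": 1000.0}
for _ in range(1000):  # roughly the vote budget the paper reports
    a_model = random.choice(["target-model", "other-model"])
    b_model = "other-model" if a_model == "target-model" else "target-model"
    battle = {"a": {"model": a_model}, "b": {"model": b_model}}
    winner = adversarial_vote(battle)
    update_elo(ratings, a_model, b_model, winner)

print(ratings["target-model"] > ratings["other-model"])  # target rating inflated
```

Because the detector is right ~95% of the time, the adversary's votes are strongly correlated with the target side, so the target's rating drifts upward over the thousand simulated battles.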
Researcher Affiliation Collaboration ¹Google, ²UC Berkeley, ³Anthropic, ⁴Carnegie Mellon University, ⁵Stanford University, ⁶ETH Zurich.
Pseudocode No The paper describes the attack methodology and mitigations in natural language. There are no explicitly labeled sections or figures for pseudocode or algorithms.
Open Source Code No The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets Yes Table 1. Types of prompts used to build the training-based detector, their sources, and corresponding examples. Category | Source | Type | Example: Normal chat, high-resource languages | LMSYS-Chat-1M (Zheng et al., 2023a) | English | "How can identity protection services help protect me against identity thef [...]"; Specialty chat | Alpaca; Code (Hendrycks et al., 2021) | Coding | [...]; MATH (Hendrycks et al., 2021) | Math | [...]; AdvBench (Zou et al., 2023) | Safety-violating | [...]
Dataset Splits Yes We then train a logistic regression classifier for each prompt-model pair (P, M) using an 80/20 train/test split.
Hardware Specification No The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments or simulations. It mentions querying model providers' APIs but not their own computational resources.
Software Dependencies Yes We use the logistic regression model from the scikit-learn library8 with its default hyperparameters and a random state set to 42. (footnote 8: scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html)
Experiment Setup Yes We sample 200 prompts per category and gather 50 responses per model for each prompt (details on model access and decoding parameters are provided in Appendix C.1). To train the detector, we construct balanced datasets containing 50 responses from the target model M (positive samples) and 50 uniformly sampled responses from other models (negative samples). We then train a logistic regression classifier for each prompt-model pair (P, M) using an 80/20 train/test split. We use the logistic regression model from the scikit-learn library8 with its default hyperparameters and a random state set to 42.
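The detector-training setup quoted above (balanced 50/50 positives and negatives per prompt-model pair, an 80/20 train/test split, and scikit-learn's `LogisticRegression` with default hyperparameters and `random_state=42`) can be sketched as follows. The toy response texts and the TF-IDF featurization are assumptions for illustration; the paper does not specify its text features here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data for one prompt-model pair (P, M): 50 responses from the
# target model M (positives) and 50 from other models (negatives).
positives = [f"measured reply about topic {i}" for i in range(50)]
negatives = [f"different wording on topic {i}" for i in range(50)]
texts = positives + negatives
labels = [1] * 50 + [0] * 50

# 80/20 train/test split, matching the quoted setup.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Default-hyperparameter logistic regression with random_state=42, as quoted;
# the TF-IDF step in front of it is our assumption.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

In the paper's setting one such classifier is trained per prompt-model pair, so the full detector is an ensemble of many small models like this one.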