Improving Your Model Ranking on Chatbot Arena by Vote Rigging

Authors: Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on around 1.7 million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging.
Researcher Affiliation | Collaboration | (1) Sea AI Lab; (2) Hong Kong University of Science and Technology; (3) Pennsylvania State University. Correspondence to: Tianyu Pang <EMAIL>, Minhao Cheng <EMAIL>.
Pseudocode | Yes | Algorithm 1: Vote-filtering Strategy. Input: collected voting records V; historical voting records V_H; threshold τ. Output: filtered voting records V_F.
Open Source Code | Yes | Code is publicly available to reproduce all experiments.
Open Datasets | Yes | To prevent contaminating the actual voting records on the Chatbot Arena platform, we set up a reproducible voting environment using the latest historical votes (as of January 2025) that are publicly available in the Chatbot Arena Notebook. This dataset contains around 1.7 million voting records across 129 models.
Dataset Splits | Yes | Within this environment, we divide 90% of the complete historical vote records as V_H and the remainder as V_O throughout all simulations.
Hardware Specification | Yes | The fine-tuning process includes 20 epochs with a batch size of 64 which takes a few hours on 2 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions "RoBERTa-based classifiers (Liu et al., 2019)" but does not provide specific version numbers for the software or libraries used, such as RoBERTa itself, Python, or a deep learning framework like PyTorch or TensorFlow.
Experiment Setup | Yes | The fine-tuning process includes 20 epochs with a batch size of 64 which takes a few hours on 2 NVIDIA A100 GPUs.
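
The dataset split reported above (90% of the historical votes as V_H, the remainder as V_O) can be illustrated with a minimal sketch. This is not the authors' code: the function name split_votes, the vote tuple layout, and the choice to keep the input order rather than shuffle are all assumptions made for illustration, since the summary does not say how the 90% is selected.

```python
from typing import List, Sequence, Tuple, TypeVar

Vote = TypeVar("Vote")


def split_votes(
    votes: Sequence[Vote], historical_fraction: float = 0.9
) -> Tuple[List[Vote], List[Vote]]:
    """Split vote records into (V_H, V_O).

    Assumption: the first `historical_fraction` of the records, in the
    order given, form V_H and the rest form V_O. The summary does not
    state whether the paper splits in order or at random.
    """
    if not 0.0 < historical_fraction < 1.0:
        raise ValueError("historical_fraction must be strictly between 0 and 1")
    cut = int(len(votes) * historical_fraction)
    return list(votes[:cut]), list(votes[cut:])


# Hypothetical vote records: (model_a, model_b, outcome).
example_votes = [("model_a", "model_b", "a_wins") for _ in range(10)]
v_h, v_o = split_votes(example_votes)
print(len(v_h), len(v_o))
```

Keeping the split deterministic makes simulations repeatable across runs; a seeded shuffle before the call would be the natural variant if a random split is wanted.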