Improving Your Model Ranking on Chatbot Arena by Vote Rigging
Authors: Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on around 1.7 million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. |
| Researcher Affiliation | Collaboration | 1Sea AI Lab 2Hong Kong University of Science and Technology 3Pennsylvania State University. Correspondence to: Tianyu Pang <EMAIL>, Minhao Cheng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 — Vote-filtering Strategy. Input: collected voting records V; historical voting records VH; threshold τ. Output: filtered voting records VF. |
| Open Source Code | Yes | Code is publicly available to reproduce all experiments. |
| Open Datasets | Yes | To prevent contaminating the actual voting records on the Chatbot Arena platform, we set up a reproducible voting environment using the latest historical votes (as of January 2025) that are publicly available in the Chatbot Arena Notebook. This dataset contains around 1.7 million voting records across 129 models. |
| Dataset Splits | Yes | Within this environment, we use 90% of the complete historical voting records as VH and the remainder as VO throughout all simulations. |
| Hardware Specification | Yes | The fine-tuning process includes 20 epochs with a batch size of 64 which takes a few hours on 2 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions "RoBERTa-based classifiers (Liu et al., 2019)" but does not provide specific version numbers for the software or libraries used, such as RoBERTa itself, Python, or a deep learning framework like PyTorch or TensorFlow. |
| Experiment Setup | Yes | The fine-tuning process includes 20 epochs with a batch size of 64 which takes a few hours on 2 NVIDIA A100 GPUs. |
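The vote-filtering strategy (Algorithm 1 above) takes collected voting records V, historical records VH, and a threshold τ, and emits the filtered records VF. The paper's defense scores each new vote with a RoBERTa-based classifier; the sketch below substitutes a simple frequency heuristic (agreement with historical outcomes for the same model pairing) so it stays self-contained. The `VoteRecord` fields, the `anomaly_score` heuristic, and the default τ are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoteRecord:
    model_a: str
    model_b: str
    winner: str  # "model_a", "model_b", or "tie"

def anomaly_score(vote, historical_votes):
    """Hypothetical scorer: disagreement of this vote with historical
    votes on the same model pairing. The paper instead trains a
    RoBERTa-based classifier; this heuristic only stands in for it."""
    same_pair = [v for v in historical_votes
                 if {v.model_a, v.model_b} == {vote.model_a, vote.model_b}]
    if not same_pair:
        return 0.0  # no history for this pairing: treat as benign
    agree = sum(v.winner == vote.winner for v in same_pair)
    return 1.0 - agree / len(same_pair)

def filter_votes(collected, historical, tau=0.9):
    """Algorithm 1 skeleton: keep votes whose anomaly score is below
    the threshold tau, discarding suspected rigged votes."""
    return [v for v in collected if anomaly_score(v, historical) < tau]
```

For example, with a history of ten votes on the pair (a, b) where model a wins nine times, a new vote for a scores 0.1 and passes a threshold of 0.5, while a new vote for b scores 0.9 and is filtered out.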