Improving Your Model Ranking on Chatbot Arena by Vote Rigging
Authors: Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on around 1.7 million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. |
| Researcher Affiliation | Collaboration | 1Sea AI Lab 2Hong Kong University of Science and Technology 3Pennsylvania State University. Correspondence to: Tianyu Pang <EMAIL>, Minhao Cheng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 — Vote-filtering Strategy. Input: collected voting records V; historical voting records VH; threshold τ. Output: filtered voting records VF. |
| Open Source Code | Yes | Code is publicly available to reproduce all experiments. |
| Open Datasets | Yes | To prevent contaminating the actual voting records on the Chatbot Arena platform, we set up a reproducible voting environment using the latest historical votes (as of January 2025) that are publicly available in the Chatbot Arena Notebook. This dataset contains around 1.7 million voting records across 129 models. |
| Dataset Splits | Yes | Within this environment, we use 90% of the complete historical voting records as VH and the remainder as VO throughout all simulations. |
| Hardware Specification | Yes | The fine-tuning process includes 20 epochs with a batch size of 64 which takes a few hours on 2 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions "RoBERTa-based classifiers (Liu et al., 2019)" but does not provide specific version numbers for the software or libraries used, such as RoBERTa itself, Python, or a deep learning framework like PyTorch or TensorFlow. |
| Experiment Setup | Yes | The fine-tuning process includes 20 epochs with a batch size of 64 which takes a few hours on 2 NVIDIA A100 GPUs. |
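The vote-filtering strategy (Algorithm 1 above) takes collected voting records V, historical records VH, and a threshold τ, and emits the filtered records VF. The paper's defense scores each new vote with a RoBERTa-based classifier; the sketch below substitutes a simple frequency heuristic (agreement with historical outcomes for the same model pairing) so it stays self-contained. The `VoteRecord` fields, the `anomaly_score` heuristic, and the default τ are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoteRecord:
    model_a: str
    model_b: str
    winner: str  # "model_a", "model_b", or "tie"

def anomaly_score(vote, historical_votes):
    """Hypothetical scorer: disagreement of this vote with historical
    votes on the same model pairing. The paper instead trains a
    RoBERTa-based classifier; this heuristic only stands in for it."""
    same_pair = [v for v in historical_votes
                 if {v.model_a, v.model_b} == {vote.model_a, vote.model_b}]
    if not same_pair:
        return 0.0  # no history for this pairing: treat as benign
    agree = sum(v.winner == vote.winner for v in same_pair)
    return 1.0 - agree / len(same_pair)

def filter_votes(collected, historical, tau=0.9):
    """Algorithm 1 skeleton: keep votes whose anomaly score is below
    the threshold tau, discarding suspected rigged votes."""
    return [v for v in collected if anomaly_score(v, historical) < tau]
```

For example, with a history of ten votes on the pair (a, b) where model a wins nine times, a new vote for a scores 0.1 and passes a threshold of 0.5, while a new vote for b scores 0.9 and is filtered out.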