am-ELO: A Stable Framework for Arena-based LLM Evaluation

Authors: Zirui Liu, Jiatong Li, Yan Zhuang, Qi Liu, Shuanghong Shen, Jie Ouyang, Mingyue Cheng, Shijin Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through experiments on real-world datasets, we demonstrate that our framework effectively models annotators while ensuring the consistency of ELO scores. Furthermore, in simulation experiments, our method not only identifies anomalous annotators but also reduces the inconsistency of ELO scores to 30% compared to the traditional ELO method.
Researcher Affiliation Collaboration 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China; 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China; 3iFLYTEK Co., Ltd, Hefei, China. Correspondence to: Qi Liu <EMAIL>.
Pseudocode Yes Algorithm 1: The Traditional ELO Rating System; Algorithm 2: The am-ELO Rating System; Algorithm 3: The Stable Arena Framework.
Open Source Code Yes The code is available on GitHub: https://github.com/bigdata-ustc/am-ELO.
Open Datasets Yes We conduct experiments on a real annotation dataset, Chatbot (Zheng et al., 2023), which was collected from 13,000 distinct IP addresses in the Chatbot Arena between April and June 2023.
Dataset Splits No The paper does not explicitly provide details about training/test/validation dataset splits, only mentioning filtering annotators with fewer than 50 annotated records.
Hardware Specification No The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies No The paper mentions using a 'gradient descent (GD) approach' but does not list specific software dependencies with version numbers.
Experiment Setup Yes For the iterative ELO method, we perform repeated experiments by shuffling the dataset 1000 times and averaging the results. The MLE is solved using the gradient descent (GD) approach with a learning rate of 0.1 and a fixed number of 2000 iterations.
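The traditional iterative ELO method referenced above (Algorithm 1 in the paper) follows the standard formulation. A minimal sketch, not taken from the paper's repository; the K-factor of 32 is a conventional default, not a value stated in this report:

```python
def expected_score(r_a, r_b):
    """Expected score of player A under the standard base-10 logistic ELO curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome, k=32):
    """One iterative ELO update.

    outcome is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k=32 is a common default (assumption, not from the report).
    """
    e_a = expected_score(r_a, r_b)
    return (r_a + k * (outcome - e_a),
            r_b + k * ((1.0 - outcome) - (1.0 - e_a)))
```

Because each update depends on the order in which comparisons arrive, the setup above shuffles the dataset 1000 times and averages the resulting scores; this order dependence is the instability the paper's MLE-based approach addresses.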
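The MLE-based scoring described in the setup (gradient descent, learning rate 0.1, 2000 iterations) can be sketched as gradient ascent on a Bradley-Terry log-likelihood over pairwise outcomes. This is only an illustration of that general technique under the stated hyperparameters; it does not reproduce am-ELO's annotator-ability modeling, and all function names are hypothetical:

```python
import numpy as np

def elo_mle(results, n_models, lr=0.1, iters=2000):
    """Fit scores by gradient ascent on a Bradley-Terry log-likelihood.

    results: list of (winner_idx, loser_idx) pairwise comparisons.
    Scores are on a natural-log scale; multiplying by 400/ln(10)
    would map them onto the conventional ELO scale.
    """
    theta = np.zeros(n_models)
    for _ in range(iters):
        grad = np.zeros(n_models)
        for w, l in results:
            # P(winner beats loser) = sigmoid of the score gap.
            p = 1.0 / (1.0 + np.exp(theta[l] - theta[w]))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        theta += lr * grad
        theta -= theta.mean()  # fix the translation invariance of the scores
    return theta
```

Unlike the iterative update, the maximum-likelihood estimate depends only on the set of comparisons, not on their order, which is why no shuffling-and-averaging step is needed for it.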