am-ELO: A Stable Framework for Arena-based LLM Evaluation

Authors: Zirui Liu, Jiatong Li, Yan Zhuang, Qi Liu, Shuanghong Shen, Jie Ouyang, Mingyue Cheng, Shijin Wang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Through experiments on real-world datasets, we demonstrate that our framework effectively models annotators while ensuring the consistency of ELO scores. Furthermore, in simulation experiments, our method not only identifies anomalous annotators but also reduces the inconsistency of ELO scores to 30% compared to the traditional ELO method.
Researcher Affiliation Collaboration 1State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China; 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China; 3iFLYTEK Co., Ltd, Hefei, China. Correspondence to: Qi Liu <EMAIL>.
Pseudocode Yes Algorithm 1: The Traditional ELO Rating System; Algorithm 2: The am-ELO Rating System; Algorithm 3: The Stable Arena Framework.
Open Source Code Yes The code is available on GitHub: https://github.com/bigdata-ustc/am-ELO.
Open Datasets Yes We conduct experiments on a real annotation dataset, Chatbot (Zheng et al., 2023), which was collected from 13,000 distinct IP addresses in the Chatbot Arena between April and June 2023.
Dataset Splits No The paper does not explicitly provide details about training/test/validation dataset splits, only mentioning filtering annotators with fewer than 50 annotated records.
Hardware Specification No The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies No The paper mentions using a 'gradient descent (GD) approach' but does not list specific software dependencies with version numbers.
Experiment Setup Yes For the iterative ELO method, we perform repeated experiments by shuffling the dataset 1000 times and averaging the results. The MLE is solved using the gradient descent (GD) approach with a learning rate of 0.1 and a fixed number of 2000 iterations.
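The traditional iterative ELO method referenced above (Algorithm 1 in the paper) follows the standard formulation. A minimal sketch, not taken from the paper's repository; the K-factor of 32 is a conventional default, not a value stated in this report:

```python
def expected_score(r_a, r_b):
    """Expected score of player A under the standard base-10 logistic ELO curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, outcome, k=32):
    """One iterative ELO update.

    outcome is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k=32 is a common default (assumption, not from the report).
    """
    e_a = expected_score(r_a, r_b)
    return (r_a + k * (outcome - e_a),
            r_b + k * ((1.0 - outcome) - (1.0 - e_a)))
```

Because each update depends on the order in which comparisons arrive, the setup above shuffles the dataset 1000 times and averages the resulting scores; this order dependence is the instability the paper's MLE-based approach addresses.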
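The MLE-based scoring described in the setup (gradient descent, learning rate 0.1, 2000 iterations) can be sketched as gradient ascent on a Bradley-Terry log-likelihood over pairwise outcomes. This is only an illustration of that general technique under the stated hyperparameters; it does not reproduce am-ELO's annotator-ability modeling, and all function names are hypothetical:

```python
import numpy as np

def elo_mle(results, n_models, lr=0.1, iters=2000):
    """Fit scores by gradient ascent on a Bradley-Terry log-likelihood.

    results: list of (winner_idx, loser_idx) pairwise comparisons.
    Scores are on a natural-log scale; multiplying by 400/ln(10)
    would map them onto the conventional ELO scale.
    """
    theta = np.zeros(n_models)
    for _ in range(iters):
        grad = np.zeros(n_models)
        for w, l in results:
            # P(winner beats loser) = sigmoid of the score gap.
            p = 1.0 / (1.0 + np.exp(theta[l] - theta[w]))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        theta += lr * grad
        theta -= theta.mean()  # fix the translation invariance of the scores
    return theta
```

Unlike the iterative update, the maximum-likelihood estimate depends only on the set of comparisons, not on their order, which is why no shuffling-and-averaging step is needed for it.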