A Statistical Framework for Ranking LLM-based Chatbots
Authors: Siavash Ameli, Siyuan Zhuang, Ion Stoica, Michael W. Mahoney
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through rigorous evaluation and extensive experimentation, our framework demonstrates substantial improvements over existing methods in modeling pairwise comparison data. To support reproducibility and practical adoption, we release leaderbot, an open-source Python package implementing our models and analyses. ... 1 INTRODUCTION The rapid advancement of large language models (LLMs) has transformed natural language processing, enabling breakthroughs across diverse tasks. As these models evolve, the need for effective evaluation methods becomes crucial for fostering innovation and ensuring that LLMs align with human preferences. Traditional benchmarks, such as MMLU (Hendrycks et al., 2021) and HumanEval (Chen et al., 2021), play an important role in assessing specific capabilities of LLMs. ... 3 EMPIRICAL EVALUATION OF STATISTICAL MODELS In this section, we evaluate the statistical models introduced earlier using the Chatbot Arena dataset. As of September 2024, the dataset comprises m = 129 competitors, with \|E\| = 3455 unique pairs. The total number of comparisons across all pairs is Σ_{{i,j}∈E} n_ij = 1,374,996, distributed as follows: 43.3% wins, 36.2% losses, and 20.4% ties. We analyzed 30 configurations of the Bradley-Terry, Rao-Kupper, and Davidson models, detailed in Table D.1 in Appendix D.1. These configurations include both the original forms of the models and various generalizations introduced in this work, each assigned a unique ID corresponding to their rows in the table (e.g., Model 1, Model 2, etc.), which we reference throughout this and subsequent sections. For example, Model 1 corresponds to the Bradley-Terry model with ties treated as half a win and half a loss, following Chiang et al. (2024), while Model 4 represents the original Bradley-Terry model without ties. Similarly, Models 7 and 19 correspond to the original Rao-Kupper and Davidson models, respectively, with the remaining configurations representing our proposed generalizations. |
| Researcher Affiliation | Academia | Siavash Ameli ICSI and Department of Statistics University of California, Berkeley EMAIL Siyuan Zhuang Department of Computer Science University of California, Berkeley EMAIL Ion Stoica Department of Computer Science University of California, Berkeley EMAIL Michael W. Mahoney ICSI, LBNL, and Department of Statistics University of California, Berkeley EMAIL |
| Pseudocode | No | The paper describes methods using mathematical equations and text, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps formatted like code within such a block. |
| Open Source Code | Yes | To support reproducibility and broader adoption, we provide leaderbot, an open-source Python package implementing our statistical framework with tools for data processing, model fitting, and visualization. This ensures that all results in this paper are fully reproducible (Appendix G). ... We developed a Python package leaderbot3 that implements the methods presented in this paper. ... leaderbot is available for installation from PyPI at https://pypi.org/project/leaderbot. Documentation and usage instructions can be found at https://leaderbot.org. The source code is available on GitHub at https://github.com/suquark/leaderbot. |
| Open Datasets | Yes | To address this gap, crowdsourced evaluation platforms have emerged, with Chatbot Arena (Chiang et al., 2024; Zheng et al., 2023) standing out as a pioneering framework. By facilitating millions of pairwise comparisons between LLMs based on human judgments, Chatbot Arena has become one of the largest and most credible datasets (Zheng et al., 2024) for chatbot evaluation. ... In this section, we evaluate the statistical models introduced earlier using the Chatbot Arena dataset. |
| Dataset Splits | Yes | To evaluate the models' generalization performance, we trained each model on 90% of the data and tested predictions on the remaining 10%, with the data randomly split into training and test sets. Results for the weighted RMSE are presented in the fifth to eighth columns of Table D.3, while the KL and JS divergences are shown in the ninth and tenth columns, respectively. ... Listing G.3: Evaluating model generalization using train-test split in leaderbot. ... # Split data into training and test sets training_data, test_data = lb.data.split(data, test_ratio=0.1, seed=20) |
| Hardware Specification | Yes | Training time for each model, using an AMD EPYC 7543 processor with 32 cores, is shown in the last column of Table D.1. |
| Software Dependencies | No | The paper mentions developing a Python package `leaderbot` and using the BFGS optimization method. However, it does not provide specific version numbers for ancillary software components like Python itself, or other libraries that `leaderbot` might depend on (e.g., PyTorch, NumPy, SciPy) to fully replicate the environment. |
| Experiment Setup | Yes | We trained these models (except for Models 1 and 4) by maximizing the likelihood function (1) using the BFGS optimization method, while satisfying the constraints in Section 2.5. This optimization method requires both the loss function ℓ(θ) and its Jacobian ∂ℓ(θ)/∂θ, which we analytically derived with respect to all parameters for each model and provided during training. To ensure consistency, we used a tolerance level of tol = 10⁻⁸ for convergence. Parameters were initialized as follows: scores x were initialized randomly while ensuring their sum is zero, diagonals of D were set to m⁻¹, and all other parameters were initialized to zero. |
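The quoted setup (BFGS with an analytically derived Jacobian, a sum-to-zero constraint on the scores, and a tight convergence tolerance) can be sketched for the plain Bradley-Terry model. This is an illustrative reimplementation on synthetic data, not the leaderbot code; all function and variable names here are our own.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic pairwise data: wins[i, j] = number of times competitor i beat j.
rng = np.random.default_rng(0)
m = 5
true_scores = rng.normal(size=m)
true_scores -= true_scores.mean()          # identifiable up to a shift
wins = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        if i != j:
            p_ij = 1.0 / (1.0 + np.exp(true_scores[j] - true_scores[i]))
            wins[i, j] = rng.binomial(100, p_ij)

def neg_log_likelihood(x):
    # Bradley-Terry: P(i beats j) = sigmoid(x_i - x_j).
    diff = x[:, None] - x[None, :]
    log_p = -np.log1p(np.exp(-diff))       # diagonal terms get zero weight
    return -np.sum(wins * log_p)

def neg_gradient(x):
    # Analytic Jacobian of the negative log-likelihood, supplied to BFGS
    # as in the paper's training setup.
    diff = x[:, None] - x[None, :]
    p = 1.0 / (1.0 + np.exp(-diff))        # p[i, j] = P(i beats j)
    grad = (wins * (1.0 - p)).sum(axis=1) - (wins.T * p).sum(axis=1)
    return -grad

res = minimize(neg_log_likelihood, np.zeros(m), jac=neg_gradient,
               method="BFGS", tol=1e-8)
scores = res.x - res.x.mean()              # enforce sum-to-zero post hoc
ranking = np.argsort(-scores)              # strongest competitor first
```

One design note: the likelihood is invariant under a common shift of all scores, so some constraint is needed for identifiability. The paper imposes the sum-to-zero constraint during optimization; the sketch above simply centers the scores after convergence, which reaches the same representative of the equivalence class.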
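For intuition on how the tie-aware models referenced throughout the table differ from plain Bradley-Terry, here is a minimal sketch of the original Davidson (1970) outcome probabilities; the function and parameter names are ours, not leaderbot's.

```python
import math

def davidson_probs(x_i, x_j, nu):
    """Win/loss/tie probabilities under the Davidson (1970) tie model.

    x_i, x_j are log-strength scores and nu >= 0 controls tie frequency;
    nu = 0 recovers the plain Bradley-Terry model with no ties.
    """
    pi_i, pi_j = math.exp(x_i), math.exp(x_j)
    tie_mass = nu * math.sqrt(pi_i * pi_j)   # geometric-mean coupling term
    denom = pi_i + pi_j + tie_mass
    return pi_i / denom, pi_j / denom, tie_mass / denom

# Example: a stronger competitor (score 0.5) against a weaker one (-0.5).
win, loss, tie = davidson_probs(0.5, -0.5, nu=0.8)
```

For equal scores the model is symmetric (win = loss) and the tie probability is at its largest, which matches the intuition that evenly matched chatbots tie most often.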