Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Authors: Jasper Dekoninck, Maximilian Baader, Martin Vechev
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a series of experiments with POLYRATING that showcase its ability to quantify the influence of biases on the ratings of the models (§4.1), its improved sample efficiency for various use-cases (§4.2), and its ability to obtain reliable and comparable multivariate leaderboards (§4.3). |
| Researcher Affiliation | Academia | Jasper Dekoninck, Maximilian Baader, Martin Vechev, Department of Computer Science, ETH Zurich, Switzerland |
| Pseudocode | No | The paper describes the POLYRATING model and its optimization objective using mathematical equations but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/eth-sri/polyrating. |
| Open Datasets | Yes | We use the full Chatbot Arena dataset (Chiang et al., 2024b), which contains over one million questions across various tasks. [...] We use the public dataset from Wildbench (Lin et al., 2024) to obtain our LLM-based evaluation. |
| Dataset Splits | No | The paper mentions varying the number of available questions and using training/test splits, but does not provide specific percentages, absolute sample counts, or a detailed methodology (e.g., random seed, stratified splitting) to reproduce these splits. For example, 'We vary the number of available questions from the task and compute the logistic loss with respect to a hidden test set.' |
| Hardware Specification | No | Finally, we note that any run using POLYRATING took at most 6 hours on a single CPU, even for huge datasets with a million samples, 100 models and 10 tasks. While a 'single CPU' is mentioned, no specific model, make, or type of CPU is provided, which is insufficient for detailed hardware specification. |
| Software Dependencies | No | The paper mentions using classifiers from other works (Babakov et al., 2023; Camacho-collados et al., 2022) and links to HuggingFace models in footnotes, but it does not specify the versions of any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow) used in their own implementation. |
| Experiment Setup | Yes | We perform MAP estimation with a normal prior on the weights αj and βm j with mean 0 and deviations σj and σ′j respectively. [...] Specifically, we use Newton's method for the model-specific parameters and L-BFGS for the shared parameters. [...] The standard deviation of the prior on βm task is determined by running cross-validation on the current training set. |
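To make the Experiment Setup row concrete, the following is a minimal, hypothetical sketch of the kind of MAP rating estimation the quote describes: a Bradley-Terry-style pairwise model with a zero-mean normal prior on the ratings, optimized with L-BFGS. All names and the toy battle data are illustrative, not taken from the POLYRATING codebase, and the sketch omits the paper's shared bias parameters and Newton updates.

```python
# Hypothetical sketch of MAP rating estimation in the spirit of POLYRATING:
# a Bradley-Terry likelihood plus a normal prior (mean 0, deviation sigma),
# optimized with L-BFGS as the quoted setup mentions for shared parameters.
import numpy as np
from scipy.optimize import minimize

# Toy pairwise outcomes: (winner_index, loser_index) over 3 models.
battles = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 2), (0, 1)]
n_models = 3
sigma = 1.0  # std. dev. of the zero-mean normal prior on the ratings

def neg_log_posterior(theta):
    # Bradley-Terry likelihood: P(i beats j) = sigmoid(theta_i - theta_j),
    # so each battle contributes -log sigmoid(theta_w - theta_l).
    nll = 0.0
    for w, l in battles:
        nll += np.log1p(np.exp(-(theta[w] - theta[l])))
    # Normal prior with mean 0 -> quadratic penalty on the ratings.
    nll += 0.5 * np.sum(theta ** 2) / sigma ** 2
    return nll

result = minimize(neg_log_posterior, np.zeros(n_models), method="L-BFGS-B")
ratings = result.x - result.x.mean()  # center for identifiability
print(ratings)
```

In this toy data model 0 wins every battle it plays, so its MAP rating comes out highest; the prior's deviation (here a fixed `sigma`) plays the role that cross-validation tunes in the paper.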