Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Authors: Jasper Dekoninck, Maximilian Baader, Martin Vechev
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a series of experiments with POLYRATING that showcase its ability to quantify the influence of biases on the ratings of the models (§4.1), its improved sample efficiency for various use-cases (§4.2), and its ability to obtain reliable and comparable multivariate leaderboards (§4.3). |
| Researcher Affiliation | Academia | Jasper Dekoninck, Maximilian Baader, Martin Vechev, Department of Computer Science, ETH Zurich, Switzerland |
| Pseudocode | No | The paper describes the POLYRATING model and its optimization objective using mathematical equations but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/eth-sri/polyrating. |
| Open Datasets | Yes | We use the full Chatbot Arena dataset (Chiang et al., 2024b), which contains over one million questions across various tasks. [...] We use the public dataset from Wildbench (Lin et al., 2024) to obtain our LLM-based evaluation. |
| Dataset Splits | No | The paper mentions varying the number of available questions and using training/test splits, but does not provide specific percentages, absolute sample counts, or a detailed methodology (e.g., random seed, stratified splitting) to reproduce these splits. For example, 'We vary the number of available questions from the task and compute the logistic loss with respect to a hidden test set.' |
| Hardware Specification | No | Finally, we note that any run using POLYRATING took at most 6 hours on a single CPU, even for huge datasets with a million samples, 100 models and 10 tasks. While a 'single CPU' is mentioned, no specific model, make, or type of CPU is provided, which is insufficient for detailed hardware specification. |
| Software Dependencies | No | The paper mentions using classifiers from other works (Babakov et al., 2023; Camacho-collados et al., 2022) and links to HuggingFace models in footnotes, but it does not specify the versions of any programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow) used in their own implementation. |
| Experiment Setup | Yes | We perform MAP estimation with a normal prior on the weights αj and βm j with mean 0 and deviations σj and σ′j respectively. [...] Specifically, we use Newton's method for the model-specific parameters and L-BFGS for the shared parameters. [...] The standard deviation of the prior on βm task is determined by running cross-validation on the current training set. |
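To make the Experiment Setup row concrete, the following is a minimal, hypothetical sketch of the kind of MAP rating estimation the quote describes: a Bradley-Terry-style pairwise model with a zero-mean normal prior on the ratings, optimized with L-BFGS. All names and the toy battle data are illustrative, not taken from the POLYRATING codebase, and the sketch omits the paper's shared bias parameters and Newton updates.

```python
# Hypothetical sketch of MAP rating estimation in the spirit of POLYRATING:
# a Bradley-Terry likelihood plus a normal prior (mean 0, deviation sigma),
# optimized with L-BFGS as the quoted setup mentions for shared parameters.
import numpy as np
from scipy.optimize import minimize

# Toy pairwise outcomes: (winner_index, loser_index) over 3 models.
battles = [(0, 1), (0, 1), (0, 2), (1, 2), (1, 2), (0, 1)]
n_models = 3
sigma = 1.0  # std. dev. of the zero-mean normal prior on the ratings

def neg_log_posterior(theta):
    # Bradley-Terry likelihood: P(i beats j) = sigmoid(theta_i - theta_j),
    # so each battle contributes -log sigmoid(theta_w - theta_l).
    nll = 0.0
    for w, l in battles:
        nll += np.log1p(np.exp(-(theta[w] - theta[l])))
    # Normal prior with mean 0 -> quadratic penalty on the ratings.
    nll += 0.5 * np.sum(theta ** 2) / sigma ** 2
    return nll

result = minimize(neg_log_posterior, np.zeros(n_models), method="L-BFGS-B")
ratings = result.x - result.x.mean()  # center for identifiability
print(ratings)
```

In this toy data model 0 wins every battle it plays, so its MAP rating comes out highest; the prior's deviation (here a fixed `sigma`) plays the role that cross-validation tunes in the paper.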