Selective Preference Aggregation

Authors: Shreyas Kadekodi, Hayden McTavish, Berk Ustun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "In this work, we conduct an extensive set of experiments on real-world datasets to benchmark our approach and demonstrate its functionality. Our results show how selective rankings can promote transparency and robustness by revealing disagreement and abstaining from arbitration." From Section 5 (Experiments): "In this section, we present an empirical study of selective aggregation on real-world datasets. Our goal is to benchmark the properties and behavior of selective rankings with respect to existing approaches in terms of transparency, robustness, and versatility."
Researcher Affiliation: Academia. "Shreyas Kadekodi 1*, Hayden McTavish 2*, Berk Ustun 1. *Equal contribution. 1 UCSD, 2 Duke University. Correspondence to: Berk Ustun <EMAIL>." UCSD and Duke University are academic institutions, and the email domain @ucsd.edu further confirms an academic affiliation.
Pseudocode: Yes. "Algorithm 1: Selective Preference Aggregation. Algorithm 2: Solution Path Algorithm."
Open Source Code: Yes. "We provide an open-source Python library for selective preference aggregation, available on GitHub and installable via pip install selectiverank. We include additional results in Appendix D, and code to reproduce our results on GitHub."
Open Datasets: Yes. "We work with 5 preference datasets from different domains listed in Table 1. Each dataset encodes user preferences over items as votes, ratings, or rankings. We convert preferences to pairwise comparisons with ties and build rankings using our approach and baselines." The datasets are nba [49], survivor [51], lawschool [44], csrankings [11], and sushi [39]. "We also work with the DICES dataset [7]."
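The conversion the authors describe, from per-user preferences to pairwise comparisons with ties, can be sketched as below. The input format (a dict mapping items to numeric scores) and the function name are assumptions for illustration, not the library's actual API:

```python
from itertools import combinations

def to_pairwise(ratings):
    """Convert one user's item ratings into pairwise comparisons with ties.

    `ratings` maps item -> numeric score (hypothetical input format).
    Returns (item_a, item_b, outcome) triples where outcome is 1 if a is
    preferred, -1 if b is preferred, and 0 for a tie.
    """
    comparisons = []
    for a, b in combinations(sorted(ratings), 2):
        if ratings[a] > ratings[b]:
            comparisons.append((a, b, 1))
        elif ratings[a] < ratings[b]:
            comparisons.append((a, b, -1))
        else:
            comparisons.append((a, b, 0))  # equal scores become an explicit tie
    return comparisons
```

Representing ties explicitly (rather than dropping tied pairs) matters here, since the paper's exact approach is described as one that "handles ties".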
Dataset Splits: Yes. "We randomly split users into two groups: a group of ptrain = 5 users whose labels we use to train our model; and a group of ptest = 118 users whose labels we use to evaluate the predictions of the model at an individual level once it is deployed. All experiments used 5-fold cross-validation on the training split."
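The reported user split (ptrain = 5 training users, ptest = 118 test users) could be reproduced with a sketch like the following; `split_users` and its seed handling are illustrative, not the authors' code:

```python
import random

def split_users(user_ids, p_train=5, p_test=118, seed=0):
    """Randomly partition users into disjoint train/test groups.

    Group sizes default to the split reported in the paper; the function
    name and seeding scheme are hypothetical.
    """
    rng = random.Random(seed)
    shuffled = rng.sample(list(user_ids), k=len(user_ids))
    if len(shuffled) < p_train + p_test:
        raise ValueError("not enough users for the requested split")
    return shuffled[:p_train], shuffled[p_train:p_train + p_test]
```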
Hardware Specification: Yes. "All results reflect timings on a consumer-grade CPU with 2.3 GHz and 16 GB RAM." Also: "In our experiments, we are able to recover a certifiably optimal ranking quickly for 4/5 datasets using a commercial solver on a single-core CPU with 128 GB RAM."
Software Dependencies: Yes. "We report results for an exact approach that handles ties and returns a certifiably optimal ranking by solving an integer program using CPLEX v22 [35]."
Experiment Setup: Yes. "We fine-tuned a BERT-Mini model; all fine-tuning experiments used 5-fold cross-validation on the training split. We optimized with a learning rate of 2e-5 for up to 25 epochs, employing early stopping. We trained in mini-batches of size 16 and enabled oversampling of minority classes in each batch."
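The reported fine-tuning hyperparameters can be collected into a config, and minority-class oversampling is commonly implemented via inverse-frequency sampling weights. The paper does not specify its exact oversampling scheme, so `oversampling_weights` (and all variable names here) are assumptions:

```python
from collections import Counter

# Hyperparameters as reported in the paper; the dict keys are illustrative.
FINETUNE_CONFIG = {
    "model": "bert-mini",
    "learning_rate": 2e-5,
    "max_epochs": 25,
    "batch_size": 16,
    "early_stopping": True,
    "cv_folds": 5,
}

def oversampling_weights(labels):
    """Inverse-frequency weight per example, so minority-class examples
    are drawn more often when sampling mini-batches.

    One common way to oversample (e.g. feeding these weights to a
    weighted sampler); the paper's exact scheme is not stated.
    """
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]
```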