Selective Preference Aggregation

Authors: Shreyas Kadekodi, Hayden McTavish, Berk Ustun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "In this work, we conduct an extensive set of experiments on real-world datasets to benchmark our approach and demonstrate its functionality. Our results show how selective rankings can promote transparency and robustness by revealing disagreement and abstaining from arbitration." From Section 5 (Experiments): "In this section, we present an empirical study of selective aggregation on real-world datasets. Our goal is to benchmark the properties and behavior of selective rankings with respect to existing approaches in terms of transparency, robustness, and versatility."
Researcher Affiliation: Academia. "Shreyas Kadekodi 1*, Hayden McTavish 2*, Berk Ustun 1. *Equal contribution. 1 UCSD, 2 Duke University. Correspondence to: Berk Ustun <EMAIL>." UCSD and Duke University are academic institutions, and the email domain @ucsd.edu further confirms an academic affiliation.
Pseudocode: Yes. "Algorithm 1: Selective Preference Aggregation. Algorithm 2: Solution Path Algorithm."
Open Source Code: Yes. "We provide an open-source Python library for selective preference aggregation, available on GitHub and installable via pip install selectiverank. We include additional results in Appendix D, and code to reproduce our results on GitHub."
Open Datasets: Yes. "We work with 5 preference datasets from different domains listed in Table 1. Each dataset encodes user preferences over items as votes, ratings, or rankings. We convert preferences to pairwise comparisons with ties and build rankings using our approach and baselines." The datasets are nba [49], survivor [51], lawschool [44], csrankings [11], and sushi [39]. "We also work with the DICES dataset [7]."
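The conversion the authors describe, from per-user preferences to pairwise comparisons with ties, can be sketched as below. The input format (a dict mapping items to numeric scores) and the function name are assumptions for illustration, not the library's actual API:

```python
from itertools import combinations

def to_pairwise(ratings):
    """Convert one user's item ratings into pairwise comparisons with ties.

    `ratings` maps item -> numeric score (hypothetical input format).
    Returns (item_a, item_b, outcome) triples where outcome is 1 if a is
    preferred, -1 if b is preferred, and 0 for a tie.
    """
    comparisons = []
    for a, b in combinations(sorted(ratings), 2):
        if ratings[a] > ratings[b]:
            comparisons.append((a, b, 1))
        elif ratings[a] < ratings[b]:
            comparisons.append((a, b, -1))
        else:
            comparisons.append((a, b, 0))  # equal scores become an explicit tie
    return comparisons
```

Representing ties explicitly (rather than dropping tied pairs) matters here, since the paper's exact approach is described as one that "handles ties".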
Dataset Splits: Yes. "We randomly split users into two groups: a group of ptrain = 5 users whose labels we use to train our model; and a group of ptest = 118 users whose labels we use to evaluate the predictions of the model at an individual level once it is deployed. All experiments used 5-fold cross-validation on the training split."
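The reported user split (ptrain = 5 training users, ptest = 118 test users) could be reproduced with a sketch like the following; `split_users` and its seed handling are illustrative, not the authors' code:

```python
import random

def split_users(user_ids, p_train=5, p_test=118, seed=0):
    """Randomly partition users into disjoint train/test groups.

    Group sizes default to the split reported in the paper; the function
    name and seeding scheme are hypothetical.
    """
    rng = random.Random(seed)
    shuffled = rng.sample(list(user_ids), k=len(user_ids))
    if len(shuffled) < p_train + p_test:
        raise ValueError("not enough users for the requested split")
    return shuffled[:p_train], shuffled[p_train:p_train + p_test]
```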
Hardware Specification: Yes. "All results reflect timings on a consumer-grade CPU with 2.3 GHz and 16 GB RAM." Also: "In our experiments, we are able to recover a certifiably optimal ranking quickly for 4/5 datasets using a commercial solver on a single-core CPU with 128 GB RAM."
Software Dependencies: Yes. "We report results for an exact approach that handles ties and returns a certifiably optimal ranking by solving an integer program using CPLEX v22 [35]."
Experiment Setup: Yes. "We fine-tuned a BERT-Mini model; all fine-tuning experiments used 5-fold cross-validation on the training split. We optimized with a learning rate of 2e-5 for up to 25 epochs, employing early stopping. We trained in mini-batches of size 16 and enabled oversampling of minority classes in each batch."
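The reported fine-tuning hyperparameters can be collected into a config, and minority-class oversampling is commonly implemented via inverse-frequency sampling weights. The paper does not specify its exact oversampling scheme, so `oversampling_weights` (and all variable names here) are assumptions:

```python
from collections import Counter

# Hyperparameters as reported in the paper; the dict keys are illustrative.
FINETUNE_CONFIG = {
    "model": "bert-mini",
    "learning_rate": 2e-5,
    "max_epochs": 25,
    "batch_size": 16,
    "early_stopping": True,
    "cv_folds": 5,
}

def oversampling_weights(labels):
    """Inverse-frequency weight per example, so minority-class examples
    are drawn more often when sampling mini-batches.

    One common way to oversample (e.g. feeding these weights to a
    weighted sampler); the paper's exact scheme is not stated.
    """
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]
```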