Conformal Prediction Sets Can Cause Disparate Impact

Authors: Jesse Cresswell, Bhargava Kumar, Yi Sui, Mouloud Belbahri

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through pre-registered, randomized controlled trials with human participants, we find that prediction sets can lead to disparate impact: the increases in accuracy compared to the control population are not equal across groups (Figure 1).
Researcher Affiliation | Industry | Jesse C. Cresswell (Layer 6 AI), Bhargava Kumar (TD Securities), Yi Sui (Layer 6 AI), Mouloud Belbahri (Layer 6 AI)
Pseudocode | Yes | Algorithm 1: Average-k set prediction
Open Source Code | Yes | Additional details on tasks, datasets, models, and set prediction methods are given in Appendix A. Our code is available at github.com/layer6ai-labs/conformal-prediction-fairness.
Open Datasets | Yes | Using open-access datasets from the machine learning fairness literature, we created three tasks where human decision makers could potentially take advantage of model assistance. Image Classification ... FACET dataset (Gustafson et al., 2023) ... Text Classification ... Bios Bias dataset (De-Arteaga et al., 2019) ... Audio Emotion Recognition ... RAVDESS dataset (Livingstone & Russo, 2018)
Dataset Splits | Yes | We used the 20 most common classes and split the dataset into D_cal, D_calval, and D_test stratified by class. ... We selected 10 of the most common occupations and then split the dataset into D_train, D_val, D_cal, D_calval, and D_test, ensuring class balance. ... We partitioned the dataset into D_cal, D_calval, and D_test ensuring stratification by class (emotion) and group (binary gender).
Hardware Specification | Yes | We used an Intel Xeon Silver 4114 CPU and TITAN V GPU, which took in total less than 1 hour to process all three datasets.
Software Dependencies | No | The paper mentions several software tools and libraries, including the "statsmodels Python package", "PsychoPy (Peirce et al., 2019)", "Pavlovia (Pavlovia, 2024)", the "Optuna library (Akiba et al., 2019b)", and "Hugging Face (Fadel, 2023)". However, it does not give version numbers for any of them, which reproducibility requires.
Experiment Setup | Yes | For all three tasks, we aimed to compare disparate impact between avg-k, marginal, and conditional prediction sets with target 90% coverage. ... The hyperparameters of these score functions were tuned on Dcalval with 50 iterations to minimize average set size using Optuna (Akiba et al., 2019a). ... Table 7: Hyperparameter Settings for Each Dataset After Tuning
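The Experiment Setup entry refers to marginal prediction sets with a 90% coverage target. As a rough illustration of what that means, the sketch below calibrates a threshold on a held-out calibration split and builds prediction sets on a test split using standard split conformal prediction. It is not the paper's implementation: the score function (one minus the softmax probability of the true class), the synthetic data generator `synthetic_example`, and all function names are assumptions made for illustration. The paper tunes its own score functions, which are not reproduced here.

```python
import math
import random


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    # Rank of the order statistic used by split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: include every class
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, qhat):
    """All classes whose score 1 - p does not exceed the calibrated threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]


def synthetic_example(rng, n_classes=5):
    """Hypothetical stand-in for a model's softmax output and a true label."""
    logits = [rng.gauss(0.0, 1.0) for _ in range(n_classes)]
    z = sum(math.exp(v) for v in logits)
    probs = [math.exp(v) / z for v in logits]
    label = rng.choices(range(n_classes), weights=probs)[0]
    return probs, label


rng = random.Random(0)
cal = [synthetic_example(rng) for _ in range(1000)]   # plays the role of D_cal
test = [synthetic_example(rng) for _ in range(1000)]  # plays the role of D_test

alpha = 0.1  # target 90% coverage
cal_scores = [1.0 - probs[y] for probs, y in cal]
qhat = conformal_quantile(cal_scores, alpha)

sets = [prediction_set(probs, qhat) for probs, _ in test]
coverage = sum(y in s for s, (_, y) in zip(sets, test)) / len(test)
avg_size = sum(len(s) for s in sets) / len(test)
print(f"coverage={coverage:.3f}, average set size={avg_size:.2f}")
```

Empirical coverage on the test split should land near the 90% target, while average set size depends on how confident the model is. Group-conditional variants would calibrate a separate threshold per group, which this sketch does not attempt.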