reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Conformal Prediction Sets Can Cause Disparate Impact

Authors: Jesse Cresswell, Bhargava Kumar, Yi Sui, Mouloud Belbahri

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through pre-registered, randomized controlled trials with human participants, we find that prediction sets can lead to disparate impact: the increases in accuracy compared to the control population are not equal across groups (Figure 1).
Researcher Affiliation	Industry	Jesse C. Cresswell (Layer 6 AI); Bhargava Kumar (TD Securities); Yi Sui (Layer 6 AI); Mouloud Belbahri (Layer 6 AI)
Pseudocode	Yes	Algorithm 1: Average-k set prediction
Open Source Code	Yes	Additional details on tasks, datasets, models, and set prediction methods are given in Appendix A. Our code is available at github.com/layer6ai-labs/conformal-prediction-fairness.
Open Datasets	Yes	Using open-access datasets from the machine learning fairness literature, we created three tasks where human decision makers could potentially take advantage of model assistance. Image Classification ... FACET dataset (Gustafson et al., 2023) ... Text Classification ... Bios Bias dataset (De-Arteaga et al., 2019) ... Audio Emotion Recognition ... RAVDESS dataset (Livingstone & Russo, 2018)
Dataset Splits	Yes	We used the 20 most common classes and split the dataset into D_cal, D_calval, and D_test stratified by class. ... We selected 10 of the most common occupations and then split the dataset into D_train, D_val, D_cal, D_calval, and D_test, ensuring class balance. ... We partitioned the dataset into D_cal, D_calval, and D_test ensuring stratification by class (emotion) and group (binary gender).
Hardware Specification	Yes	We used an Intel Xeon Silver 4114 CPU and TITAN V GPU, which took in total less than 1 hour to process all three datasets.
Software Dependencies	No	The paper mentions several software tools and libraries like "statsmodels python package", "PsychoPy (Peirce et al., 2019)", "Pavlovia (Pavlovia, 2024)", "Optuna library (Akiba et al., 2019b)", and "Huggingface (Fadel, 2023)". However, it does not provide specific version numbers for these software components, which is required for reproducibility.
Experiment Setup	Yes	For all three tasks, we aimed to compare disparate impact between avg-k, marginal, and conditional prediction sets with target 90% coverage. ... The hyperparameters of these score functions were tuned on D_calval with 50 iterations to minimize average set size using Optuna (Akiba et al., 2019a). ... Table 7: Hyperparameter Settings for Each Dataset After Tuning
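
For readers unfamiliar with the setup quoted in the Experiment Setup row, here is a minimal sketch of marginal split conformal prediction at a 90% coverage target. It is an illustration under stated assumptions, not the authors' implementation: the score function (one minus the probability of the true class), the synthetic classifier outputs, and the helper names `calibrate_threshold`, `predict_sets`, and `synth` are all hypothetical. The paper's tuned score functions and real datasets are not reproduced here.

```python
import numpy as np


def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Conformal quantile of the scores 1 - p(true class) on a calibration split."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")


def predict_sets(probs, q_hat):
    """Boolean mask of shape (n, classes): class y is included when 1 - p(y) <= q_hat."""
    return (1.0 - probs) <= q_hat


# Synthetic stand-in for classifier outputs (assumption: no real dataset is used).
rng = np.random.default_rng(0)
n_classes = 10


def synth(n):
    labels = rng.integers(0, n_classes, size=n)
    logits = rng.normal(size=(n, n_classes))
    logits[np.arange(n), labels] += 2.0  # make the true class more likely
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs, labels


cal_probs, cal_labels = synth(2000)    # plays the role of the calibration split
test_probs, test_labels = synth(2000)  # plays the role of the test split

q_hat = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
sets = predict_sets(test_probs, q_hat)

coverage = sets[np.arange(len(test_labels)), test_labels].mean()
avg_size = sets.sum(axis=1).mean()
print(f"empirical coverage: {coverage:.3f}, average set size: {avg_size:.2f}")
```

Empirical coverage on the test split should land close to the 90% target, while the average set size is the quantity the quoted setup minimizes during tuning. A group-conditional variant would, under the same assumptions, compute a separate threshold per group.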