Regretful Decisions under Label Noise

Authors: Sujay Nagaraj, Yang Liu, Flavio Calmon, Berk Ustun

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct a comprehensive empirical study on clinical prediction tasks. Our findings highlight the instance-level impact of label noise, and we demonstrate how our approach can support safer inference by flagging potential mistakes.
Researcher Affiliation Academia Sujay Nagaraj (University of Toronto), Yang Liu (UC Santa Cruz), Flavio P. Calmon (Harvard SEAS), Berk Ustun (UC San Diego)
Pseudocode Yes Algorithm 1: Generate Plausible Draws, Datasets, and Models
Input: noisy dataset {(x_i, y_i)}_{i=1}^n, noise model p_{u|y}, number of models m ≥ 1, atypicality ε ∈ [0, 1]
Initialize F̂_ε^plaus ← {}
1: repeat
2:   u_i ~ Bernoulli(q_{u|y,x}) for i ∈ [n]   ▷ generate noise draw by posterior inference
3:   if [u_1, ..., u_n] ∈ U_ε then            ▷ check if draw is plausible using Def. 6
4:     ŷ_i ← y_i ⊕ u_i for i ∈ [n]
5:     D̂ ← {(x_i, ŷ_i)}_{i=1}^n              ▷ construct plausible clean dataset
6:     f̂ ← argmin_{f ∈ F} R̂(f; D̂)           ▷ train plausible model
7:     F̂_ε^plaus ← F̂_ε^plaus ∪ {f̂}          ▷ update set of plausible models
8:   end if
9: until |F̂_ε^plaus| = m
Output: F̂_ε^plaus, a sample of m models from the set of plausible models F_ε^plaus
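Algorithm 1 can be sketched in Python. This is a simplified illustration, not the paper's implementation: the function name `plausible_models`, the per-instance flip probabilities `flip_prob` (standing in for the posterior q_{u|y,x}), and the plausibility check (a crude stand-in for Def. 6 that accepts draws whose empirical flip rate is within ε of the expected rate) are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def plausible_models(X, y_noisy, flip_prob, m=5, eps=0.1, rng=None):
    """Sketch of Algorithm 1: repeatedly sample noise draws, keep the
    plausible ones, and train one model per plausible clean dataset.
    flip_prob is an assumed per-instance posterior flip probability."""
    rng = rng or np.random.default_rng(0)
    n = len(y_noisy)
    models = []
    while len(models) < m:
        u = rng.random(n) < flip_prob          # draw noise indicators u_i
        # Simplified plausibility check (stand-in for Def. 6): accept
        # draws whose flip rate is within eps of the expected rate.
        if abs(u.mean() - flip_prob.mean()) <= eps:
            y_clean = np.logical_xor(y_noisy, u).astype(int)  # y_i XOR u_i
            f = LogisticRegression(max_iter=1000).fit(X, y_clean)
            models.append(f)
    return models
```

Disagreement among the returned models on a given instance then signals a prediction that is sensitive to the unknown noise draw.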
Open Source Code Yes Supporting material and code can be found in Appendix B and on GitHub.
Open Datasets Yes We work with 5 classification datasets from clinical applications where models support individual medical decisions (see Table 3):
- shock_eicu: n = 3,456, d = 104 (Pollard et al. [45])
- shock_mimic: n = 15,254, d = 104 (Johnson et al. [22])
- lungcancer: n = 62,916, d = 40 (NCI [41])
- mortality: n = 20,334, d = 84 (Le Gall et al. [28])
- support: n = 9,696, d = 114 (Knaus et al. [24])
We also use the enhancer dataset from Gschwind et al. [16] to predict the outcome of experiments to discover enhancers...
Dataset Splits Yes We split each dataset into a training sample (80%), which we use to train a logistic regression model (LR) and a neural network (DNN) using noisy labels, and a test sample (20%), which we use to measure out-of-sample performance using true labels.
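The split protocol described above can be sketched as follows. The synthetic data, the 20% noise rate, and the variable names (`y_noisy`, `y_true`) are illustrative assumptions; the key point is that training sees only noisy labels while evaluation uses true labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the clinical datasets (assumed shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)
flips = rng.random(500) < 0.2                      # 20% label noise (assumed)
y_noisy = np.logical_xor(y_true, flips).astype(int)

# 80/20 split: fit on noisy training labels, score on true test labels.
X_tr, X_te, yn_tr, yn_te, yt_tr, yt_te = train_test_split(
    X, y_noisy, y_true, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, yn_tr)
test_acc = clf.score(X_te, yt_te)   # out-of-sample accuracy vs. true labels
```

Passing `y_noisy` and `y_true` together to `train_test_split` keeps the noisy/true label pairs aligned across the split.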
Hardware Specification No The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. The acknowledgments mention general computing resources from the Vector Institute but no specifics.
Software Dependencies No The paper mentions training logistic regression and neural network models but does not provide specific software dependencies, such as library names with version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn versions).
Experiment Setup No The paper mentions the noise rates used for corrupting labels ([5%, 20%, 40%]) and the two training methods (Ignore and Hedge). However, it does not provide specific hyperparameters for the logistic regression and neural network models, such as learning rates, batch sizes, optimizer details, or number of epochs.
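The label-corruption setup at the stated noise rates can be sketched as below. Uniform (class-independent) flipping and the function name `corrupt_labels` are simplifying assumptions; the paper's noise model p_{u|y} may be class-dependent.

```python
import numpy as np

def corrupt_labels(y, noise_rate, rng):
    """Flip each binary label independently with probability noise_rate.
    (Uniform flipping is an illustrative assumption.)"""
    u = rng.random(len(y)) < noise_rate
    return np.logical_xor(y, u).astype(int)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
# Noise rates from the paper: 5%, 20%, 40%.
empirical = {r: (corrupt_labels(y, r, rng) != y).mean()
             for r in [0.05, 0.20, 0.40]}
```

The empirical flip rate in `empirical` should track each target rate up to sampling variation.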