Regretful Decisions under Label Noise

Authors: Sujay Nagaraj, Yang Liu, Flavio Calmon, Berk Ustun

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct a comprehensive empirical study on clinical prediction tasks. Our findings highlight the instance-level impact of label noise, and we demonstrate how our approach can support safer inference by flagging potential mistakes.
Researcher Affiliation Academia Sujay Nagaraj (University of Toronto), Yang Liu (UC Santa Cruz), Flavio P. Calmon (Harvard SEAS), Berk Ustun (UC San Diego)
Pseudocode Yes Algorithm 1: Generate Plausible Draws, Datasets, and Models
Input: noisy dataset {(x_i, y_i)}_{i=1}^n, noise model p_{u|y}, number of models m ≥ 1, atypicality ε ∈ [0, 1]
Initialize F̂_ε^plaus ← {}
1: repeat
2:   u_i ~ Bernoulli(q_{u|y,x}) for i ∈ [n]   ▷ generate noise draw by posterior inference
3:   if [u_1, ..., u_n] ∈ U_ε then            ▷ check if draw is plausible using Def. 6
4:     ŷ_i ← y_i ⊕ u_i for i ∈ [n]
5:     D̂ ← {(x_i, ŷ_i)}_{i=1}^n              ▷ construct plausible clean dataset
6:     f̂ ← argmin_{f ∈ F} R̂(f; D̂)           ▷ train plausible model
7:     F̂_ε^plaus ← F̂_ε^plaus ∪ {f̂}          ▷ update set of plausible models
8:   end if
9: until |F̂_ε^plaus| = m
Output: F̂_ε^plaus, a sample of m models from the set of plausible models F_ε^plaus
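Algorithm 1 can be sketched in Python. This is a simplified illustration, not the paper's implementation: the function name `plausible_models`, the per-instance flip probabilities `flip_prob` (standing in for the posterior q_{u|y,x}), and the plausibility check (a crude stand-in for Def. 6 that accepts draws whose empirical flip rate is within ε of the expected rate) are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def plausible_models(X, y_noisy, flip_prob, m=5, eps=0.1, rng=None):
    """Sketch of Algorithm 1: repeatedly sample noise draws, keep the
    plausible ones, and train one model per plausible clean dataset.
    flip_prob is an assumed per-instance posterior flip probability."""
    rng = rng or np.random.default_rng(0)
    n = len(y_noisy)
    models = []
    while len(models) < m:
        u = rng.random(n) < flip_prob          # draw noise indicators u_i
        # Simplified plausibility check (stand-in for Def. 6): accept
        # draws whose flip rate is within eps of the expected rate.
        if abs(u.mean() - flip_prob.mean()) <= eps:
            y_clean = np.logical_xor(y_noisy, u).astype(int)  # y_i XOR u_i
            f = LogisticRegression(max_iter=1000).fit(X, y_clean)
            models.append(f)
    return models
```

Disagreement among the returned models on a given instance then signals a prediction that is sensitive to the unknown noise draw.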
Open Source Code Yes Supporting material and code can be found in Appendix B and on GitHub.
Open Datasets Yes We work with 5 classification datasets from clinical applications where models support individual medical decisions (see Table 3):
- shock_eicu: n = 3,456, d = 104 (Pollard et al. [45])
- shock_mimic: n = 15,254, d = 104 (Johnson et al. [22])
- lungcancer: n = 62,916, d = 40 (NCI [41])
- mortality: n = 20,334, d = 84 (Le Gall et al. [28])
- support: n = 9,696, d = 114 (Knaus et al. [24])
We also use the enhancer dataset from Gschwind et al. [16] to predict the outcome of experiments to discover enhancers...
Dataset Splits Yes We split each dataset into a training sample (80%), which we use to train a logistic regression model (LR) and a neural network (DNN) using noisy labels, and a test sample (20%), which we use to measure out-of-sample performance using true labels.
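The split protocol described above can be sketched as follows. The synthetic data, the 20% noise rate, and the variable names (`y_noisy`, `y_true`) are illustrative assumptions; the key point is that training sees only noisy labels while evaluation uses true labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the clinical datasets (assumed shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)
flips = rng.random(500) < 0.2                      # 20% label noise (assumed)
y_noisy = np.logical_xor(y_true, flips).astype(int)

# 80/20 split: fit on noisy training labels, score on true test labels.
X_tr, X_te, yn_tr, yn_te, yt_tr, yt_te = train_test_split(
    X, y_noisy, y_true, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, yn_tr)
test_acc = clf.score(X_te, yt_te)   # out-of-sample accuracy vs. true labels
```

Passing `y_noisy` and `y_true` together to `train_test_split` keeps the noisy/true label pairs aligned across the split.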
Hardware Specification No The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. The acknowledgments mention general computing resources from the Vector Institute but no specifics.
Software Dependencies No The paper mentions training logistic regression and neural network models but does not provide specific software dependencies, such as library names with version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn versions).
Experiment Setup No The paper mentions the noise rates used for corrupting labels ([5%, 20%, 40%]) and the two training methods (Ignore and Hedge). However, it does not provide specific hyperparameters for the logistic regression and neural network models, such as learning rates, batch sizes, optimizer details, or number of epochs.
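The label-corruption setup at the stated noise rates can be sketched as below. Uniform (class-independent) flipping and the function name `corrupt_labels` are simplifying assumptions; the paper's noise model p_{u|y} may be class-dependent.

```python
import numpy as np

def corrupt_labels(y, noise_rate, rng):
    """Flip each binary label independently with probability noise_rate.
    (Uniform flipping is an illustrative assumption.)"""
    u = rng.random(len(y)) < noise_rate
    return np.logical_xor(y, u).astype(int)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
# Noise rates from the paper: 5%, 20%, 40%.
empirical = {r: (corrupt_labels(y, r, rng) != y).mean()
             for r in [0.05, 0.20, 0.40]}
```

The empirical flip rate in `empirical` should track each target rate up to sampling variation.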