Regretful Decisions under Label Noise
Authors: Sujay Nagaraj, Yang Liu, Flavio Calmon, Berk Ustun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive empirical study on clinical prediction tasks. Our findings highlight the instance-level impact of label noise, and we demonstrate how our approach can support safer inference by flagging potential mistakes. |
| Researcher Affiliation | Academia | Sujay Nagaraj (University of Toronto); Yang Liu (UC Santa Cruz); Flavio P. Calmon (Harvard SEAS); Berk Ustun (UC San Diego) |
| Pseudocode | Yes | Algorithm 1: Generate Plausible Draws, Datasets, and Models. Input: noisy dataset (x_i, y_i) for i = 1, …, n; noise model p(u\|y); number of models m ≥ 1; atypicality ε ∈ [0, 1]. Initialize F̂ᵖˡᵃᵘˢ_ε ← {}. 1: repeat 2: u_i ~ Bernoulli(q(u\|y, x)) for i ∈ [n] (generate noise draw by posterior inference) 3: if [u_1, …, u_n] ∈ U_ε then (check if draw is plausible using Def. 6) 4: ŷ_i ← y_i ⊕ u_i for i ∈ [n] 5: D̂ ← {(x_i, ŷ_i)} for i = 1, …, n (construct plausible clean dataset) 6: f̂ ← argmin over f ∈ F of R̂(f; D̂) (train plausible model) 7: F̂ᵖˡᵃᵘˢ_ε ← F̂ᵖˡᵃᵘˢ_ε ∪ {f̂} (update plausible models) 8: end if 9: until \|F̂ᵖˡᵃᵘˢ_ε\| = m. Output: F̂ᵖˡᵃᵘˢ_ε, a sample of m models from the set of plausible models Fᵖˡᵃᵘˢ_ε |
| Open Source Code | Yes | Supporting material and code can be found in Appendix B and GitHub. |
| Open Datasets | Yes | We work with 5 classification datasets from clinical applications where models support individual medical decisions (see Table 3). ... shock_eicu n = 3,456 d = 104 Pollard et al. [45] ... shock_mimic n = 15,254 d = 104 Johnson et al. [22] ... lungcancer n = 62,916 d = 40 NCI [41] ... mortality n = 20,334 d = 84 Le Gall et al. [28] ... support n = 9,696 d = 114 Knaus et al. [24]. We use the enhancer dataset from Gschwind et al. [16] to predict the outcome of experiments to discover enhancers... |
| Dataset Splits | Yes | We split each dataset into a training sample (80%), which we use to train a logistic regression model (LR) and a neural network (DNN) using noisy labels, and a test sample (20%), which we use to measure out-of-sample performance using true labels. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. The acknowledgments mention general computing resources from the Vector Institute but no specifics. |
| Software Dependencies | No | The paper mentions training logistic regression and neural network models but does not provide specific software dependencies, such as library names with version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn versions). |
| Experiment Setup | No | The paper mentions the noise rates used for corrupting labels ([5%, 20%, 40%]) and the two training methods (Ignore and Hedge). However, it does not provide specific hyperparameters for the logistic regression and neural network models, such as learning rates, batch sizes, optimizer details, or number of epochs. |
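The Algorithm 1 pseudocode quoted above can be sketched in Python. This is a minimal sketch, not the authors' implementation: `q_posterior`, `train_fn`, and `is_plausible` are hypothetical stand-ins for the paper's posterior inference over noise draws, its model-training routine, and the plausibility check against the atypicality set (Def. 6).

```python
import numpy as np

def generate_plausible_models(X, y_noisy, q_posterior, train_fn, is_plausible, m, rng=None):
    """Sketch of Algorithm 1: repeatedly draw noise vectors from the posterior,
    keep only plausible draws, flip the corresponding labels, and train one
    model per plausible clean dataset, until m models are collected.

    q_posterior(X, y_noisy) -> per-instance flip probabilities q(u=1 | y, x)  [assumed interface]
    train_fn(X, y_clean)    -> a fitted model for one plausible clean dataset [assumed interface]
    is_plausible(u)         -> whether the draw u lies in the set U_eps       [assumed interface]
    """
    if rng is None:
        rng = np.random.default_rng(0)
    models = []
    while len(models) < m:
        q = q_posterior(X, y_noisy)
        u = rng.random(len(y_noisy)) < q                   # u_i ~ Bernoulli(q_i)
        if not is_plausible(u):                            # reject draws outside U_eps
            continue
        y_plaus = np.bitwise_xor(y_noisy, u.astype(int))   # y_hat_i = y_i XOR u_i
        models.append(train_fn(X, y_plaus))                # train on plausible clean dataset
    return models
```

The rejection-sampling loop mirrors the `repeat ... until` structure of the pseudocode; a real implementation would also need a stopping guard in case plausible draws are rare.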
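The setup rows describe corrupting labels at fixed rates (5%, 20%, 40%) and an 80%/20% split that trains on noisy labels but evaluates on true labels. A minimal sketch of that protocol, assuming uniform class-independent flips (the function name and signature are illustrative, not from the paper):

```python
import numpy as np

def corrupt_and_split(X, y, noise_rate, rng=None):
    """Flip each binary label independently with probability noise_rate,
    then split 80/20: the training fold gets noisy labels, while the
    held-out fold keeps the true labels for evaluation."""
    if rng is None:
        rng = np.random.default_rng(0)
    y_noisy = np.where(rng.random(len(y)) < noise_rate, 1 - y, y)
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    train, test = idx[:cut], idx[cut:]
    return (X[train], y_noisy[train]), (X[test], y[test])  # noisy train, clean test
```

Evaluating on the clean test labels is what lets the report's out-of-sample numbers reflect performance against the ground truth rather than the corrupted labels.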