Learning under Temporal Label Noise
Authors: Sujay Nagaraj, Walter Gerych, Sana Tonekaboni, Anna Goldenberg, Berk Ustun, Thomas Hartvigsen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark our methods on a collection of temporal classification tasks from real-world applications. Our goal is to evaluate methods in terms of robustness to temporal label noise, and characterize when it is important to consider temporal variation in label noise. We include additional details on setup and results in Appendix D and code to reproduce our results on GitHub. |
| Researcher Affiliation | Academia | 1University of Toronto 2MIT 3Broad Institute 4UCSD 5University of Virginia Corresponding authors: EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Continuous Learning Algorithm. Input: noisy training dataset D, hyperparameters γ and η. Output: model θ, temporal noise function ω. Initialize c ← 1 and λ ← 1. For k = 1, 2, 3, …: (1) θ_k, ω_k = arg min_{θ,ω} L(θ, ω), computed with SGD using the Adam optimizer; (2) λ ← λ + c·R_t(θ_k, ω_k) (update Lagrange multiplier); (3) if k > 0 and R_t(θ_k, ω_k) > γ·R_t(θ_{k−1}, ω_{k−1}) then c ← η·c; (4) if R_t(θ_k, ω_k) = 0 then break. |
| Open Source Code | Yes | All of our code is available at https://github.com/sujaynagaraj/TemporalLabelNoise |
| Open Datasets | Yes | We work with four real-world datasets from healthcare. Each dataset reflects a binary classification task over a complex feature space, with labeled examples across multiple time steps, where the labels are likely to exhibit label noise. The tasks include: 1. moving: human activity recognition task where we detect movement states (e.g., walking vs. sitting) using temporal accelerometer data in adults [59]. 2. senior: similar human activity recognition task as above but in senior citizens [44]. 3. sleeping: sleep state detection (e.g., light sleep vs. REM) task using continuous EEG data [22]. 4. blinking: eye movement (open vs. closed) detection task using continuous EEG data [60]. |
| Dataset Splits | Yes | We split each dataset into a noisy training sample (80%, used to train the models and correct for label noise) and a clean test sample (20%, used to compute unbiased estimates of out-of-sample performance). |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or detailed computer specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and a GRU (Gated Recurrent Unit) architecture, implying standard deep learning libraries, but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | We train each model for 150 epochs using the Adam optimizer with default parameters and a learning rate of 0.01. We train all models using the same set of hyperparameters for each experiment, set the batch size manually for each dataset, and avoid hyperparameter tuning to avoid label leakage. For VolMinNet, Discontinuous, and Continuous we use the Adam optimizer with default parameters and a learning rate of 0.01 to optimize each respective Q̂_t-estimation technique. λ was set to 1e-4 for VolMinNet and Discontinuous for all experiments, following the setup of Li et al. [38]. They also describe a two-stage approach: 1) estimate the anchor points after a warmup period; 2) use the anchor points to train the classifier with forward-corrected loss. We set the warmup period to 25 epochs. For all experiments we set λ = 1, c = 1, γ = 2, and η = 2. k and the maximum number of SGD iterations are set to 15 and 10, respectively, to ensure that the total number of epochs is 150, the maximum used for all experiments. |
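The outer loop of Algorithm 1 (the Lagrange-multiplier and penalty-coefficient schedule) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the inner SGD solve is abstracted as a `minimize_step` callback, and `r_t` stands in for the constraint term R_t; both names are hypothetical.

```python
def continuous_learning_schedule(minimize_step, r_t, gamma=2.0, eta=2.0, max_outer=15):
    """Sketch of Algorithm 1's outer loop (augmented-Lagrangian-style updates).

    minimize_step(lam, c) -> (theta, omega): abstract inner solve of
        arg min L(theta, omega) given the current multipliers (Adam/SGD in the paper).
    r_t(theta, omega) -> float: constraint residual R_t.
    """
    c, lam = 1.0, 1.0                 # initialize c <- 1 and lambda <- 1
    prev_residual = None
    theta = omega = None
    for k in range(max_outer):
        theta, omega = minimize_step(lam, c)   # inner SGD solve
        residual = r_t(theta, omega)
        lam = lam + c * residual               # update Lagrange multiplier
        if prev_residual is not None and residual > gamma * prev_residual:
            c = eta * c                        # insufficient decrease: raise penalty
        if residual == 0:
            break                              # constraint satisfied
        prev_residual = residual
    return theta, omega, lam, c
```

With γ = 2, η = 2, 15 outer iterations, and 10 SGD steps per inner solve, this matches the paper's stated budget of 150 total epochs.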