Learning from Noisy Labels via Self-Taught On-the-Fly Meta Loss Rescaling

Authors: Michael Heck, Christian Geishauser, Nurul Lubis, Carel van Niekerk, Shutong Feng, Hsien-Chin Lin, Benjamin Matthias Ruppik, Renato Vukovic, Milica Gasic

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental An extensive empirical evaluation on various NLP tasks that validates STORM's ability to identify noisy and ambiguous samples with high recall and low false positives. We train and evaluate on three types of NLP classification datasets.
Researcher Affiliation Academia Heinrich Heine University Düsseldorf EMAIL
Pseudocode Yes Algorithm 1: STORM
Data: Initial θ, initial ω, noisy training batches X, noisy validation batches V, forward passes G
Result: Trained model Θ_θ and rescaler Ω_ω
while training continues do
    // Inner loop learns Θ
    for each inner loop traversal do
        B ← SampleBatch(X)
        F ← GetRescalerFeatures(Θ_θ, B, G)
        ℓ ← Forward(Θ_θ, B)
        ∇θ ← Backward(Ω_ω(F), ℓ)
        θ′ ← Optimize(θ, ∇θ)
        θ ← θ′
    end
    // Outer loop meta-learns Ω
    B_val ← SampleBatch(V)
    F_val ← GetRescalerFeatures(Θ_θ′, B_val, G)
    ℓ_θ′ ← Forward(Θ_θ′, B_val)
    ∇meta_ω, ∇outer_ω ← Backward(Ω_ω(F_val), ℓ_θ′)
    ω′ ← Optimize(ω, ∇meta_ω, ∇outer_ω)
    ω ← ω′
end
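The bi-level structure of Algorithm 1 (an inner loop that trains the model on rescaled losses, and an outer loop that meta-learns the rescaler on validation loss) can be sketched as follows. This is a toy illustration, not the paper's implementation: it uses a small logistic-regression model in place of the transformer encoder, hand-made per-sample features in place of the paper's rescaler features, and a finite-difference approximation of the meta-gradient instead of backpropagating through the inner update. All function and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(theta, batch):
    # Per-sample logistic loss for a linear model (stand-in for the encoder Theta).
    logits = batch["x"] @ theta
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -(batch["y"] * np.log(p + eps) + (1 - batch["y"]) * np.log(1 - p + eps))

def rescaler(omega, feats):
    # Omega maps per-sample features to loss weights in (0, 1).
    return 1.0 / (1.0 + np.exp(-(feats @ omega)))

def get_rescaler_features(batch):
    # Toy stand-in for the paper's rescaler features F.
    return np.stack([batch["x"][:, 0], np.abs(batch["x"][:, 1])], axis=1)

def grad_theta(theta, batch, weights):
    # Gradient of the weighted mean logistic loss w.r.t. theta.
    logits = batch["x"] @ theta
    p = 1.0 / (1.0 + np.exp(-logits))
    return batch["x"].T @ (weights * (p - batch["y"])) / len(batch["y"])

def inner_step(theta, omega, batch, lr=0.5):
    # Inner loop: update theta on the rescaled training loss.
    w = rescaler(omega, get_rescaler_features(batch))
    return theta - lr * grad_theta(theta, batch, w)

def outer_step(theta, omega, batch, val, lr=0.5, eps=1e-3):
    # Outer loop: central-difference estimate of d L_val(theta') / d omega,
    # where theta' is the post-inner-step model.
    g = np.zeros_like(omega)
    for i in range(len(omega)):
        for sign in (+1.0, -1.0):
            o = omega.copy()
            o[i] += sign * eps
            g[i] += sign * forward(inner_step(theta, o, batch), val).mean()
    return omega - lr * g / (2 * eps)
```

A usage sketch: alternate `inner_step` and `outer_step` over noisy training batches and (possibly noisy) validation batches, exactly as the while-loop in Algorithm 1 alternates model updates and rescaler updates.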
Open Source Code Yes Code https://gitlab.cs.uni-duesseldorf.de/general/dsml/storm-public
Open Datasets Yes YouTube (Alberto, Lochter, and Almeida 2015) and SMS (Almeida, Hidalgo, and Yamakami 2011) are spam detection benchmarks... MRPC, CoLA and RTE are members of the GLUE benchmark (Wang et al. 2018)... MultiWOZ 2.4 (Ye, Manotumruksa, and Yilmaz 2022)
Dataset Splits Yes For GLUE benchmarks, no test sets are available; therefore, we use 2-fold cross-validation on the validation sets. We introduce 10% to 40% random label noise into the training portions of these datasets to simulate noisy labels.
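The noise-injection step described above (10% to 40% random label noise on the training portion) can be sketched as a small helper. This is an assumed implementation of uniform label flipping, not code from the paper's repository; the function name and signature are illustrative.

```python
import numpy as np

def inject_label_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a `noise_rate` fraction of labels to a uniformly chosen
    *different* class, simulating random label noise on a training split."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    n = len(noisy)
    idx = rng.choice(n, size=int(round(noise_rate * n)), replace=False)
    # Adding a random offset in [1, num_classes) guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=len(idx))
    noisy[idx] = (noisy[idx] + offsets) % num_classes
    return noisy
```

Flipping to a guaranteed-different class means the realized noise rate equals the requested one exactly, which keeps the 10%/20%/30%/40% conditions comparable.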
Hardware Specification No Computational resources were provided by the Centre for Information and Media Technology at Heinrich Heine University Düsseldorf, and Google Cloud.
Software Dependencies No We initialize Enc(·) with RoBERTa-base (Liu et al. 2019). All tasks are trained with cross-entropy loss using the Adam optimizer (Kingma and Ba 2015).
Experiment Setup Yes Optimal learning rates are determined via grid search on the original clean datasets. MultiWOZ experiments are an exception due to the lack of a clean training dataset. Learning rates are constant except for MultiWOZ, where we employ a linear schedule with 10% warmup. Maximum epochs are 10 with early stopping based on validation performance. Batch sizes B are 48 for MultiWOZ and 32 for the other datasets. The dropout rate for the transformer encoder is 10%.
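The linear schedule with 10% warmup mentioned for the MultiWOZ runs can be sketched as a plain function. This is a generic warmup-then-linear-decay rule, not code from the paper; the function name and default fraction are illustrative.

```python
def linear_schedule_with_warmup(step, total_steps, base_lr, warmup_frac=0.1):
    """Linear warmup from 0 to base_lr over the first `warmup_frac` of steps,
    then linear decay back to 0 by `total_steps`."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For the other datasets the paper keeps the learning rate constant, so this schedule would apply only to the MultiWOZ configuration.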