Learning from Noisy Labels via Self-Taught On-the-Fly Meta Loss Rescaling
Authors: Michael Heck, Christian Geishauser, Nurul Lubis, Carel van Niekerk, Shutong Feng, Hsien-Chin Lin, Benjamin Matthias Ruppik, Renato Vukovic, Milica Gašić
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | An extensive empirical evaluation on various NLP tasks that validates STORM's ability to identify noisy and ambiguous samples with high recall and low false positives. We train and evaluate on three types of NLP classification datasets. |
| Researcher Affiliation | Academia | Heinrich Heine University Düsseldorf |
| Pseudocode | Yes | Algorithm 1: STORM. Data: initial θ, initial ω, noisy training batches X, noisy validation batches V, forward passes G. Result: trained model Θ_θ and rescaler Ω_ω. While training continues: (inner loop learns Θ) for each inner-loop traversal: B ← SampleBatch(X); F ← GetRescalerFeatures(Θ_θ, B, G); ℓ ← Forward(Θ_θ, B); ∇θ ← Backward(Ω_ω(F), ℓ); θ′ ← Optimize(θ, ∇θ); θ ← θ′. (outer loop meta-learns Ω) B_val ← SampleBatch(V); F_val ← GetRescalerFeatures(Θ_θ′, B_val, G); ℓ_θ′ ← Forward(Θ_θ′, B_val); ∇meta_ω, ∇outer_ω ← Backward(Ω_ω(F_val), ℓ_θ′); ω′ ← Optimize(ω, ∇meta_ω, ∇outer_ω); ω ← ω′. |
| Open Source Code | Yes | Code: https://gitlab.cs.uni-duesseldorf.de/general/dsml/storm-public |
| Open Datasets | Yes | Youtube (Alberto, Lochter, and Almeida 2015) and SMS (Almeida, Hidalgo, and Yamakami 2011) are spam detection benchmarks... MRPC, CoLA and RTE are members of the GLUE benchmark (Wang et al. 2018)... MultiWOZ 2.4 (Ye, Manotumruksa, and Yilmaz 2022) |
| Dataset Splits | Yes | For GLUE benchmarks, no test sets are available, therefore we use 2-fold cross-validation using the validation sets. We introduce 10% to 40% random label noise into the training portions of these datasets to simulate noisy labels. |
| Hardware Specification | No | Computational resources were provided by the Centre for Information and Media Technology at Heinrich Heine University Düsseldorf, and Google Cloud. |
| Software Dependencies | No | We initialize Enc(·) with RoBERTa-base (Liu et al. 2019). All tasks are trained with cross-entropy loss using the Adam optimizer (Kingma and Ba 2015). |
| Experiment Setup | Yes | Optimal learning rates are determined via grid search on the original clean datasets. MultiWOZ experiments are an exception due to the lack of a clean training dataset. Learning rates are constant except for MultiWOZ, where we employ a linear schedule with 10% warmup. Maximum epochs are 10 with early stopping based on validation performance. Batch sizes B are 48 for MultiWOZ and 32 for the other datasets. The dropout rate for the transformer encoder is 10%. |
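The "10% to 40% random label noise" injected into the training splits can be simulated with a short helper. This is an illustrative sketch, not the paper's released code; the function name `inject_label_noise` and the symmetric-flip scheme (replacement label drawn uniformly from the other classes) are assumptions.

```python
import numpy as np

def inject_label_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a `noise_rate` fraction of labels to a different class,
    simulating symmetric random label noise in the training portion."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    n = len(noisy)
    # choose which indices to corrupt, without replacement
    flip_idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in flip_idx:
        # draw a replacement label different from the original
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```

For a binary task with 100 samples and `noise_rate=0.3`, exactly 30 labels end up flipped, which makes the corruption level easy to control across the 10%-40% grid.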
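The bilevel structure of Algorithm 1 (an inner loop that trains Θ under rescaled losses and an outer loop that meta-learns the rescaler Ω on a noisy validation batch) can be sketched on a toy linear model. This is a minimal sketch under stated assumptions: the rescaler feature here is just the detached per-sample loss, `storm_step` and `rescaler_features` are hypothetical names, and plain SGD stands in for the Adam updates used in the paper.

```python
import torch
import torch.nn.functional as F

def rescaler_features(logits, labels):
    """Per-sample features F for the rescaler; here only the detached
    per-sample loss (an illustrative choice, not the paper's full set)."""
    loss = F.cross_entropy(logits, labels, reduction="none")
    return loss.detach().unsqueeze(1)

def storm_step(theta, omega, x, y, x_val, y_val, lr_in=0.1, lr_out=0.1):
    # Inner loop: update theta under losses rescaled by Omega_omega.
    logits = x @ theta
    losses = F.cross_entropy(logits, y, reduction="none")
    weights = torch.sigmoid(rescaler_features(logits, y) @ omega).squeeze(1)
    inner_loss = (weights * losses).mean()
    # create_graph=True keeps the inner update differentiable w.r.t. omega,
    # so the outer (meta) gradient can flow through theta'.
    grad_theta = torch.autograd.grad(inner_loss, theta, create_graph=True)[0]
    theta_prime = theta - lr_in * grad_theta
    # Outer loop: meta-learn omega on a (noisy) validation batch.
    val_loss = F.cross_entropy(x_val @ theta_prime, y_val)
    grad_omega = torch.autograd.grad(val_loss, omega)[0]
    theta_new = theta_prime.detach().requires_grad_()
    omega_new = (omega - lr_out * grad_omega).detach().requires_grad_()
    return theta_new, omega_new
```

The key design point, matching Algorithm 1, is that θ′ is produced by a differentiable update, so the validation loss ℓ_θ′ carries a gradient back to ω through the per-sample weights.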