Learning from Noisy Labels via Self-Taught On-the-Fly Meta Loss Rescaling

Authors: Michael Heck, Christian Geishauser, Nurul Lubis, Carel van Niekerk, Shutong Feng, Hsien-Chin Lin, Benjamin Matthias Ruppik, Renato Vukovic, Milica Gasic

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental An extensive empirical evaluation on various NLP tasks that validates STORM's ability to identify noisy and ambiguous samples with high recall and low false positives. We train and evaluate on three types of NLP classification datasets.
Researcher Affiliation Academia Heinrich Heine University Düsseldorf EMAIL
Pseudocode Yes Algorithm 1: STORM
Data: Initial θ, initial ω, noisy training batches X, noisy validation batches V, forward passes G
Result: Trained model Θ_θ and rescaler Ω_ω
while training continues do
    // Inner loop learns Θ
    for each inner loop traversal do
        B ← SampleBatch(X)
        F ← GetRescalerFeatures(Θ_θ, B, G)
        ℓ ← Forward(Θ_θ, B)
        ∇θ ← Backward(Ω_ω(F), ℓ)
        θ′ ← Optimize(θ, ∇θ)
        θ ← θ′
    end
    // Outer loop meta-learns Ω
    B_val ← SampleBatch(V)
    F_val ← GetRescalerFeatures(Θ_θ′, B_val, G)
    ℓ_θ′ ← Forward(Θ_θ′, B_val)
    ∇meta_ω, ∇outer_ω ← Backward(Ω_ω(F_val), ℓ_θ′)
    ω′ ← Optimize(ω, ∇meta_ω, ∇outer_ω)
    ω ← ω′
end
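The bi-level structure of Algorithm 1 (an inner loop that trains the model on rescaled losses, and an outer loop that meta-learns the rescaler on validation loss) can be sketched as follows. This is a toy illustration, not the paper's implementation: it uses a small logistic-regression model in place of the transformer encoder, hand-made per-sample features in place of the paper's rescaler features, and a finite-difference approximation of the meta-gradient instead of backpropagating through the inner update. All function and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(theta, batch):
    # Per-sample logistic loss for a linear model (stand-in for the encoder Theta).
    logits = batch["x"] @ theta
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -(batch["y"] * np.log(p + eps) + (1 - batch["y"]) * np.log(1 - p + eps))

def rescaler(omega, feats):
    # Omega maps per-sample features to loss weights in (0, 1).
    return 1.0 / (1.0 + np.exp(-(feats @ omega)))

def get_rescaler_features(batch):
    # Toy stand-in for the paper's rescaler features F.
    return np.stack([batch["x"][:, 0], np.abs(batch["x"][:, 1])], axis=1)

def grad_theta(theta, batch, weights):
    # Gradient of the weighted mean logistic loss w.r.t. theta.
    logits = batch["x"] @ theta
    p = 1.0 / (1.0 + np.exp(-logits))
    return batch["x"].T @ (weights * (p - batch["y"])) / len(batch["y"])

def inner_step(theta, omega, batch, lr=0.5):
    # Inner loop: update theta on the rescaled training loss.
    w = rescaler(omega, get_rescaler_features(batch))
    return theta - lr * grad_theta(theta, batch, w)

def outer_step(theta, omega, batch, val, lr=0.5, eps=1e-3):
    # Outer loop: central-difference estimate of d L_val(theta') / d omega,
    # where theta' is the post-inner-step model.
    g = np.zeros_like(omega)
    for i in range(len(omega)):
        for sign in (+1.0, -1.0):
            o = omega.copy()
            o[i] += sign * eps
            g[i] += sign * forward(inner_step(theta, o, batch), val).mean()
    return omega - lr * g / (2 * eps)
```

A usage sketch: alternate `inner_step` and `outer_step` over noisy training batches and (possibly noisy) validation batches, exactly as the while-loop in Algorithm 1 alternates model updates and rescaler updates.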
Open Source Code Yes Code https://gitlab.cs.uni-duesseldorf.de/general/dsml/storm-public
Open Datasets Yes YouTube (Alberto, Lochter, and Almeida 2015) and SMS (Almeida, Hidalgo, and Yamakami 2011) are spam detection benchmarks... MRPC, CoLA and RTE are members of the GLUE benchmark (Wang et al. 2018)... MultiWOZ 2.4 (Ye, Manotumruksa, and Yilmaz 2022)
Dataset Splits Yes For GLUE benchmarks, no test sets are available; therefore, we use 2-fold cross-validation on the validation sets. We introduce 10% to 40% random label noise into the training portions of these datasets to simulate noisy labels.
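The noise-injection step described above (10% to 40% random label noise on the training portion) can be sketched as a small helper. This is an assumed implementation of uniform label flipping, not code from the paper's repository; the function name and signature are illustrative.

```python
import numpy as np

def inject_label_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a `noise_rate` fraction of labels to a uniformly chosen
    *different* class, simulating random label noise on a training split."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    n = len(noisy)
    idx = rng.choice(n, size=int(round(noise_rate * n)), replace=False)
    # Adding a random offset in [1, num_classes) guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=len(idx))
    noisy[idx] = (noisy[idx] + offsets) % num_classes
    return noisy
```

Flipping to a guaranteed-different class means the realized noise rate equals the requested one exactly, which keeps the 10%/20%/30%/40% conditions comparable.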
Hardware Specification No Computational resources were provided by the Centre for Information and Media Technology at Heinrich Heine University Düsseldorf, and Google Cloud.
Software Dependencies No We initialize Enc(·) with RoBERTa-base (Liu et al. 2019). All tasks are trained with cross-entropy loss using the Adam optimizer (Kingma and Ba 2015).
Experiment Setup Yes Optimal learning rates are determined via grid search on the original clean datasets. MultiWOZ experiments are an exception due to the lack of a clean training dataset. Learning rates are constant except for MultiWOZ, where we employ a linear schedule with 10% warmup. Maximum epochs are 10 with early stopping based on validation performance. Batch sizes B are 48 for MultiWOZ and 32 for the other datasets. The dropout rate for the transformer encoder is 10%.
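The linear schedule with 10% warmup mentioned for the MultiWOZ runs can be sketched as a plain function. This is a generic warmup-then-linear-decay rule, not code from the paper; the function name and default fraction are illustrative.

```python
def linear_schedule_with_warmup(step, total_steps, base_lr, warmup_frac=0.1):
    """Linear warmup from 0 to base_lr over the first `warmup_frac` of steps,
    then linear decay back to 0 by `total_steps`."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

For the other datasets the paper keeps the learning rate constant, so this schedule would apply only to the MultiWOZ configuration.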