Meta-Learning Adaptive Loss Functions

Authors: Christian Raymond, Qi Chen, Bing Xue, Mengjie Zhang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results show that our proposed method consistently outperforms the cross-entropy loss and offline loss function learning techniques on a diverse range of neural network architectures and datasets. In this section, the experimental setup for evaluating AdaLFL is presented. In summary, experiments are conducted across seven open-access datasets and multiple well-established network architectures. The performance of AdaLFL is assessed against three benchmark methods.
Researcher Affiliation | Academia | Christian Raymond (EMAIL), Victoria University of Wellington; Qi Chen (EMAIL), Victoria University of Wellington; Bing Xue (EMAIL), Victoria University of Wellington; Mengjie Zhang (EMAIL), Victoria University of Wellington
Pseudocode | Yes | Algorithm 1: Loss Function Initialization (Offline); Algorithm 2: Loss Function Adaptation (Online); Algorithm 3: Learning Rate Initialization (Offline); Algorithm 4: Learning Rate Adaptation (Online)
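At the control-flow level, these four algorithms pair an offline initialization phase with per-step online adaptation. The sketch below is a schematic reading of that structure only; all names are placeholders, not the paper's implementation:

```
# Offline phase (Algorithms 1 and 3): initialize the parametric loss
# function (and learning rates) before base training begins.
for step in 1..S_init:
    meta-update loss-function parameters φ on a sampled batch

# Online phase (Algorithms 2 and 4): interleave one meta step with
# each base-network update during training proper.
for each base training step:
    update base network weights θ using the learned loss M_φ
    meta-update φ (and learning rates) with a smaller meta learning rate
```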
Open Source Code | Yes | All experiments are implemented in PyTorch (Paszke et al., 2017) and Higher (Grefenstette et al., 2019), and the code is available at the GitHub repository: https://github.com/Decadz/Online-Loss-Function-Learning
Open Datasets | Yes | Following the established literature on loss function learning, the regression datasets Communities and Crime (Redmond, 2009), Diabetes (Efron et al., 2004), and California Housing (Pace & Barry, 1997) are used as a simple domain to illustrate the capabilities of the proposed method. Following this, the classification datasets MNIST (LeCun et al., 1998), CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and SVHN (Netzer et al., 2011) are employed to assess the performance of AdaLFL and determine whether the results generalize to larger, more challenging tasks.
Dataset Splits | Yes | The original training-testing partitioning is used for all datasets, with 10% of the training instances allocated for validation.
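As a concrete illustration of that 90/10 split, a minimal sketch (the dataset size, seed, and function name are illustrative, not taken from the paper):

```python
import random

def train_val_split(n_train, val_fraction=0.1, seed=0):
    """Shuffle training indices and hold out val_fraction for validation."""
    idx = list(range(n_train))
    random.Random(seed).shuffle(idx)
    n_val = int(n_train * val_fraction)
    return idx[n_val:], idx[:n_val]  # (train indices, validation indices)

# e.g. CIFAR-10 ships with 50,000 training instances:
train_idx, val_idx = train_val_split(50000)
```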
Hardware Specification | Yes | Table 2: Average run-time of the entire learning process (end-to-end) for each benchmark method. Each algorithm is run on a single Nvidia RTX A5000, and results are reported in hours.
Software Dependencies | No | The paper states "All experiments are implemented in PyTorch (Paszke et al., 2017) and Higher (Grefenstette et al., 2019)", but it does not specify version numbers for PyTorch or Higher. The years in parentheses refer to the publication dates of the cited papers, not the software versions used.
Experiment Setup | Yes | In the inner loop, all regression models are trained using stochastic gradient descent (SGD) with a base learning rate of α = 0.001. Classification models are trained with SGD using a base learning rate of α = 0.01, and on CIFAR-10, CIFAR-100, and SVHN, Nesterov momentum 0.9 and weight decay 0.0005 are applied. ... To initialize Mϕ, Sinit = 2500 steps are taken in offline mode with a meta learning rate of η = 1e-3. In contrast, in online mode, a meta learning rate of η = 1e-5 is used (note: a high meta learning rate in online mode can cause a jittering effect in the loss function, which can lead to training instability). For meta-optimization, the Adam optimizer (Kingma & Ba, 2015) is used in the outer loop for both initialization and online adaptation.
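For reference, the inner-loop update used in the classification experiments (SGD with Nesterov momentum 0.9 and weight decay 0.0005) corresponds to the scalar sketch below, written to mirror PyTorch's SGD formulation; the function name and scalar form are illustrative only:

```python
def sgd_nesterov_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD step with L2 weight decay and Nesterov momentum (scalar form)."""
    g = grad + weight_decay * w             # weight decay folded into the gradient
    velocity = momentum * velocity + g      # momentum buffer update
    w = w - lr * (g + momentum * velocity)  # Nesterov look-ahead step
    return w, velocity

# One step from w = 1.0 with gradient 0.5 and zero initial velocity:
w, v = sgd_nesterov_step(1.0, 0.5, 0.0)
```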