A second-order-like optimizer with adaptive gradient scaling for deep learning
Authors: Jerome Bolte, Ryan Boustany, Edouard Pauwels, Andrei Purica
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | INNAprop is evaluated on CIFAR-10, Food101, and ImageNet with ResNets, VGG, DenseNet, and ViT. We also train GPT-2 (OpenWebText) from scratch and with LoRA fine-tuning (E2E). INNAprop consistently offers close performance to AdamW, while performing significantly better in our LLM training experiments, achieving faster convergence and higher accuracy with minimal hyperparameter tuning, even at large scale. |
| Researcher Affiliation | Collaboration | Jérôme Bolte (EMAIL), Toulouse School of Economics, University of Toulouse Capitole, Toulouse, France; Ryan Boustany (EMAIL), Toulouse School of Economics, University of Toulouse Capitole, and Thales LAS France; Edouard Pauwels (EMAIL), Toulouse School of Economics, University of Toulouse Capitole, Toulouse, France; Andrei Purica (EMAIL), Thales LAS France |
| Pseudocode | Yes | Algorithm 1: Deep learning implementation of INNAprop; Algorithm 2: INNAprop; Algorithm 3: INNAprop with (α, β) = (1, 1); Algorithm 4: DINAdam |
| Open Source Code | Yes | Our code is public: https://github.com/innaprop/innaprop |
| Open Datasets | Yes | INNAprop is evaluated on CIFAR-10, Food101, and ImageNet with ResNets, VGG, DenseNet, and ViT. We also train GPT-2 (OpenWebText) from scratch and with LoRA fine-tuning (E2E). ...CIFAR10 (Krizhevsky & Hinton, 2010)...ImageNet-1k benchmark (Krizhevsky et al., 2012)...Food101 dataset (Bossard et al., 2014)...OpenWebText dataset (Gokaslan & Cohen, 2019)...E2E dataset (Novikova et al., 2017) |
| Dataset Splits | Yes | We fine-tune the same GPT-2 models on the E2E dataset (Novikova et al., 2017), consisting of roughly 42,000 training examples and 4,600 test examples from the restaurant domain. |
| Hardware Specification | Yes | 1 V100 GPU (CIFAR-10 and Food101 experiments); 4 V100 GPUs (ResNet18 and ResNet50 ImageNet experiments); 8 A100 GPUs (ViT-B/32 ImageNet experiment); 4 A100 GPUs (GPT-2 from-scratch experiment); 1 A100 GPU (GPT-2 with LoRA experiment). |
| Software Dependencies | No | The paper mentions using "PyTorch tutorial code", "optuna (Akiba et al., 2019)", the "nanoGPT repository", and the "LoRA codebase" but does not specify version numbers for these software components or libraries. |
| Experiment Setup | Yes | Hyperparameter tuning: We consider VGG11 (Simonyan & Zisserman, 2014) and ResNet18 (He et al., 2016) models trained on CIFAR10 (Krizhevsky & Hinton, 2010). We fix a cosine scheduler with Tmax = 200, as recommended for AdamW, and γmin = 0 (see Appendix D for more details), and consider two weight decay parameters, λ = 0 or λ = 0.01 (the default value for AdamW). We tune the initial learning rate γ0 only for AdamW. We find γ0 = 10⁻³, which is also the baseline value reported for AdamW in this experiment (see Appendix E). For INNAprop, we tune only α and β, using γ0 = 10⁻³ from AdamW. Using optuna (Akiba et al., 2019), we perform a grid search over (α, β) ∈ {0.1, 0.5, 0.9, . . . , 3.5, 4.0}. Appendix G provides detailed tables of hyperparameter values for various experiments, including architecture, epochs, batch size, learning rates, weight decay, and specific settings for optimizers and training. |
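The tuning protocol above (fix γ0 = 10⁻³, exhaustively search the (α, β) grid, keep the pair with the best validation result) can be sketched as follows. This is a minimal stdlib sketch, not the paper's optuna code: `validation_loss` is a hypothetical stand-in for a full CIFAR-10 training run with INNAprop, and the grid lists only the values the paper spells out (the intermediate steps are elided there).

```python
import itertools

GAMMA0 = 1e-3  # initial learning rate, fixed from the AdamW tuning

def validation_loss(alpha, beta, gamma0=GAMMA0):
    """Hypothetical surrogate for training INNAprop(alpha, beta) and
    returning validation loss; a toy bowl minimized at (0.9, 0.9)."""
    return (alpha - 0.9) ** 2 + (beta - 0.9) ** 2

# Grid values explicitly listed in the paper; the "..." in the paper
# elides further intermediate values, which are omitted here.
grid = [0.1, 0.5, 0.9, 3.5, 4.0]

# Exhaustive search over all (alpha, beta) pairs, as in the paper's
# optuna GridSampler setup.
best_pair = min(itertools.product(grid, grid),
                key=lambda ab: validation_loss(*ab))
print(best_pair)
```

With the toy surrogate this prints `(0.9, 0.9)`; in the actual experiment each grid point costs one full training run, so the grid size directly bounds the tuning budget.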