Training Deep Learning Models with Norm-Constrained LMOs
Authors: Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Abstract: Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision. The code is available at https://github.com/LIONS-EPFL/scion. Section 6: Experiments |
| Researcher Affiliation | Academia | 1LIONS, EPFL; 2CVN, Université Paris-Saclay. Correspondence to: Thomas Pethick <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Unconstrained SCG (uSCG) Input: Horizon n, initialization x1 ∈ X, d0 = 0, momentum αk ∈ (0, 1], and stepsize γk ∈ (0, 1) ... Algorithm 2 Stochastic Conditional Gradient (SCG) Input: Horizon n, initialization x1 ∈ D, d0 = 0, momentum αk ∈ (0, 1], and stepsize γk ∈ (0, 1) ... Algorithm 3 (Unconstrained) Scion Input: Horizon n, init. x1 = (W_1^1, ..., W_L^1), d0 = 0, momentum αk ∈ (0, 1], stepsize γ ∈ (0, 1), radii ρi ∈ R+. ... Algorithm 4 Averaged LMO directioNal Descent (ALMOND) Input: Horizon n, initialization x1 ∈ X, d0 = 0, momentum α ∈ (0, 1), stepsize γ ∈ (0, 1) |
| Open Source Code | Yes | The code is available at https://github.com/LIONS-EPFL/scion. |
| Open Datasets | Yes | Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion... We train for 5100 iterations with a batch size of 512 on the FineWeb dataset... We additionally test on vision transformers (ViT) on ImageNet and convolutional neural networks (CNN) on the CIFAR10 dataset... Dataset FineWeb... Dataset Shakespeare... Dataset CIFAR10... Dataset ImageNet-1k |
| Dataset Splits | Yes | We train for 5100 iterations with a batch size of 512 on the FineWeb dataset (see Table 7 regarding hyperparameters)... Figure 1: Performance on NanoGPT... Figure 2: Batch size sensitivity on NanoGPT (124M)... Figure 3: SCION leads to 30% fewer epochs for ViT on ImageNet... Table 9. Shallow MLP hyperparameters. Dataset CIFAR10 (50000 training examples)... Figure 11 (right): The optimal stepsize transfers across width. |
| Hardware Specification | Yes | Table 8. Shallow GPT hyperparameters... We increase the batch size to 32, which is the maximum allowed for a model with an embedding size of 4096 on an A100. |
| Software Dependencies | No | We provide a reference implementation in PyTorch referred to as ScionLight. |
| Experiment Setup | Yes | Table 7. NanoGPT hyperparameters. Hyperparameter (AdamW / Muon / UNCONSTRAINED SCION / SCION): Layers 12; Head dim 128; Activation function ReLU^2 / Scaled ReLU^2 (see Appendix E.3); Vocabulary size 50304; Dataset FineWeb; Batch size 512; Block size 1024; Iterations n 5100; Warmdown 28.5%; Stepsize schedule: constant then linear decay, γ_k = γ if k < n − m, γ(n − k)/m if k ≥ n − m; Warmup 5% / 0; Gradient clipping Yes / No; Momentum β1 / β2 0.9 / 0.95; Averaging parameter α 0.1; Muon stepsize multiplier 0.1; Nesterov Yes; Boundary init. No; Radius ρ1 / ρℓ / ρL /50 / 3000 |
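
The uSCG/Scion pseudocode quoted above follows a simple pattern: maintain a momentum-averaged gradient estimate d_k, query a linear minimization oracle (LMO) over a norm ball, and take a step along the LMO output. Below is a minimal NumPy sketch of one such update, assuming an ℓ2-norm ball for the LMO; the paper uses layer-wise norm balls (e.g. spectral norm for matrix parameters), and the function names here are illustrative, not the repository's API:

```python
import numpy as np

def lmo_l2_ball(d, radius):
    """LMO for an l2-norm ball of the given radius:
    argmin_{||s|| <= radius} <s, d> = -radius * d / ||d||."""
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return np.zeros_like(d)
    return -radius * d / norm

def uscg_step(x, d_prev, grad, alpha, gamma, radius):
    """One unconstrained SCG (uSCG) update, per Algorithm 1's structure."""
    # Momentum-averaged gradient estimate: d_k = (1 - alpha) d_{k-1} + alpha g_k
    d = (1.0 - alpha) * d_prev + alpha * grad
    # Step along the LMO direction (unconstrained variant: additive update)
    x_new = x + gamma * lmo_l2_ball(d, radius)
    return x_new, d
```

For the constrained variant (SCG, Algorithm 2), the update is instead a convex combination of the current iterate and the LMO output, which keeps the iterate inside the feasible set.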
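
The "constant then linear decay" stepsize schedule in Table 7 (constant stepsize, then a linear warmdown over the final 28.5% of iterations) can be sketched as follows; the function name and signature are my own, not from the paper's code:

```python
def warmdown_lr(k, n, gamma, warmdown_frac=0.285):
    """Constant stepsize gamma, then linear decay to zero over the
    last warmdown_frac fraction of the n total iterations."""
    m = int(warmdown_frac * n)      # number of warmdown iterations
    if k < n - m:
        return gamma                # constant phase
    return gamma * (n - k) / m      # linear decay: gamma -> 0
```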