Training Deep Learning Models with Norm-Constrained LMOs
Authors: Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Abstract: Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision. The code is available at https://github.com/LIONS-EPFL/scion. Section 6: Experiments |
| Researcher Affiliation | Academia | 1LIONS, EPFL; 2CVN, Université Paris-Saclay. Correspondence to: Thomas Pethick <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Unconstrained SCG (uSCG) Input: Horizon n, initialization x1 ∈ X, d0 = 0, momentum αk ∈ (0, 1], and stepsize γk ∈ (0, 1) ... Algorithm 2 Stochastic Conditional Gradient (SCG) Input: Horizon n, initialization x1 ∈ D, d0 = 0, momentum αk ∈ (0, 1], and stepsize γk ∈ (0, 1) ... Algorithm 3 (Unconstrained) Scion Input: Horizon n, init. x1 = (W_1^1, ..., W_L^1), d0 = 0, momentum αk ∈ (0, 1], stepsize γ ∈ (0, 1), radii ρi ∈ R+. ... Algorithm 4 Averaged LMO directioNal Descent (ALMOND) Input: Horizon n, initialization x1 ∈ X, d0 = 0, momentum α ∈ (0, 1), stepsize γ ∈ (0, 1) |
| Open Source Code | Yes | The code is available at https://github.com/LIONS-EPFL/scion. |
| Open Datasets | Yes | Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion... We train for 5100 iterations with a batch size of 512 on the FineWeb dataset... We additionally test on vision transformers (ViT) on ImageNet and convolutional neural networks (CNN) on the CIFAR10 dataset... Dataset FineWeb... Dataset Shakespeare... Dataset CIFAR10... Dataset ImageNet-1k |
| Dataset Splits | Yes | We train for 5100 iterations with a batch size of 512 on the FineWeb dataset (see Table 7 regarding hyperparameters)... Figure 1: Performance on NanoGPT... Figure 2: Batch size sensitivity on NanoGPT (124M)... Figure 3: SCION leads to 30% fewer epochs for ViT on ImageNet... Table 9. Shallow MLP hyperparameters. Dataset CIFAR10 (50000 training examples)... Figure 11 (right): The optimal stepsize transfers across width. |
| Hardware Specification | Yes | Table 8. Shallow GPT hyperparameters... We increase the batch size to 32, which is the maximum allowed for a model with an embedding size of 4096 on an A100. |
| Software Dependencies | No | We provide a reference implementation in PyTorch referred to as ScionLight. |
| Experiment Setup | Yes | Table 7. NanoGPT hyperparameters. Hyperparameter (AdamW / Muon / UNCONSTRAINED SCION / SCION): Layers 12; Head dim 128; Activation function ReLU^2 / Scaled ReLU^2 (see Appendix E.3); Vocabulary size 50304; Dataset FineWeb; Batch size 512; Block size 1024; Iterations n 5100; Warmdown 28.5%; Stepsize schedule: constant then linear decay, γ_k = γ if k < n − m, γ(n − k)/m if k ≥ n − m; Warmup 5% / 0; Gradient clipping Yes / No; Momentum β1 / β2 0.9 / 0.95; Averaging parameter α 0.1; Muon stepsize multiplier 0.1; Nesterov Yes; Boundary init. No; Radius ρ1 / ρℓ / ρL /50 / 3000 |
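
The uSCG/Scion pseudocode quoted above follows a simple pattern: maintain a momentum-averaged gradient estimate d_k, query a linear minimization oracle (LMO) over a norm ball, and take a step along the LMO output. Below is a minimal NumPy sketch of one such update, assuming an ℓ2-norm ball for the LMO; the paper uses layer-wise norm balls (e.g. spectral norm for matrix parameters), and the function names here are illustrative, not the repository's API:

```python
import numpy as np

def lmo_l2_ball(d, radius):
    """LMO for an l2-norm ball of the given radius:
    argmin_{||s|| <= radius} <s, d> = -radius * d / ||d||."""
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return np.zeros_like(d)
    return -radius * d / norm

def uscg_step(x, d_prev, grad, alpha, gamma, radius):
    """One unconstrained SCG (uSCG) update, per Algorithm 1's structure."""
    # Momentum-averaged gradient estimate: d_k = (1 - alpha) d_{k-1} + alpha g_k
    d = (1.0 - alpha) * d_prev + alpha * grad
    # Step along the LMO direction (unconstrained variant: additive update)
    x_new = x + gamma * lmo_l2_ball(d, radius)
    return x_new, d
```

For the constrained variant (SCG, Algorithm 2), the update is instead a convex combination of the current iterate and the LMO output, which keeps the iterate inside the feasible set.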
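
The "constant then linear decay" stepsize schedule in Table 7 (constant stepsize, then a linear warmdown over the final 28.5% of iterations) can be sketched as follows; the function name and signature are my own, not from the paper's code:

```python
def warmdown_lr(k, n, gamma, warmdown_frac=0.285):
    """Constant stepsize gamma, then linear decay to zero over the
    last warmdown_frac fraction of the n total iterations."""
    m = int(warmdown_frac * n)      # number of warmdown iterations
    if k < n - m:
        return gamma                # constant phase
    return gamma * (n - k) / m      # linear decay: gamma -> 0
```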