On the Provable Separation of Scales in Maximal Update Parameterization

Authors: Letong Hong, Zhangyang Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a proof-of-concept run confirming the qualitative macro-micro scale separation predicted by Theorem 4.1. Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10 000 (ReLU), output width 10. Weights are initialized with µP scaling and trained for 300 epochs using vanilla SGD (batch size 128, no momentum or weight decay) with cross-entropy loss. Learning-rate grid: η ∈ {0.01, 0.02, ..., 0.30}. We log (i) the training loss L(t; η) and (ii) the squared weight drift µ(t; η)² = ‖θ_t − θ_0‖² every 10 epochs.
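As a rough illustration of this setup, here is a minimal NumPy sketch of a µP-initialized two-layer ReLU MLP with a vanilla-SGD step and the squared-drift statistic ‖θ_t − θ_0‖². The specific µP multipliers used here (hidden-layer entries N(0, 1/d_in), output-layer entries N(0, 1/n²), i.e. an effective 1/n output multiplier) are our assumption for illustration, not taken from the paper.

```python
import numpy as np

def mup_init(d_in, n, d_out, rng):
    """muP-style init (a sketch, our assumed multipliers): hidden-layer
    entries N(0, 1/d_in); output-layer entries N(0, 1/n^2)."""
    U = rng.normal(0.0, d_in ** -0.5, size=(n, d_in))
    V = rng.normal(0.0, 1.0 / n, size=(d_out, n))
    return [U, V]

def forward(params, x):
    U, V = params
    h = np.maximum(U @ x, 0.0)            # ReLU hidden activations
    return V @ h, h

def sgd_step(params, x, y, lr):
    """One vanilla-SGD step on softmax cross-entropy for a single example."""
    U, V = params
    logits, h = forward(params, x)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    g = p.copy()
    g[y] -= 1.0                           # dL/dlogits for cross-entropy
    dV = np.outer(g, h)
    dh = V.T @ g
    dU = np.outer(dh * (h > 0), x)        # ReLU derivative mask
    return [U - lr * dU, V - lr * dV]

def squared_drift(params_t, params_0):
    """||theta_t - theta_0||^2, summed over all weight matrices."""
    return float(sum(np.sum((a - b) ** 2) for a, b in zip(params_t, params_0)))
```

A full run would use d_in = 3072, n = 10 000, d_out = 10 and mini-batches of 128 as described above; the sketch keeps the per-example case for brevity.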
Researcher Affiliation | Industry | XTY AI Labs, XTX Markets. Authors are listed in alphabetical order (α-β). Correspondence to: Atlas Wang <EMAIL>.
Pseudocode | No | The paper describes its methods with mathematical equations and textual explanations, but provides no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing code and no link to a code repository for the described methodology.
Open Datasets | Yes | Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10 000 (ReLU), output width 10.
Dataset Splits | No | The paper describes the model and training parameters but does not specify how CIFAR-10 was split into training, validation, or test sets, nor does it refer to standard splits or percentages.
Hardware Specification | No | The paper describes the experimental setup in Section B ('Simulations') but does not specify hardware details such as GPU models, CPU types, or other computing resources used.
Software Dependencies | No | The paper mentions training with 'vanilla SGD' and 'cross-entropy loss' but does not name any software libraries or version numbers used for the implementation.
Experiment Setup | Yes | Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10 000 (ReLU), output width 10. Weights are initialized with µP scaling and trained for 300 epochs using vanilla SGD (batch size 128, no momentum or weight decay) with cross-entropy loss. Learning-rate grid: η ∈ {0.01, 0.02, ..., 0.30}.
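The learning-rate grid and logging cadence stated above can be written down directly. This is a minimal sweep-harness sketch; the training loop itself is elided, and the `results` dictionary layout is our assumption:

```python
import numpy as np

# Learning-rate grid: eta in {0.01, 0.02, ..., 0.30}
etas = np.round(np.arange(1, 31) * 0.01, 2)

# Logging schedule: every 10 epochs over a 300-epoch run
log_epochs = list(range(10, 301, 10))

# Hypothetical results store: (eta, epoch) -> (train_loss, squared_drift)
results = {}
for eta in etas:
    for epoch in log_epochs:
        pass  # train 10 more epochs, then record L(t; eta) and ||theta_t - theta_0||^2
```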