On the Provable Separation of Scales in Maximal Update Parameterization
Authors: Letong Hong, Zhangyang Wang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a proof-of-concept run confirming the qualitative macro-micro scale separation predicted by Theorem 4.1. Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10,000 (ReLU), output width 10. Weights are initialized with µP scaling and trained for 300 epochs using vanilla SGD (batch size 128, no momentum or weight decay) with cross-entropy loss. Learning-rate grid η ∈ {0.01, 0.02, …, 0.30}. We log (i) the training loss L(t; η) and (ii) the squared weight drift µ(t; η)² = ‖θ_t − θ_0‖² every 10 epochs. |
| Researcher Affiliation | Industry | XTY AI Labs, XTX Markets. Authors are listed in alphabetical order (α–β). Correspondence to: Atlas Wang <EMAIL>. |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, but no structured pseudocode or algorithm blocks are explicitly provided. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10,000 (ReLU), output width 10. |
| Dataset Splits | No | The paper describes the model and training parameters but does not specify how CIFAR-10 was split into training, validation, or test sets, nor does it refer to standard splits or percentages. |
| Hardware Specification | No | The paper describes the experimental setup in Section B 'Simulations' but does not specify any hardware details like GPU models, CPU types, or other computing resources used. |
| Software Dependencies | No | The paper mentions training with 'vanilla SGD' and 'cross-entropy loss' but does not specify any software libraries or their version numbers used for implementation. |
| Experiment Setup | Yes | Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10,000 (ReLU), output width 10. Weights are initialized with µP scaling and trained for 300 epochs using vanilla SGD (batch size 128, no momentum or weight decay) with cross-entropy loss. Learning-rate grid η ∈ {0.01, 0.02, …, 0.30}. |
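The setup described in the table can be sketched in plain NumPy. This is a minimal illustration, not the paper's code: the µP scaling exponents used here (hidden weights with std 1/√d_in, output weights with std 1/n) are a common µP prescription assumed for the sketch, the per-layer learning-rate corrections µP also prescribes are omitted, and the dimensions in the demo are shrunk from the paper's d_in = 3072, n = 10,000 so it runs quickly on random data.

```python
import numpy as np

def mup_init(d_in, n, d_out, rng):
    """muP-style init sketch for a two-layer MLP.
    Assumption: hidden weights ~ N(0, 1/d_in); output weights ~ N(0, 1/n^2)."""
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(n, d_in))
    W2 = rng.normal(0.0, 1.0 / n, size=(d_out, n))
    return W1, W2

def sgd_epoch(W1, W2, X, y, eta):
    """One full-batch vanilla-SGD step on softmax cross-entropy
    (manual backprop, no momentum or weight decay)."""
    B = X.shape[0]
    Z = X @ W1.T                                  # (B, n) pre-activations
    H = np.maximum(Z, 0.0)                        # ReLU
    logits = H @ W2.T                             # (B, d_out)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(P[np.arange(B), y] + 1e-12))
    G = P.copy()
    G[np.arange(B), y] -= 1.0
    G /= B                                        # dL/dlogits
    gW2 = G.T @ H
    gZ = (G @ W2) * (Z > 0)
    gW1 = gZ.T @ X
    return W1 - eta * gW1, W2 - eta * gW2, loss

def drift_sq(theta_t, theta_0):
    """Squared weight drift ||theta_t - theta_0||^2, the quantity logged every 10 epochs."""
    return sum(float(np.sum((a - b) ** 2)) for a, b in zip(theta_t, theta_0))

# Toy-scale demo on random data (the paper: d_in=3072, n=10_000, 300 epochs).
rng = np.random.default_rng(0)
d_in, n, d_out, B = 32, 100, 10, 16
W1, W2 = mup_init(d_in, n, d_out, rng)
theta0 = (W1.copy(), W2.copy())
X = rng.normal(size=(B, d_in))
y = rng.integers(0, d_out, size=B)
etas = np.round(np.arange(0.01, 0.301, 0.01), 2)  # grid eta in {0.01, ..., 0.30}
for _ in range(10):
    W1, W2, loss = sgd_epoch(W1, W2, X, y, eta=0.05)
print(drift_sq((W1, W2), theta0))
```

In the paper's actual sweep, one such run is repeated for each η on the grid and the (loss, drift) trajectories are compared across learning rates.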