On the Provable Separation of Scales in Maximal Update Parameterization

Authors: Letong Hong, Zhangyang Wang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a proof-of-concept run confirming the qualitative macro-micro scale separation predicted by Theorem 4.1. Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10 000 (ReLU), output width 10. Weights are initialized with µP scaling and trained for 300 epochs using vanilla SGD (batch size 128, no momentum or weight decay) with cross-entropy loss. Learning-rate grid: η ∈ {0.01, 0.02, ..., 0.30}. We log (i) the training loss L(t; η) and (ii) the squared weight drift µ(t; η)² = ‖θ_t − θ_0‖² every 10 epochs.
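As a rough illustration of this setup, here is a minimal NumPy sketch of a µP-initialized two-layer ReLU MLP with a vanilla-SGD step and the squared-drift statistic ‖θ_t − θ_0‖². The specific µP multipliers used here (hidden-layer entries N(0, 1/d_in), output-layer entries N(0, 1/n²), i.e. an effective 1/n output multiplier) are our assumption for illustration, not taken from the paper.

```python
import numpy as np

def mup_init(d_in, n, d_out, rng):
    """muP-style init (a sketch, our assumed multipliers): hidden-layer
    entries N(0, 1/d_in); output-layer entries N(0, 1/n^2)."""
    U = rng.normal(0.0, d_in ** -0.5, size=(n, d_in))
    V = rng.normal(0.0, 1.0 / n, size=(d_out, n))
    return [U, V]

def forward(params, x):
    U, V = params
    h = np.maximum(U @ x, 0.0)            # ReLU hidden activations
    return V @ h, h

def sgd_step(params, x, y, lr):
    """One vanilla-SGD step on softmax cross-entropy for a single example."""
    U, V = params
    logits, h = forward(params, x)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    g = p.copy()
    g[y] -= 1.0                           # dL/dlogits for cross-entropy
    dV = np.outer(g, h)
    dh = V.T @ g
    dU = np.outer(dh * (h > 0), x)        # ReLU derivative mask
    return [U - lr * dU, V - lr * dV]

def squared_drift(params_t, params_0):
    """||theta_t - theta_0||^2, summed over all weight matrices."""
    return float(sum(np.sum((a - b) ** 2) for a, b in zip(params_t, params_0)))
```

A full run would use d_in = 3072, n = 10 000, d_out = 10 and mini-batches of 128 as described above; the sketch keeps the per-example case for brevity.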
Researcher Affiliation | Industry | XTY AI Labs, XTX Markets. Authors are listed in alphabetical order (α-β). Correspondence to: Atlas Wang <EMAIL>.
Pseudocode | No | The paper describes its methods with mathematical equations and textual explanations, but provides no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing code and no link to a code repository for the described methodology.
Open Datasets | Yes | Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10 000 (ReLU), output width 10.
Dataset Splits | No | The paper describes the model and training parameters but does not specify how CIFAR-10 was split into training, validation, or test sets, nor does it refer to standard splits or percentages.
Hardware Specification | No | The paper describes the experimental setup in Section B ('Simulations') but does not specify hardware details such as GPU models, CPU types, or other computing resources used.
Software Dependencies | No | The paper mentions training with 'vanilla SGD' and 'cross-entropy loss' but does not name any software libraries or version numbers used for the implementation.
Experiment Setup | Yes | Setup: two-layer MLP, input dimension 3072 (flattened CIFAR-10), hidden width n = 10 000 (ReLU), output width 10. Weights are initialized with µP scaling and trained for 300 epochs using vanilla SGD (batch size 128, no momentum or weight decay) with cross-entropy loss. Learning-rate grid: η ∈ {0.01, 0.02, ..., 0.30}.
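The learning-rate grid and logging cadence stated above can be written down directly. This is a minimal sweep-harness sketch; the training loop itself is elided, and the `results` dictionary layout is our assumption:

```python
import numpy as np

# Learning-rate grid: eta in {0.01, 0.02, ..., 0.30}
etas = np.round(np.arange(1, 31) * 0.01, 2)

# Logging schedule: every 10 epochs over a 300-epoch run
log_epochs = list(range(10, 301, 10))

# Hypothetical results store: (eta, epoch) -> (train_loss, squared_drift)
results = {}
for eta in etas:
    for epoch in log_epochs:
        pass  # train 10 more epochs, then record L(t; eta) and ||theta_t - theta_0||^2
```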