Modular Duality in Deep Learning

Authors: Jeremy Bernstein, Laker Newhouse

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer. ... Third contribution. We run two experiments.
Researcher Affiliation | Academia | MIT CSAIL, United States. Correspondence to: Jeremy Bernstein <EMAIL>, Laker Newhouse <EMAIL>.
Pseudocode | No | The paper describes methods and mathematical derivations but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 5.3 describes the rectangular Newton-Schulz iteration in descriptive text rather than in a structured algorithm block.
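For reference, the rectangular Newton-Schulz orthogonalization that the paper describes in prose can be sketched as follows. This is a minimal NumPy version written for this review, not the authors' implementation; the quintic coefficients (2, −1.5, 0.5) are taken from the experiment description quoted below.

```python
import numpy as np

def newton_schulz(G, steps=5, coeffs=(2.0, -1.5, 0.5)):
    """Approximately map G = U @ S @ V.T to U @ V.T by iterating the odd
    polynomial p(x) = a*x + b*x**3 + c*x**5 on G's singular values.
    Note p(1) = 2 - 1.5 + 0.5 = 1, so 1 is a fixed point of the iteration."""
    a, b, c = coeffs
    X = G / np.linalg.norm(G, 2)  # scale so all singular values lie in (0, 1]
    transpose = X.shape[0] < X.shape[1]
    if transpose:
        X = X.T  # work with a tall matrix so X.T @ X is the small Gram matrix
    for _ in range(steps):
        A = X.T @ X
        X = a * X + X @ (b * A + c * (A @ A))  # applies p to each singular value
    return X.T if transpose else X
```

Each iteration pushes every singular value closer to 1 while leaving the singular vectors untouched, so a few steps of matrix multiplication replace an explicit SVD.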
Open Source Code | Yes | Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer. ... In fact, based on our work, a new NanoGPT training speed record was recently set using a Newton-Schulz-based duality map, packaged into an open-source optimizer called Muon (Jordan et al., 2024b).
Open Datasets | Yes | Datasets. The dataset for all experiments is CIFAR-10 (Krizhevsky & Hinton, 2009).
Dataset Splits | Yes | We use the standard train and test splits with no data augmentation.
Hardware Specification | No | The paper mentions that Newton-Schulz iterations can run in bfloat16, suggesting GPU usage for some related work ("NanoGPT speedruns"), but it does not specify any particular GPU models, CPU types, or other hardware specifications used for the experiments presented in this paper.
Software Dependencies | No | The hyperparameters for Adam are the default ones in PyTorch... All other hyperparameters are the default ones in PyTorch. The paper mentions PyTorch but does not provide any version numbers for it or any other software libraries.
Experiment Setup | Yes | The architecture for all experiments is a 3-layer MLP with a ReLU nonlinearity, one hidden layer, and no biases. ... We run experiments with hidden layer widths 32, 64, 128, 256, 512, 1024, 2048, and 4096. For each width, we sweep between 10 and 20 different learning rates. We train for 20 epochs with batch size 128. ... We use orthogonal weight initialization. Concretely, we create weight matrices with unit Gaussian entries and then iterate them through Newton-Schulz for 30 steps. Our duality-based optimizer in this experiment uses no momentum. It passes the raw gradient through the duality map UΣVᵀ ↦ √(d_out/d_in) · UVᵀ, implemented via 5 steps of Newton-Schulz iteration and then multiplying by the dimensional constant. We use a quintic Newton-Schulz iteration with coefficients (2, −1.5, 0.5).
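The two Newton-Schulz uses described in this setup (the duality map on raw gradients, and the 30-step orthogonal initialization) can be sketched together. This is an illustrative NumPy sketch based only on the quoted description, not the authors' code; it assumes weight matrices of shape (d_out, d_in) with d_out ≥ d_in, and the helper names are ours.

```python
import numpy as np

def orthogonalize(M, steps, coeffs=(2.0, -1.5, 0.5)):
    # Quintic Newton-Schulz: iterate p(x) = a*x + b*x**3 + c*x**5 on the
    # singular values of M; assumes M has shape (d_out, d_in), d_out >= d_in.
    a, b, c = coeffs
    X = M / np.linalg.norm(M, 2)  # spectral normalization, singular values in (0, 1]
    for _ in range(steps):
        A = X.T @ X
        X = a * X + X @ (b * A + c * (A @ A))
    return X

def dualize(grad):
    # Duality map for a linear layer: U S V^T -> sqrt(d_out/d_in) * U V^T,
    # via 5 Newton-Schulz steps followed by the dimensional constant.
    d_out, d_in = grad.shape
    return np.sqrt(d_out / d_in) * orthogonalize(grad, steps=5)

def init_weight(d_out, d_in, rng):
    # Orthogonal init as described: unit-Gaussian entries, then 30 NS steps.
    return orthogonalize(rng.standard_normal((d_out, d_in)), steps=30)
```

A momentum-free update, as in the quoted experiment, would then be `W -= lr * dualize(grad)` per layer.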