Modular Duality in Deep Learning

Authors: Jeremy Bernstein, Laker Newhouse

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer. ... Third contribution. We run two experiments.
Researcher Affiliation | Academia | MIT CSAIL, United States. Correspondence to: Jeremy Bernstein <EMAIL>, Laker Newhouse <EMAIL>.
Pseudocode | No | The paper describes methods and mathematical derivations but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 5.3 describes the rectangular Newton-Schulz iteration in descriptive text rather than in a structured algorithm block.
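For reference, the rectangular Newton-Schulz orthogonalization that the paper describes in prose can be sketched as follows. This is a minimal NumPy version written for this review, not the authors' implementation; the quintic coefficients (2, −1.5, 0.5) are taken from the experiment description quoted below.

```python
import numpy as np

def newton_schulz(G, steps=5, coeffs=(2.0, -1.5, 0.5)):
    """Approximately map G = U @ S @ V.T to U @ V.T by iterating the odd
    polynomial p(x) = a*x + b*x**3 + c*x**5 on G's singular values.
    Note p(1) = 2 - 1.5 + 0.5 = 1, so 1 is a fixed point of the iteration."""
    a, b, c = coeffs
    X = G / np.linalg.norm(G, 2)  # scale so all singular values lie in (0, 1]
    transpose = X.shape[0] < X.shape[1]
    if transpose:
        X = X.T  # work with a tall matrix so X.T @ X is the small Gram matrix
    for _ in range(steps):
        A = X.T @ X
        X = a * X + X @ (b * A + c * (A @ A))  # applies p to each singular value
    return X.T if transpose else X
```

Each iteration pushes every singular value closer to 1 while leaving the singular vectors untouched, so a few steps of matrix multiplication replace an explicit SVD.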
Open Source Code | Yes | Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer. ... In fact, based on our work, a new NanoGPT training speed record was recently set using a Newton-Schulz-based duality map, packaged into an open-source optimizer called Muon (Jordan et al., 2024b).
Open Datasets | Yes | Datasets. The dataset for all experiments is CIFAR-10 (Krizhevsky & Hinton, 2009).
Dataset Splits | Yes | We use the standard train and test splits with no data augmentation.
Hardware Specification | No | The paper mentions that Newton-Schulz iterations can run in bfloat16, suggesting GPU usage for some related work ("NanoGPT speedruns"), but it does not specify any particular GPU models, CPU types, or other hardware specifications used for the experiments presented in this paper.
Software Dependencies | No | The hyperparameters for Adam are the default ones in PyTorch... All other hyperparameters are the default ones in PyTorch. The paper mentions PyTorch but does not provide any version numbers for it or any other software libraries.
Experiment Setup | Yes | The architecture for all experiments is a 3-layer MLP with a ReLU nonlinearity, one hidden layer, and no biases. ... We run experiments with hidden layer widths 32, 64, 128, 256, 512, 1024, 2048, and 4096. For each width, we sweep between 10 and 20 different learning rates. We train for 20 epochs with batch size 128. ... We use orthogonal weight initialization. Concretely, we create weight matrices with unit Gaussian entries and then iterate them through Newton-Schulz for 30 steps. Our duality-based optimizer in this experiment uses no momentum. It passes the raw gradient through the duality map UΣVᵀ ↦ √(d_out/d_in) · UVᵀ, implemented via 5 steps of Newton-Schulz iteration and then multiplying by the dimensional constant. We use a quintic Newton-Schulz iteration with coefficients (2, −1.5, 0.5).
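The two Newton-Schulz uses described in this setup (the duality map on raw gradients, and the 30-step orthogonal initialization) can be sketched together. This is an illustrative NumPy sketch based only on the quoted description, not the authors' code; it assumes weight matrices of shape (d_out, d_in) with d_out ≥ d_in, and the helper names are ours.

```python
import numpy as np

def orthogonalize(M, steps, coeffs=(2.0, -1.5, 0.5)):
    # Quintic Newton-Schulz: iterate p(x) = a*x + b*x**3 + c*x**5 on the
    # singular values of M; assumes M has shape (d_out, d_in), d_out >= d_in.
    a, b, c = coeffs
    X = M / np.linalg.norm(M, 2)  # spectral normalization, singular values in (0, 1]
    for _ in range(steps):
        A = X.T @ X
        X = a * X + X @ (b * A + c * (A @ A))
    return X

def dualize(grad):
    # Duality map for a linear layer: U S V^T -> sqrt(d_out/d_in) * U V^T,
    # via 5 Newton-Schulz steps followed by the dimensional constant.
    d_out, d_in = grad.shape
    return np.sqrt(d_out / d_in) * orthogonalize(grad, steps=5)

def init_weight(d_out, d_in, rng):
    # Orthogonal init as described: unit-Gaussian entries, then 30 NS steps.
    return orthogonalize(rng.standard_normal((d_out, d_in)), steps=30)
```

A momentum-free update, as in the quoted experiment, would then be `W -= lr * dualize(grad)` per layer.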