Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
Authors: Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, Atish Agarwala
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our main empirical findings in this section, independently on multiple tasks and architectures which can be studied even in academic settings. |
| Researcher Affiliation | Collaboration | Shikai Qiu¹, Lechao Xiao², Andrew Gordon Wilson¹, Jeffrey Pennington², Atish Agarwala² — ¹New York University, ²Google DeepMind. Correspondence to: Shikai Qiu <EMAIL>, Atish Agarwala <EMAIL>. |
| Pseudocode | No | The paper describes its methodology through prose and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Our code can be found here. |
| Open Datasets | Yes | We consider two next-token prediction tasks: 1) CIFAR-5M (Nakkiran et al., 2020), a dataset of 6M generated CIFAR-like images, and 2) Lichess, a collection of chess games recorded in algebraic chess notation where the goal is to predict the next move in the game. Our experiments use the Lichess dataset available on Hugging Face at https://huggingface.co/datasets/Lichess/standard-chess-games. |
| Dataset Splits | No | The paper mentions training on CIFAR-5M and Lichess datasets, and discusses training epochs and data reuse, but it does not specify explicit training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits). |
| Hardware Specification | Yes | SQ was supported by Google's TPU Research Cloud (TRC) program: https://sites.research.google/trc/. |
| Software Dependencies | No | The paper mentions using specific components like 'GeLU activations', 'RMSNorm', 'µP', and 'Adam' for training, but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | For CIFAR-5M: following µP, we parameterize the learning rate for each weight matrix as η = η_base/D, where D is the model dimension, except for the embedding matrix, which has η = η_base. We use η_base = 4 and a = 0.1, as they led to good performance in our early experiments. We initialize the embedding matrix as W^emb_ij ∼ N(0, 1), the output head as W^head = 0, and all other non-readout matrices W as W_ij ∼ N(0, 1/D). These hyperparameters were determined with a small amount of tuning in early experiments. We use a batch size of 256 images and a linear warmup for 1000 steps. |
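The µP-style setup quoted above can be made concrete with a short sketch. This is a minimal illustrative reconstruction, not the paper's actual code: the layer names (`W_emb`, `W_hidden`, `W_head`), the vocabulary size, and the helper functions are assumptions; only D-scaling of learning rates, the initialization scales, and the 1000-step linear warmup come from the quoted text.

```python
import numpy as np

# Hypothetical sketch of the muP-style parameterization described in the quote.
# D, eta_base, and the warmup length come from the text; everything else
# (layer names, vocab size) is illustrative.
rng = np.random.default_rng(0)
D, vocab, eta_base = 256, 1000, 4.0

params = {
    "W_emb": rng.normal(0.0, 1.0, size=(vocab, D)),              # W^emb_ij ~ N(0, 1)
    "W_hidden": rng.normal(0.0, 1.0 / np.sqrt(D), size=(D, D)),  # W_ij ~ N(0, 1/D)
    "W_head": np.zeros((D, vocab)),                              # W^head = 0
}

def lr_for(name: str) -> float:
    """muP rule from the quote: embedding trains at eta_base,
    every other matrix at eta_base / D."""
    return eta_base if name == "W_emb" else eta_base / D

def warmup_scale(step: int, warmup_steps: int = 1000) -> float:
    """Linear warmup over the first 1000 steps, as stated in the quote."""
    return min(1.0, step / warmup_steps)
```

Note the scale convention: N(0, 1/D) denotes variance 1/D, so the standard deviation passed to the sampler is 1/√D.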