Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
Authors: Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, Atish Agarwala
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our main empirical findings in this section, independently on multiple tasks and architectures which can be studied even in academic settings. |
| Researcher Affiliation | Collaboration | Shikai Qiu¹, Lechao Xiao², Andrew Gordon Wilson¹, Jeffrey Pennington², Atish Agarwala² — ¹New York University, ²Google DeepMind. Correspondence to: Shikai Qiu <EMAIL>, Atish Agarwala <EMAIL>. |
| Pseudocode | No | The paper describes its methodology through prose and mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Our code can be found here. |
| Open Datasets | Yes | We consider two next-token prediction tasks: 1) CIFAR-5M (Nakkiran et al., 2020), a dataset of 6M generated CIFAR-like images, and 2) Lichess, a collection of chess games recorded in algebraic chess notation where the goal is to predict the next move in the game. Our experiments use the Lichess dataset available on Hugging Face at https://huggingface.co/datasets/Lichess/standard-chess-games. |
| Dataset Splits | No | The paper mentions training on CIFAR-5M and Lichess datasets, and discusses training epochs and data reuse, but it does not specify explicit training, validation, or test splits (e.g., percentages, sample counts, or references to predefined splits). |
| Hardware Specification | Yes | SQ was supported by Google's TPU Research Cloud (TRC) program: https://sites.research.google/trc/. |
| Software Dependencies | No | The paper mentions using specific components like 'GeLU activations', 'RMSNorm', 'µP', and 'Adam' for training, but it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | For CIFAR-5M: following µP, we parameterize the learning rate for each weight matrix as η = η_base/D, where D is the model dimension, except for the embedding matrix, which has η = η_base. We use η_base = 4 and a = 0.1, as they led to good performance in our early experiments. We initialize the embedding matrix as W^emb_ij ∼ N(0, 1), the output head as W^head = 0, and all other non-readout matrices W as W_ij ∼ N(0, 1/D). These hyperparameters were determined with a small amount of tuning in early experiments. We use a batch size of 256 images and a linear warmup for 1000 steps. |
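The µP-style setup quoted above can be made concrete with a short sketch. This is a minimal illustrative reconstruction, not the paper's actual code: the layer names (`W_emb`, `W_hidden`, `W_head`), the vocabulary size, and the helper functions are assumptions; only D-scaling of learning rates, the initialization scales, and the 1000-step linear warmup come from the quoted text.

```python
import numpy as np

# Hypothetical sketch of the muP-style parameterization described in the quote.
# D, eta_base, and the warmup length come from the text; everything else
# (layer names, vocab size) is illustrative.
rng = np.random.default_rng(0)
D, vocab, eta_base = 256, 1000, 4.0

params = {
    "W_emb": rng.normal(0.0, 1.0, size=(vocab, D)),              # W^emb_ij ~ N(0, 1)
    "W_hidden": rng.normal(0.0, 1.0 / np.sqrt(D), size=(D, D)),  # W_ij ~ N(0, 1/D)
    "W_head": np.zeros((D, vocab)),                              # W^head = 0
}

def lr_for(name: str) -> float:
    """muP rule from the quote: embedding trains at eta_base,
    every other matrix at eta_base / D."""
    return eta_base if name == "W_emb" else eta_base / D

def warmup_scale(step: int, warmup_steps: int = 1000) -> float:
    """Linear warmup over the first 1000 steps, as stated in the quote."""
    return min(1.0, step / warmup_steps)
```

Note the scale convention: N(0, 1/D) denotes variance 1/D, so the standard deviation passed to the sampler is 1/√D.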