Distillation Scaling Laws
Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russell Webb
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To close this knowledge gap, we conduct a comprehensive, controlled study of distillation, with transformer students and teachers ranging from 143M to 12.6B parameters, trained on data ranging from a few billion up to 512B tokens. These experiments yield our distillation scaling law, which estimates student performance as a function of resources (the teacher, the student size, and the amount of distillation data). |
| Researcher Affiliation | Collaboration | 1Apple 2University of Oxford, UK. Work done during an internship at Apple. For a full breakdown of contributions see Appendix J. Correspondence to: Dan Busbridge <EMAIL>. |
| Pseudocode | Yes | import numpy as np def find(vector, value): """Find locations of value in vector.""" return np.where(vector == value)[0] def remove(vector, value): """Remove occurrences of value from vector.""" return np.delete(vector, find(vector, value)) def label(vector: np.ndarray, num_classes: int) -> np.ndarray: """Return the label in [0, num_classes) for vector.""" assert len(vector) == 2 * num_classes one_hot = vector[num_classes:] context = vector[:num_classes] i = find(one_hot, 1) if context[i] == 0: return i else: # remapping c = context[i] return remove(find(context, c), i) |
| Open Source Code | No | The paper mentions using "AXLearn (Apple, 2023)" and an "internal version of the open-source lm-evaluation-harness (Gao et al., 2024)", which are tools the authors used. However, it does not contain an explicit statement by the authors about releasing the source code for the methodology described in *this specific paper*, nor does it provide a direct link to such a repository. |
| Open Datasets | Yes | We use the English-only subset of the C4 dataset (Raffel et al., 2020) for all experiments. |
| Dataset Splits | Yes | For all distillation trainings, the teacher is trained on a different split from the student. The C4 dataset has roughly 180B tokens in total, which results in 90B unique tokens for the teacher training and 90B unique tokens for the student training. |
| Hardware Specification | No | The paper mentions 'Apple infrastructure' in the acknowledgments but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | Yes | Constrained numerical minimization using Sequential Least SQuares Programming (SLSQP) (Kraft, 1988) in SciPy (Virtanen et al., 2019). All models are evaluated using an internal version of the open-source lm-evaluation-harness (Gao et al., 2024). |
| Experiment Setup | Yes | All models use decoupled weight decay (Loshchilov & Hutter, 2019) of 10^-4 for regularization, as well as a simplified version of µP... Because of µP (simple), we fix the learning rate to 1e-2 across all model sizes. We train all models with a sequence length of 4096, with RoPE (Su et al., 2024) positional embeddings (base frequency set to 500k). Unless explicitly stated, models are trained on 500-512, or 20N samples, where N is the number of model parameters, whichever is larger. For all distillation training runs, the teacher is trained on a different split from the student. We train MLPs with two hidden layers of equal width; all non-linearities are Rectified Linear Units (ReLUs). All models are trained with Adam (Kingma & Ba, 2015) using a peak learning rate of 3x10^-4 and a single-cycle cosine learning rate schedule with a linear warmup of 5% of the total training steps. A batch size of 512 is used for all models. All model architectures in this work are presented in Table 13, have a fixed aspect ratio dmodel = 128 and a fixed ffn ratio ρffn = 8/3 coupled with gated linear activation (nffn = 3). |
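The Software Dependencies row notes that the scaling-law coefficients were fit with constrained SLSQP minimization in SciPy. The following is a minimal sketch of that pattern, not the authors' code: the objective here is a placeholder least-squares fit of a toy linear model, and the non-negativity constraint on the slope is illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data standing in for the (model size, loss) observations
# that the paper's scaling-law fit would consume.
x = np.array([1.0, 2.0, 4.0])
y = np.array([2.1, 3.9, 8.2])

def objective(params):
    """Sum of squared residuals for a toy model y = a*x + b."""
    a, b = params
    return float(np.sum((a * x + b - y) ** 2))

# SLSQP handles the inequality constraint a >= 0 (illustrative;
# scaling-law fits typically constrain exponents to be positive).
result = minimize(
    objective,
    x0=[1.0, 0.0],
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": lambda p: p[0]}],
)
a_fit, b_fit = result.x
```

In practice the objective would be the scaling-law functional form evaluated against the measured student losses, but the call structure (objective, initial guess, `method="SLSQP"`, constraint dictionaries) is the same.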
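The Experiment Setup row describes a single-cycle cosine learning-rate schedule with a linear warmup over the first 5% of training steps. A hedged sketch of that schedule, with illustrative step counts (the peak rate of 3e-4 matches the quoted setup, but `total_steps` and `min_lr` here are assumptions):

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.05, min_lr=0.0):
    """Single-cycle cosine schedule with linear warmup, per the quoted setup."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000` and `peak_lr=3e-4`, the rate ramps linearly over the first 50 steps, hits the peak at the end of warmup, and decays toward zero by the final step.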