Distillation Scaling Laws
Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russell Webb
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To close this knowledge gap, we conduct a comprehensive, controlled study of distillation, with transformer students and teachers ranging from 143M to 12.6B parameters, trained on data ranging from a few billion up to 512B tokens. These experiments yield our distillation scaling law, which estimates student performance as a function of resources (the teacher, the student size, and the amount of distillation data). |
| Researcher Affiliation | Collaboration | 1Apple 2University of Oxford, UK. Work done during an internship at Apple. For a full breakdown of contributions see Appendix J. Correspondence to: Dan Busbridge <EMAIL>. |
| Pseudocode | Yes | import numpy as np def find(vector, value): """Find locations of value in vector.""" return np.where(vector == value)[0] def remove(vector, value): """Remove occurrences of value from vector.""" return np.delete(vector, find(vector, value)) def label(vector: np.ndarray, num_classes: int) -> np.ndarray: """Return the label in [0, num_classes) for vector.""" assert len(vector) == 2 * num_classes one_hot = vector[num_classes:] context = vector[:num_classes] i = find(one_hot, 1) if context[i] == 0: return i else: # remapping c = context[i] return remove(find(context, c), i) |
| Open Source Code | No | The paper mentions using "AXLearn (Apple, 2023)" and an "internal version of the open-source lm-evaluation-harness (Gao et al., 2024)", which are tools the authors used. However, it does not contain an explicit statement by the authors about releasing the source code for the methodology described in *this specific paper*, nor does it provide a direct link to such a repository. |
| Open Datasets | Yes | We use the English-only subset of the C4 dataset (Raffel et al., 2020) for all experiments. |
| Dataset Splits | Yes | For all distillation trainings, the teacher is trained on a different split from the student. The C4 dataset has roughly 180B tokens in total, which results in 90B unique tokens for the teacher training and 90B unique tokens for the student training. |
| Hardware Specification | No | The paper mentions 'Apple infrastructure' in the acknowledgments but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | Yes | Constrained numerical minimization using Sequential Least SQuares Programming (SLSQP) (Kraft, 1988) in SciPy (Virtanen et al., 2019). All models are evaluated using an internal version of the open-source lm-evaluation-harness (Gao et al., 2024). |
| Experiment Setup | Yes | All models use decoupled weight decay (Loshchilov & Hutter, 2019) of 10^-4 for regularization, as well as a simplified version of µP... Because of µP (simple), we fix the learning rate to 1e-2 across all model sizes. We train all models with a sequence length of 4096, with RoPE (Su et al., 2024) positional embeddings (base frequency set to 500k). Unless explicitly stated, models are trained on 500-512, or 20N samples, where N is the number of model parameters, whichever is larger. For all distillation training runs, the teacher is trained on a different split from the student. We train MLPs with two hidden layers of equal width; all non-linearities are Rectified Linear Units (ReLUs). All models are trained with Adam (Kingma & Ba, 2015) using a peak learning rate of 3x10^-4 and a single-cycle cosine learning rate schedule with a linear warmup of 5% of the total training steps. A batch size of 512 is used for all models. All model architectures in this work are presented in Table 13, have a fixed aspect ratio dmodel = 128 and a fixed ffn ratio ρffn = 8/3 coupled with gated linear activation (nffn = 3). |
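The Software Dependencies row notes that the scaling-law coefficients were fit with constrained SLSQP minimization in SciPy. The following is a minimal sketch of that pattern, not the authors' code: the objective here is a placeholder least-squares fit of a toy linear model, and the non-negativity constraint on the slope is illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data standing in for the (model size, loss) observations
# that the paper's scaling-law fit would consume.
x = np.array([1.0, 2.0, 4.0])
y = np.array([2.1, 3.9, 8.2])

def objective(params):
    """Sum of squared residuals for a toy model y = a*x + b."""
    a, b = params
    return float(np.sum((a * x + b - y) ** 2))

# SLSQP handles the inequality constraint a >= 0 (illustrative;
# scaling-law fits typically constrain exponents to be positive).
result = minimize(
    objective,
    x0=[1.0, 0.0],
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": lambda p: p[0]}],
)
a_fit, b_fit = result.x
```

In practice the objective would be the scaling-law functional form evaluated against the measured student losses, but the call structure (objective, initial guess, `method="SLSQP"`, constraint dictionaries) is the same.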
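The Experiment Setup row describes a single-cycle cosine learning-rate schedule with a linear warmup over the first 5% of training steps. A hedged sketch of that schedule, with illustrative step counts (the peak rate of 3e-4 matches the quoted setup, but `total_steps` and `min_lr` here are assumptions):

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.05, min_lr=0.0):
    """Single-cycle cosine schedule with linear warmup, per the quoted setup."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr toward min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000` and `peak_lr=3e-4`, the rate ramps linearly over the first 50 steps, hits the peak at the end of warmup, and decays toward zero by the final step.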