Progressive distillation induces an implicit curriculum

Authors: Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our theoretical and empirical findings on sparse parity, complemented by empirical observations on more complex tasks, highlight the benefit of progressive distillation via implicit curriculum across setups." |
| Researcher Affiliation | Academia | Princeton University; Carnegie Mellon University; University of Pennsylvania |
| Pseudocode | Yes | Algorithm 1: 2-stage training |
| Open Source Code | Yes | Code is available here. |
| Open Datasets | Yes | "We then extend our investigation to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world pre-training datasets (Wikipedia and Books). Through probing the teacher model, we identify an analogous implicit curriculum where the model progressively learns features that capture longer context." The choices of PCFGs are taken from Allen-Zhu & Li (2023a). |
| Dataset Splits | Yes | Evaluation is based on a held-out set of 4096 examples, with results averaged across 3 training seeds. For one-shot distillation, the teacher checkpoint at the end of training (the 20M checkpoint) is used, at which point the teacher has fully saturated. |
| Hardware Specification | No | The paper does not describe the hardware used to run its experiments (specific devices or accelerator types). |
| Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2014) and a cosine learning rate schedule (Loshchilov & Hutter, 2016), but provides no version numbers for software libraries or frameworks such as PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | A batch size of 512 is used in each setting, with the Adam optimizer (Kingma & Ba, 2014), zero weight decay, and (β1, β2) = (0.9, 0.95), under a cosine-decay learning rate schedule. The learning rate is tuned over the grid {10^-2, 7.5 × 10^-3, 5 × 10^-3, 2.5 × 10^-3, 10^-3} in each setting. The teacher is trained on 4 × 10^6 training samples (equal to 8 × 10^3 steps). |
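The experiment-setup row above can be summarized as a small configuration sketch. This is a minimal, framework-free illustration assuming the reported hyperparameters (batch size 512, Adam with zero weight decay and (β1, β2) = (0.9, 0.95), a cosine-decay schedule over 8 × 10^3 steps); the `cosine_lr` helper and its zero-warmup, decay-to-zero behavior are assumptions, since the excerpt does not spell out those schedule details.

```python
import math

# Hyperparameters reported in the paper's setup.
BATCH_SIZE = 512
BETAS = (0.9, 0.95)          # Adam (beta1, beta2)
WEIGHT_DECAY = 0.0
TOTAL_STEPS = 8_000          # teacher: 4e6 samples at batch size 512

# Learning-rate grid tuned over in each setting.
LR_GRID = [1e-2, 7.5e-3, 5e-3, 2.5e-3, 1e-3]

def cosine_lr(step: int, total_steps: int, peak_lr: float,
              min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr to min_lr over total_steps.

    Warmup is omitted because the excerpt does not specify one;
    this is an assumed form of the schedule, not the authors' code.
    """
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

A tuning loop would simply iterate `for lr in LR_GRID:` and train with `cosine_lr(step, TOTAL_STEPS, lr)` at each step, keeping the configuration that performs best on the held-out set.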