Progressive distillation induces an implicit curriculum
Authors: Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical and empirical findings on sparse parity, complemented by empirical observations on more complex tasks, highlight the benefit of progressive distillation via implicit curriculum across setups. |
| Researcher Affiliation | Academia | Princeton University; Carnegie Mellon University; University of Pennsylvania |
| Pseudocode | Yes | Algorithm 1: 2-stage training |
| Open Source Code | Yes | Code is available here. |
| Open Datasets | Yes | We then extend our investigation to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world pre-training datasets (Wikipedia and Books). Through probing the teacher model, we identify an analogous implicit curriculum where the model progressively learns features that capture longer context. Our choices of PCFGs are taken from Allen-Zhu & Li (2023a) |
| Dataset Splits | Yes | Evaluation is based on a held-out set consisting of 4096 examples, and we report the average across 3 different training seeds. For one-shot distillation, we use the teacher checkpoint at the end of training (20M checkpoint), at which point the teacher has fully saturated. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments with specific models or types. |
| Software Dependencies | No | The paper mentions 'Adam optimizer (Kingma & Ba, 2014)' and 'cosine learning rate schedule (Loshchilov & Hutter, 2016)' but does not provide specific version numbers for software libraries or frameworks like PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | We use a batch size of 512 in each setting. We use the Adam (Kingma & Ba, 2014) optimizer with 0 weight decay and (β1, β2) = (0.9, 0.95). We use a cosine decay learning rate schedule. We extensively tune the learning rate in the grid {10^−2, 7.5 × 10^−3, 5 × 10^−3, 2.5 × 10^−3, 10^−3} in each setting. We train the teacher on 4 × 10^6 training samples (equal to 8 × 10^3 steps). |
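The experiment-setup row above can be sketched as a small configuration snippet. This is a minimal illustration assuming a framework-agnostic style; the paper does not release exact scripts here, and the names `cosine_decay_lr`, `LR_GRID`, and `TOTAL_STEPS` are our own (only the batch size, Adam betas, weight decay, learning-rate grid, and step count are quoted from the table):

```python
import math

# Hyperparameters quoted from the paper's reported setup.
BATCH_SIZE = 512
ADAM_BETAS = (0.9, 0.95)   # (β1, β2)
WEIGHT_DECAY = 0.0
LR_GRID = [1e-2, 7.5e-3, 5e-3, 2.5e-3, 1e-3]  # tuned per setting
TOTAL_STEPS = 8_000        # 4e6 training samples at batch size 512

def cosine_decay_lr(step: int, base_lr: float, total_steps: int) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

A tuning run would then sweep `LR_GRID`, instantiating Adam with the betas and weight decay above and applying `cosine_decay_lr` at each step.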