Progressive distillation induces an implicit curriculum
Authors: Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical and empirical findings on sparse parity, complemented by empirical observations on more complex tasks, highlight the benefit of progressive distillation via implicit curriculum across setups. |
| Researcher Affiliation | Academia | Princeton University; Carnegie Mellon University; University of Pennsylvania |
| Pseudocode | Yes | Algorithm 1: 2-stage training |
| Open Source Code | Yes | Code is available here. |
| Open Datasets | Yes | We then extend our investigation to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world pre-training datasets (Wikipedia and Books). Through probing the teacher model, we identify an analogous implicit curriculum where the model progressively learns features that capture longer context. Our choices of PCFGs are taken from Allen-Zhu & Li (2023a) |
| Dataset Splits | Yes | Evaluation is based on a held-out set consisting of 4096 examples, and we report the average across 3 different training seeds. For one-shot distillation, we use the teacher checkpoint at the end of training (20M checkpoint), at which point the teacher has fully saturated. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments with specific models or types. |
| Software Dependencies | No | The paper mentions 'Adam optimizer (Kingma & Ba, 2014)' and 'cosine learning rate schedule (Loshchilov & Hutter, 2016)' but does not provide specific version numbers for software libraries or frameworks like PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | We use a batch size of 512 in each setting. We use the Adam (Kingma & Ba, 2014) optimizer with 0 weight decay and (β1, β2) = (0.9, 0.95). We use a cosine decay learning rate schedule. We extensively tune the learning rate in the grid {10^−2, 7.5 × 10^−3, 5 × 10^−3, 2.5 × 10^−3, 10^−3} in each setting. We train the teacher on 4 × 10^6 training samples (equal to 8 × 10^3 steps). |
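The experiment-setup row above can be sketched as a small configuration snippet. This is a minimal illustration assuming a framework-agnostic style; the paper does not release exact scripts here, and the names `cosine_decay_lr`, `LR_GRID`, and `TOTAL_STEPS` are our own (only the batch size, Adam betas, weight decay, learning-rate grid, and step count are quoted from the table):

```python
import math

# Hyperparameters quoted from the paper's reported setup.
BATCH_SIZE = 512
ADAM_BETAS = (0.9, 0.95)   # (β1, β2)
WEIGHT_DECAY = 0.0
LR_GRID = [1e-2, 7.5e-3, 5e-3, 2.5e-3, 1e-3]  # tuned per setting
TOTAL_STEPS = 8_000        # 4e6 training samples at batch size 512

def cosine_decay_lr(step: int, base_lr: float, total_steps: int) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

A tuning run would then sweep `LR_GRID`, instantiating Adam with the betas and weight decay above and applying `cosine_decay_lr` at each step.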