MiniPLM: Knowledge Distillation for Pre-training Language Models

Authors: Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MINIPLM boosts the student LMs' performance on 9 common downstream tasks, improves language modeling capabilities, and reduces pre-training computation. The benefit of MINIPLM extends to larger training scales, as evidenced by the scaling-curve extrapolation.
Researcher Affiliation | Collaboration | Yuxian Gu1,2, Hao Zhou2, Fandong Meng2, Jie Zhou2, Minlie Huang1; 1The CoAI Group, Tsinghua University; 2WeChat AI, Tencent Inc., China
Pseudocode | No | The paper describes the MINIPLM training pipeline and the Difference Sampling method in text and with equations (e.g., Section 2.4, Eq. 4, Eq. 5). However, it does not include a clearly labeled pseudocode block or algorithm steps formatted as such.
Open Source Code | Yes | Our code, data, and models can be found at https://github.com/thu-coai/MiniPLM.
Open Datasets | Yes | We construct pre-training corpora from the Pile (Gao et al., 2020). [...] We also test the language modeling capability of the LMs on a subset of DCLM (Li et al., 2024a), a high-quality corpus carefully curated with complex pipelines.
Dataset Splits | Yes | We construct pre-training corpora from the Pile (Gao et al., 2020). To control the computation in experiments, we pre-train all LMs on a maximum of 50B tokens, where documents are merged to construct instances with sequence lengths of 1,024. [...] In Sections 3.2 and 3.3, we consider a setting where D is sufficiently large, containing 105B tokens uniformly sampled from the Pile corpus. We reserve 5B tokens as Dref, and conduct Difference Sampling as per Eq. (4) on the other 100B tokens by setting α = 0.5 to construct a 50B-token corpus D′. In this way, the student LM is pre-trained on D′ for one epoch. [...] We sample 10K documents from the DCLM (Li et al., 2024a) corpus to construct our test set for language modeling evaluation.
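The Difference Sampling step quoted above (Eq. 4 with α = 0.5) can be sketched as a simple top-fraction filter. This is a minimal illustration, assuming documents are ranked by the log-probability difference between a large LM and a small reference LM; the function name, scoring details, and toy data below are assumptions for illustration, not the paper's implementation:

```python
def difference_sampling(docs, large_logprobs, ref_logprobs, alpha=0.5):
    """Keep the top-alpha fraction of documents, ranked by how much more
    probable the large LM finds each document than the small reference LM.
    (A sketch in the spirit of Eq. 4; the paper's exact scoring may differ.)
    """
    # Score each document: large-LM log-prob minus reference-LM log-prob.
    scores = [lp - rp for lp, rp in zip(large_logprobs, ref_logprobs)]
    # Number of documents to keep (alpha = 0.5 halves the corpus,
    # matching 100B -> 50B tokens in the quoted setup).
    k = max(1, int(alpha * len(docs)))
    # Rank indices by score, highest first (Python's sort is stable).
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

# Toy usage: four documents, keep the top half by score difference.
kept = difference_sampling(
    ["doc_a", "doc_b", "doc_c", "doc_d"],
    large_logprobs=[-1.0, -2.0, -3.0, -4.0],
    ref_logprobs=[-2.0, -2.0, -2.0, -5.0],
    alpha=0.5,
)
```

In this toy run, `doc_a` and `doc_d` survive because the large LM assigns them notably more log-probability than the small reference LM does.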
Hardware Specification | Yes | All experiments are conducted on NVIDIA 40G A100 and NVIDIA 32G V100 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer and the LM-Eval-Harness framework, but it does not specify version numbers for these tools or for any other software dependencies, such as programming languages or deep learning libraries (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | We set the batch size to 512 and the max sequence length to 1,024, corresponding to 100K total training steps for roughly 50B tokens in Pre-Train w/o KD, Seq KD, and MINIPLM. [...] We linearly warm up the learning rate for 2K steps and apply cosine learning rate decay until 1/10 of the max values. [...] We train all the LMs with the AdamW (Loshchilov & Hutter, 2019) optimizer, with β1 = 0.9, β2 = 0.98, and a 0.1 weight decay. [...] The 500M, 1.8B, and 4B teacher models are the officially released Qwen-1.5 checkpoints [...] Model configurations and corresponding learning rates are summarized in Table 6.