MiniPLM: Knowledge Distillation for Pre-training Language Models

Authors: Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that MINIPLM boosts the student LMs' performance on 9 common downstream tasks, improves language modeling capabilities, and reduces pre-training computation. The benefit of MINIPLM extends to larger training scales, as evidenced by the scaling-curve extrapolation.
Researcher Affiliation | Collaboration | Yuxian Gu1,2, Hao Zhou2, Fandong Meng2, Jie Zhou2, Minlie Huang1; 1The CoAI Group, Tsinghua University; 2WeChat AI, Tencent Inc., China
Pseudocode | No | The paper describes the MINIPLM training pipeline and the Difference Sampling method in text and with equations (e.g., Section 2.4, Eq. 4, Eq. 5). However, it does not include a clearly labeled pseudocode block or algorithm steps formatted as such.
Open Source Code | Yes | Our code, data, and models can be found at https://github.com/thu-coai/MiniPLM.
Open Datasets | Yes | We construct pre-training corpora from the Pile (Gao et al., 2020). [...] We also test the language modeling capability of the LMs on a subset of DCLM (Li et al., 2024a), a high-quality corpus carefully curated with complex pipelines.
Dataset Splits | Yes | We construct pre-training corpora from the Pile (Gao et al., 2020). To control the computation in experiments, we pre-train all LMs on a maximum of 50B tokens, where documents are merged to construct instances with sequence lengths of 1,024. [...] In Sections 3.2 and 3.3, we consider a setting where D is sufficiently large, containing 105B tokens uniformly sampled from the Pile corpus. We reserve 5B tokens as Dref, and conduct Difference Sampling as per Eq. (4) on the other 100B tokens by setting α = 0.5 to construct a 50B-token corpus D′. In this way, the student LM is pre-trained on D′ for one epoch. [...] We sample 10K documents from the DCLM (Li et al., 2024a) corpus to construct our test set for language modeling evaluation.
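The Difference Sampling step quoted above (Eq. 4 with α = 0.5) can be sketched as a simple top-fraction filter. This is a minimal illustration, assuming documents are ranked by the log-probability difference between a large LM and a small reference LM; the function name, scoring details, and toy data below are assumptions for illustration, not the paper's implementation:

```python
def difference_sampling(docs, large_logprobs, ref_logprobs, alpha=0.5):
    """Keep the top-alpha fraction of documents, ranked by how much more
    probable the large LM finds each document than the small reference LM.
    (A sketch in the spirit of Eq. 4; the paper's exact scoring may differ.)
    """
    # Score each document: large-LM log-prob minus reference-LM log-prob.
    scores = [lp - rp for lp, rp in zip(large_logprobs, ref_logprobs)]
    # Number of documents to keep (alpha = 0.5 halves the corpus,
    # matching 100B -> 50B tokens in the quoted setup).
    k = max(1, int(alpha * len(docs)))
    # Rank indices by score, highest first (Python's sort is stable).
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

# Toy usage: four documents, keep the top half by score difference.
kept = difference_sampling(
    ["doc_a", "doc_b", "doc_c", "doc_d"],
    large_logprobs=[-1.0, -2.0, -3.0, -4.0],
    ref_logprobs=[-2.0, -2.0, -2.0, -5.0],
    alpha=0.5,
)
```

In this toy run, `doc_a` and `doc_d` survive because the large LM assigns them notably more log-probability than the small reference LM does.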
Hardware Specification | Yes | All experiments are conducted on NVIDIA 40G A100 and NVIDIA 32G V100 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer and the LM-Eval-Harness framework, but it does not specify version numbers for these tools or for any other software dependencies, such as programming languages or deep learning libraries (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | We set the batch size to 512 and the max sequence length to 1,024, corresponding to 100K total training steps for roughly 50B tokens in Pre-Train w/o KD, Seq KD, and MINIPLM. [...] We linearly warm up the learning rate for 2K steps and apply cosine learning rate decay until 1/10 of the max values. [...] We train all the LMs with the AdamW (Loshchilov & Hutter, 2019) optimizer, with β1 = 0.9, β2 = 0.98, and a 0.1 weight decay. [...] The 500M, 1.8B, and 4B teacher models are the officially released Qwen-1.5 checkpoints [...] Model configurations and corresponding learning rates are summarized in Table 6.