On the Generalization Ability of Next-Token-Prediction Pretraining
Authors: Zhihao Li, Xue Jiang, Liyuan Liu, Xuelin Zhang, Hong Chen, Feng Zheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, experiments on public datasets verify our theoretical findings. Our code is available at https://github.com/Lizeihao/MININTP. To validate the theoretical contribution of this paper, specifically, Theorem 4.22, we performed a set of NTP pre-training experiments in DOMs. These experiments were designed to systematically examine the influence of model parameters and sample size on generalization performance. |
| Researcher Affiliation | Academia | (1) College of Informatics, Huazhong Agricultural University; (2) Department of Computer Science and Engineering, Southern University of Science and Technology; (3) Department of Computer Science, Hong Kong Baptist University; (4) Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education, China. Correspondence to: Hong Chen <EMAIL>. |
| Pseudocode | No | The paper defines the architecture of DOMs (Figure 1b, 1c) and presents mathematical formulations for its components (Equations 4, 5, 6, 7), but it does not include any specific section or figure labeled 'Pseudocode' or 'Algorithm', nor does it present structured, step-by-step procedures in a code-like format. |
| Open Source Code | Yes | Our code is available at https://github.com/Lizeihao/MININTP. |
| Open Datasets | Yes | For pretraining, we employ the MiniMind dataset1, while our test set consists of 8,192 samples (with a maximum sequence length of m = 512) carefully selected from the DAMO NLP dataset2. 1https://www.modelscope.cn/datasets/gongjy/minimind_dataset 2https://www.modelscope.cn/datasets/DAMO_NLP/lcsts_test_set |
| Dataset Splits | Yes | For pretraining, we employ the MiniMind dataset1, while our test set consists of 8,192 samples (with a maximum sequence length of m = 512) carefully selected from the DAMO NLP dataset2. (...) we evaluated performance across 50%, 75%, and 100% subsets of the complete pretraining dataset. |
| Hardware Specification | Yes | To optimize efficiency, we employ FlashAttention (Dao et al., 2022) for accelerated attention computation and conduct distributed training on 8 NVIDIA A800-80GB GPUs using DeepSpeed ZeRO-2 (Rajbhandari et al., 2020). |
| Software Dependencies | No | To optimize efficiency, we employ FlashAttention (Dao et al., 2022) for accelerated attention computation and conduct distributed training on 8 NVIDIA A800-80GB GPUs using DeepSpeed ZeRO-2 (Rajbhandari et al., 2020). For optimization, we utilized the AdamW (Loshchilov & Hutter, 2017) optimizer, combined with a cosine learning rate scheduler that includes a 20-step warm-up phase during the initial training stage. The paper mentions software components like FlashAttention, DeepSpeed ZeRO-2, and the AdamW optimizer, but it does not provide specific version numbers for any of these components. |
| Experiment Setup | Yes | Our training methodology follows the approach outlined in MiniMind. To optimize efficiency, we employ FlashAttention (Dao et al., 2022) for accelerated attention computation and conduct distributed training on 8 NVIDIA A800-80GB GPUs using DeepSpeed ZeRO-2 (Rajbhandari et al., 2020). For optimization, we utilized the AdamW (Loshchilov & Hutter, 2017) optimizer, combined with a cosine learning rate scheduler that includes a 20-step warm-up phase during the initial training stage. (...) Table 3. Model architectures, training data specifications, hyperparameter configurations, and test PPL (m = 512). |
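The NTP pretraining objective the experiments rely on is the standard average next-token negative log-likelihood. The toy function below (names and shapes are illustrative, not the paper's implementation) computes it in pure Python for a single sequence, where the logit vector at position t scores the token at position t+1:

```python
import math

def next_token_nll(logits, tokens):
    """Average next-token negative log-likelihood (the NTP loss).

    logits: one score vector per context position; the vector at index t
            assigns a score to every vocabulary entry as a prediction of
            tokens[t + 1].
    tokens: the observed token id sequence, len(tokens) == len(logits) + 1.
    Returns the mean of -log softmax(logits[t])[tokens[t + 1]] over t.
    """
    total = 0.0
    for t, scores in enumerate(logits):
        target = tokens[t + 1]
        # log of the softmax normalizer, then subtract the target's score
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[target]
    return total / len(logits)
```

With uniform logits over a vocabulary of size V the loss is log V, the usual sanity check; perplexity is then `math.exp(next_token_nll(...))`, matching the test-PPL metric reported in Table 3.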
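The reported setup combines a cosine learning-rate scheduler with a 20-step linear warm-up. As a minimal sketch of that schedule (the base rate, total step count, and the decay-to-zero floor are assumptions, not values taken from the paper):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps=20):
    """Learning rate under linear warm-up followed by cosine decay.

    Linearly ramps from 0 to base_lr over the first `warmup_steps`
    updates, then follows a half-cosine from base_lr down to 0 over the
    remaining steps. Illustrative sketch of the described scheduler.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this function would be handed to an optimizer wrapper (e.g. as a per-step multiplier for AdamW); the closed form makes the warm-up/decay shape easy to verify in isolation.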