Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. To evaluate the effectiveness of MEAP in training LLMs, we conduct controlled experiments comparing LLMs pretrained/fine-tuned by MEAP with those trained by NTP. |
| Researcher Affiliation | Collaboration | 1School of Artificial Intelligence, University of Chinese Academy of Sciences, China 2SCITIX (SGP) TECH PTE. LTD., Singapore 3South China Normal University, China 4University of Texas at Austin, USA 5Sun Yat-sen University, China 6University of Oxford, UK. |
| Pseudocode | No | The paper describes the MEAP algorithm in Section 3 "Mask-Enhanced Autoregressive Prediction" using formal notation and descriptive text, including mathematical formulas for next-token prediction, but it does not present it within a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/scitix/MEAP. |
| Open Datasets | Yes | For key information retrieval, we choose the well-established Needle-in-a-Haystack evaluation (Liu et al., 2024b)... |
| Dataset Splits | No | The paper uses several benchmark datasets and evaluation frameworks (e.g., LM Eval Harness, Needle-in-a-Haystack, Multi-Document Question Answering) which imply predefined evaluation protocols, but it does not explicitly state the training, validation, and test splits used by the authors for their experiments in terms of percentages or sample counts. For the contextual hallucination evaluation, it mentions using "100 random samples per dataset" but not dataset splits for training/testing. |
| Hardware Specification | No | The paper mentions training parameters and model architecture details but does not specify the exact hardware components (e.g., GPU models, CPU models, or specific cloud instance types) used for the experiments. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, DeepSpeed ZeRO Stage 2, and the Llama-3 tokenizer, but does not provide specific version numbers for these software components or other key libraries like PyTorch or HuggingFace Transformers, which are typically required for reproducibility. |
| Experiment Setup | Yes | The paper provides extensive details on the experimental setup, including model architecture (e.g., 24 layers, 32 attention heads, hidden size 2,048, context length 4,096), learning rate schedule (warm-up for 10% of steps, cosine annealing, max 4e-4, min 4e-5), optimizer (AdamW with β1=0.9, β2=0.95, weight decay 5e-2), batch size (256 for pre-training, 512 for fine-tuning), and sequence length (4,096 tokens) in Section 4.1.1, Section 4.2, Table 11, and Table 19. |
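As the "Pseudocode" row notes, the paper describes MEAP formally rather than in an algorithm block. The core idea it reports is simple: randomly mask a fraction of the *input* tokens while keeping the standard next-token prediction objective on the original sequence. A minimal sketch of that data preparation step is below; the `MASK_ID` sentinel, the `mask_ratio` default, and the function name are illustrative assumptions, not the authors' implementation.

```python
import random

MASK_ID = -1  # hypothetical sentinel; a real setup would reserve a vocab entry


def meap_inputs(tokens, mask_ratio=0.15, seed=0):
    """Build (input, target) pairs in the spirit of Mask-Enhanced
    Autoregressive Prediction: a fraction of the input tokens is replaced
    by MASK_ID, but the shifted next-token targets remain the original,
    unmasked sequence, so the training objective is still standard NTP
    computed over a partially masked context.
    """
    rng = random.Random(seed)
    n = len(tokens)
    k = max(1, int(n * mask_ratio))
    # Sample positions to mask (excluding the last position, which only
    # serves as a target after the shift).
    masked_pos = set(rng.sample(range(n - 1), k))
    inputs = [MASK_ID if i in masked_pos else t for i, t in enumerate(tokens)]
    targets = tokens[1:]      # standard next-token shift
    return inputs[:-1], targets
```

The released code at https://github.com/scitix/MEAP would be the authoritative reference for the actual mask ratio and masking schedule used in the experiments.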