Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. To evaluate the effectiveness of MEAP in training LLMs, we conduct controlled experiments comparing LLMs pretrained/fine-tuned by MEAP with those trained by NTP.
Researcher Affiliation | Collaboration | (1) School of Artificial Intelligence, University of Chinese Academy of Sciences, China; (2) SCITIX (SGP) TECH PTE. LTD., Singapore; (3) South China Normal University, China; (4) University of Texas at Austin, USA; (5) Sun Yat-sen University, China; (6) University of Oxford, UK.
Pseudocode | No | The paper describes the MEAP algorithm in Section 3 "Mask-Enhanced Autoregressive Prediction" using formal notation and descriptive text, including mathematical formulas for next-token prediction, but it does not present it within a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Code is available at https://github.com/scitix/MEAP.
Open Datasets | Yes | For key information retrieval, we choose the well-established Needle-in-a-Haystack evaluation (Liu et al., 2024b)...
Dataset Splits | No | The paper uses several benchmark datasets and evaluation frameworks (e.g., LM Eval Harness, Needle-in-a-Haystack, Multi-Document Question Answering) that imply predefined evaluation protocols, but it does not explicitly state the training, validation, and test splits used by the authors, either as percentages or as sample counts. For the contextual hallucination evaluation, it mentions using "100 random samples per dataset" but not any train/test split.
Hardware Specification | No | The paper mentions training parameters and model architecture details but does not specify the exact hardware components (e.g., GPU models, CPU models, or specific cloud instance types) used for the experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer, DeepSpeed ZeRO Stage 2, and the Llama-3 tokenizer, but does not provide specific version numbers for these software components or for other key libraries such as PyTorch or HuggingFace Transformers, which are typically required for reproducibility.
Experiment Setup | Yes | The paper provides extensive details on the experimental setup, including model architecture (e.g., 24 layers, 32 attention heads, hidden size 2048, context length 4096), learning rate schedule (warm-up for 10% of steps, cosine annealing, max 4e-4, min 4e-5), optimizer (AdamW with β1=0.9, β2=0.95, weight decay 5e-2), batch size (256 for pre-training, 512 for fine-tuning), and sequence length (4096 tokens) in Section 4.1.1, Section 4.2, Table 11, and Table 19.
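At a high level, the paper's title and the responses above suggest that MEAP keeps ordinary next-token prediction as the training objective while masking part of the input. A minimal sketch of that idea follows; the mask ratio, mask-token id, and function name are illustrative assumptions, not values taken from the paper, which should be consulted (Section 3) for the actual formulation.

```python
import random

MASK_ID = 0  # placeholder mask-token id (assumption, not from the paper)

def meap_inputs_and_targets(token_ids, mask_ratio=0.15, rng=random):
    """Sketch of mask-enhanced autoregressive prediction: replace a fraction
    of input tokens with a mask token, but keep the standard next-token
    targets computed from the ORIGINAL (unmasked) sequence."""
    inputs = list(token_ids)
    n_mask = int(len(inputs) * mask_ratio)
    for i in rng.sample(range(len(inputs)), n_mask):
        inputs[i] = MASK_ID
    # Targets are unchanged: ordinary next-token prediction.
    targets = list(token_ids[1:])
    return inputs[:-1], targets
```

The point of the sketch is only the contrast with plain NTP: the loss is still autoregressive, but the model must predict some tokens whose surface form has been hidden from the input.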
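The learning-rate schedule reported above (linear warm-up over the first 10% of steps to a peak of 4e-4, then cosine annealing down to 4e-5) can be sketched as a small function; the function name and step counts are illustrative, and the exact warm-up shape (linear here) is an assumption.

```python
import math

def lr_at_step(step, total_steps, max_lr=4e-4, min_lr=4e-5, warmup_frac=0.10):
    """Warm-up-then-cosine schedule matching the reported hyperparameters:
    linear warm-up to max_lr over the first warmup_frac of steps, then
    cosine annealing from max_lr down to min_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, with 1000 total steps the rate peaks at 4e-4 at step 100 and decays toward 4e-5 by the final step.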