Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Authors: Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Intensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. To evaluate the effectiveness of MEAP in training LLMs, we conduct controlled experiments comparing LLMs pretrained/fine-tuned by MEAP with those trained by NTP.
Researcher Affiliation | Collaboration | (1) School of Artificial Intelligence, University of Chinese Academy of Sciences, China; (2) SCITIX (SGP) TECH PTE. LTD., Singapore; (3) South China Normal University, China; (4) University of Texas at Austin, USA; (5) Sun Yat-sen University, China; (6) University of Oxford, UK.
Pseudocode | No | The paper describes the MEAP algorithm in Section 3 "Mask-Enhanced Autoregressive Prediction" using formal notation and descriptive text, including mathematical formulas for next-token prediction, but it does not present it within a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | Code is available at https://github.com/scitix/MEAP.
Open Datasets | Yes | For key information retrieval, we choose the well-established Needle-in-a-Haystack evaluation (Liu et al., 2024b)...
Dataset Splits | No | The paper uses several benchmark datasets and evaluation frameworks (e.g., LM Eval Harness, Needle-in-a-Haystack, Multi-Document Question Answering) that imply predefined evaluation protocols, but it does not explicitly state the training, validation, and test splits used by the authors, either as percentages or as sample counts. For the contextual hallucination evaluation, it mentions using "100 random samples per dataset" but not any train/test split.
Hardware Specification | No | The paper mentions training parameters and model architecture details but does not specify the exact hardware components (e.g., GPU models, CPU models, or specific cloud instance types) used for the experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer, DeepSpeed ZeRO Stage 2, and the Llama-3 tokenizer, but does not provide specific version numbers for these software components or for other key libraries such as PyTorch or HuggingFace Transformers, which are typically required for reproducibility.
Experiment Setup | Yes | The paper provides extensive details on the experimental setup, including model architecture (e.g., 24 layers, 32 attention heads, hidden size 2048, context length 4096), learning rate schedule (warm-up for 10% of steps, cosine annealing, max 4e-4, min 4e-5), optimizer (AdamW with β1=0.9, β2=0.95, weight decay 5e-2), batch size (256 for pre-training, 512 for fine-tuning), and sequence length (4096 tokens) in Section 4.1.1, Section 4.2, Table 11, and Table 19.
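At a high level, the paper's title and the responses above suggest that MEAP keeps ordinary next-token prediction as the training objective while masking part of the input. A minimal sketch of that idea follows; the mask ratio, mask-token id, and function name are illustrative assumptions, not values taken from the paper, which should be consulted (Section 3) for the actual formulation.

```python
import random

MASK_ID = 0  # placeholder mask-token id (assumption, not from the paper)

def meap_inputs_and_targets(token_ids, mask_ratio=0.15, rng=random):
    """Sketch of mask-enhanced autoregressive prediction: replace a fraction
    of input tokens with a mask token, but keep the standard next-token
    targets computed from the ORIGINAL (unmasked) sequence."""
    inputs = list(token_ids)
    n_mask = int(len(inputs) * mask_ratio)
    for i in rng.sample(range(len(inputs)), n_mask):
        inputs[i] = MASK_ID
    # Targets are unchanged: ordinary next-token prediction.
    targets = list(token_ids[1:])
    return inputs[:-1], targets
```

The point of the sketch is only the contrast with plain NTP: the loss is still autoregressive, but the model must predict some tokens whose surface form has been hidden from the input.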
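The learning-rate schedule reported above (linear warm-up over the first 10% of steps to a peak of 4e-4, then cosine annealing down to 4e-5) can be sketched as a small function; the function name and step counts are illustrative, and the exact warm-up shape (linear here) is an assumption.

```python
import math

def lr_at_step(step, total_steps, max_lr=4e-4, min_lr=4e-5, warmup_frac=0.10):
    """Warm-up-then-cosine schedule matching the reported hyperparameters:
    linear warm-up to max_lr over the first warmup_frac of steps, then
    cosine annealing from max_lr down to min_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, with 1000 total steps the rate peaks at 4e-4 at step 100 and decays toward 4e-5 by the final step.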