Block-Attention for Efficient Prefilling

Authors: Dongyang Ma, Yan Wang, Tian Lan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 11 diverse benchmarks, including RAG, ICL, and general domains, demonstrate that after block fine-tuning, the Block-attention model not only achieves performance comparable to that of full-attention models, but can also seamlessly switch between the block and full attention modes without any performance loss.
Researcher Affiliation | Industry | Dongyang Ma, Tencent, EMAIL; Yan Wang, Tencent, EMAIL
Pseudocode | No | The paper describes the steps for Block-attention implementation and position re-encoding using numbered lists and mathematical formulas, but does not present them in a formally labeled 'Pseudocode' or 'Algorithm' block.
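The block-attention pattern and position re-encoding the paper describes can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the authors' implementation: each context block attends causally only within itself, the final block (e.g. the user query) attends to the whole prefix, and block positions are first counted from zero and then shifted to their true offset when the cached KV is reused.

```python
import numpy as np

def block_attention_mask(block_lens):
    """Sketch of a Block-attention mask: each context block attends
    causally only within itself, while the final block attends to all
    preceding tokens. 1 = may attend, 0 = masked."""
    n = sum(block_lens)
    mask = np.zeros((n, n), dtype=int)
    starts = np.cumsum([0] + list(block_lens[:-1]))
    # independent causal attention inside every context block
    for s, length in zip(starts[:-1], block_lens[:-1]):
        for i in range(length):
            mask[s + i, s : s + i + 1] = 1
    # the last block sees the full prefix (standard causal attention)
    q = starts[-1]
    for i in range(block_lens[-1]):
        mask[q + i, : q + i + 1] = 1
    return mask

def reencode_positions(block_lens):
    """Position re-encoding: blocks are first encoded independently
    with positions starting from 0, then shifted to their true offset
    in the concatenated sequence when the KV cache is reused."""
    independent = [list(range(length)) for length in block_lens]
    offsets = np.cumsum([0] + list(block_lens[:-1]))
    reencoded = [[int(o) + p for p in block]
                 for o, block in zip(offsets, independent)]
    return independent, reencoded
```

For example, with block lengths [2, 2, 3] the first two blocks see only themselves, the last three rows attend to the entire prefix, and per-block positions [0, 1], [0, 1], [0, 1, 2] re-encode to [0, 1], [2, 3], [4, 5, 6].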
Open Source Code | Yes | Codes, datasets and model weights have been publicly available at https://github.com/TemporaryLoRA/Block-Attention.
Open Datasets | Yes | Evaluation Dataset: We evaluate the performance of our proposed Block-attention mechanism and baseline models on five widely-used RAG benchmarks: Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (TQA) (Joshi et al., 2017), HotpotQA (HQA) (Yang et al., 2018), 2WikiMultiHopQA (2Wiki) (Ho et al., 2020), and NarrativeQA (NQA) (Kočiský et al., 2017). ... In addition, we also evaluated the performance of the Block-attention model and the full-attention model on seven benchmarks of the general domain: MMLU (Hendrycks et al., 2021a), Big Bench Hard (BBH) (Suzgun et al., 2022), DROP (Dua et al., 2019), MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and IFEval (Zhou et al., 2023).
Dataset Splits | No | The paper mentions evaluating on several benchmarks but does not explicitly provide the training, test, or validation splits used for these datasets or how they were generated for the experiments. It states that 23% of Tulu3-SFT data was used for block fine-tuning, which is a portion of a training dataset, not a general split for evaluation or reproduction.
Hardware Specification | Yes | All experiments are conducted using 8 NVIDIA H20 GPUs with the following hyper-parameters: (1) learning rate α = 2 × 10^-5; (2) batch size b = 64; (3) epochs n = 1; and (4) 20 warmup steps.
Software Dependencies | No | The DeepSpeed and Flash-Attention (Dao et al., 2022) toolkits are utilized to accelerate our training procedure using bfloat16 format. However, specific version numbers for these toolkits are not provided.
Experiment Setup | Yes | All experiments are conducted using 8 NVIDIA H20 GPUs with the following hyper-parameters: (1) learning rate α = 2 × 10^-5; (2) batch size b = 64; (3) epochs n = 1; and (4) 20 warmup steps.
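The reported hyper-parameters pin down the warmup phase but not what follows it. A minimal learning-rate function consistent with the stated values is sketched below; holding the rate constant after warmup is an assumption, since the paper does not state a decay schedule.

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=20):
    """Linear warmup to the reported peak learning rate (2e-5) over
    the reported 20 warmup steps. Constant rate afterwards is an
    assumption, not something the paper specifies."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```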