Block-Attention for Efficient Prefilling

Authors: Dongyang Ma, Yan Wang, Tian Lan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 11 diverse benchmarks, including RAG, ICL, and general domains, demonstrate that after block fine-tuning, the Block-attention model not only achieves performance comparable to that of full-attention models, but can also seamlessly switch between the block and full attention modes without any performance loss.
Researcher Affiliation | Industry | Dongyang Ma, Tencent, EMAIL; Yan Wang, Tencent, EMAIL
Pseudocode | No | The paper describes the steps for Block-attention implementation and position re-encoding using numbered lists and mathematical formulas, but does not present them in a formally labeled 'Pseudocode' or 'Algorithm' block.
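The block-attention pattern and position re-encoding the paper describes can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the authors' implementation: each context block attends causally only within itself, the final block (e.g. the user query) attends to the whole prefix, and block positions are first counted from zero and then shifted to their true offset when the cached KV is reused.

```python
import numpy as np

def block_attention_mask(block_lens):
    """Sketch of a Block-attention mask: each context block attends
    causally only within itself, while the final block attends to all
    preceding tokens. 1 = may attend, 0 = masked."""
    n = sum(block_lens)
    mask = np.zeros((n, n), dtype=int)
    starts = np.cumsum([0] + list(block_lens[:-1]))
    # independent causal attention inside every context block
    for s, length in zip(starts[:-1], block_lens[:-1]):
        for i in range(length):
            mask[s + i, s : s + i + 1] = 1
    # the last block sees the full prefix (standard causal attention)
    q = starts[-1]
    for i in range(block_lens[-1]):
        mask[q + i, : q + i + 1] = 1
    return mask

def reencode_positions(block_lens):
    """Position re-encoding: blocks are first encoded independently
    with positions starting from 0, then shifted to their true offset
    in the concatenated sequence when the KV cache is reused."""
    independent = [list(range(length)) for length in block_lens]
    offsets = np.cumsum([0] + list(block_lens[:-1]))
    reencoded = [[int(o) + p for p in block]
                 for o, block in zip(offsets, independent)]
    return independent, reencoded
```

For example, with block lengths [2, 2, 3] the first two blocks see only themselves, the last three rows attend to the entire prefix, and per-block positions [0, 1], [0, 1], [0, 1, 2] re-encode to [0, 1], [2, 3], [4, 5, 6].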
Open Source Code | Yes | Codes, datasets and model weights have been publicly available at https://github.com/TemporaryLoRA/Block-Attention.
Open Datasets | Yes | Evaluation Dataset: We evaluate the performance of our proposed Block-attention mechanism and baseline models on five widely-used RAG benchmarks: Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (TQA) (Joshi et al., 2017), HotpotQA (HQA) (Yang et al., 2018), 2WikiMultiHopQA (2Wiki) (Ho et al., 2020), and NarrativeQA (NQA) (Kočiský et al., 2017). ... In addition, we also evaluated the performance of the Block-attention model and the full-attention model on seven benchmarks of the general domain: MMLU (Hendrycks et al., 2021a), Big Bench Hard (BBH) (Suzgun et al., 2022), DROP (Dua et al., 2019), MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and IFEval (Zhou et al., 2023).
Dataset Splits | No | The paper mentions evaluating on several benchmarks but does not explicitly provide the training, test, or validation splits used for these datasets or how they were generated for the experiments. It states that 23% of Tulu3-SFT data was used for block fine-tuning, which is a portion of a training dataset, not a general split for evaluation or reproduction.
Hardware Specification | Yes | All experiments are conducted using 8 NVIDIA H20 GPUs with the following hyper-parameters: (1) learning rate α = 2 × 10^-5; (2) batch size b = 64; (3) epochs n = 1; and (4) 20 warmup steps.
Software Dependencies | No | The DeepSpeed and Flash-Attention (Dao et al., 2022) toolkits are utilized to accelerate our training procedure using bfloat16 format. However, specific version numbers for these toolkits are not provided.
Experiment Setup | Yes | All experiments are conducted using 8 NVIDIA H20 GPUs with the following hyper-parameters: (1) learning rate α = 2 × 10^-5; (2) batch size b = 64; (3) epochs n = 1; and (4) 20 warmup steps.
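The reported hyper-parameters pin down the warmup phase but not what follows it. A minimal learning-rate function consistent with the stated values is sketched below; holding the rate constant after warmup is an assumption, since the paper does not state a decay schedule.

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=20):
    """Linear warmup to the reported peak learning rate (2e-5) over
    the reported 20 warmup steps. Constant rate afterwards is an
    assumption, not something the paper specifies."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```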