Block-Attention for Efficient Prefilling
Authors: Dongyang Ma, Yan Wang, Tian Lan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 11 diverse benchmarks, including RAG, ICL, and general domains, demonstrate that after block fine-tuning, the Block-attention model not only achieves performance comparable to that of full-attention models, but can also seamlessly switch between the block and full attention modes without any performance loss. |
| Researcher Affiliation | Industry | Dongyang Ma, Tencent, EMAIL; Yan Wang, Tencent, EMAIL |
| Pseudocode | No | The paper describes the steps for Block-attention implementation and position re-encoding using numbered lists and mathematical formulas, but does not present them in a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Codes, datasets and model weights have been publicly available at https://github.com/TemporaryLoRA/Block-attention. |
| Open Datasets | Yes | Evaluation Dataset We evaluate the performance of our proposed Block-attention mechanism and baseline models on five widely-used RAG benchmarks: Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (TQA) (Joshi et al., 2017), HotpotQA (HQA) (Yang et al., 2018), 2WikiMultiHopQA (2Wiki) (Ho et al., 2020), and NarrativeQA (NQA) (Kočiský et al., 2017). ... In addition, we also evaluated the performance of the Block-attention model and the full-attention model in seven benchmarks of the general domain: MMLU (Hendrycks et al., 2021a), Big-Bench Hard (BBH) (Suzgun et al., 2022), DROP (Dua et al., 2019), MATH (Hendrycks et al., 2021b), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021), and IFEval (Zhou et al., 2023). |
| Dataset Splits | No | The paper mentions evaluating on several benchmarks but does not explicitly provide the training, test, or validation splits used for these datasets or how they were generated for the experiments. It states that 23% of Tulu3-SFT data was used for block fine-tuning, which is a portion of a training dataset, not a general split for evaluation or reproduction. |
| Hardware Specification | Yes | All experiments are conducted using 8 NVIDIA H20 GPUs with the following hyper-parameters: (1) learning rate α = 2 × 10⁻⁵; (2) batch size b = 64; (3) epochs n = 1; and (4) 20 warmup steps. |
| Software Dependencies | No | The DeepSpeed and Flash-Attention (Dao et al., 2022) toolkits are utilized to accelerate our training procedure using bfloat16 format. However, specific version numbers for these toolkits are not provided. |
| Experiment Setup | Yes | All experiments are conducted using 8 NVIDIA H20 GPUs with the following hyper-parameters: (1) learning rate α = 2 × 10⁻⁵; (2) batch size b = 64; (3) epochs n = 1; and (4) 20 warmup steps. |
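The Pseudocode row notes that the paper describes Block-attention and position re-encoding only in prose and formulas. The core idea can be illustrated with a minimal sketch (my own illustration under assumed definitions, not the authors' released code; function names, block lengths, and the integer-mask representation are assumptions): context blocks are prefilled independently, so each block's tokens attend causally only within their own block, while the trailing query tokens attend causally to everything; cached blocks carry local positions 0..L-1 that must be shifted to global offsets when blocks are assembled.

```python
import numpy as np


def block_attention_mask(block_lens, query_len):
    """Sketch of a Block-attention mask (assumed layout, not the authors' code).

    The context is split into independent blocks: a token attends causally
    only within its own block, while the trailing query tokens attend
    causally to every preceding token. 1 = may attend, 0 = masked.
    """
    total = sum(block_lens) + query_len
    mask = np.zeros((total, total), dtype=int)
    start = 0
    for length in block_lens:
        for i in range(length):
            mask[start + i, start:start + i + 1] = 1  # causal, block-local
        start += length
    for i in range(query_len):
        mask[start + i, :start + i + 1] = 1  # query sees all prior tokens
    return mask


def reencode_positions(block_lens):
    """Position re-encoding sketch: each block is prefilled with local
    positions 0..L-1; at assembly time they are shifted by the cumulative
    length of the preceding blocks to restore a consistent global ordering.
    """
    local, global_, offset = [], [], 0
    for length in block_lens:
        local.append(list(range(length)))
        global_.append([offset + p for p in range(length)])
        offset += length
    return local, global_
```

For example, two retrieved passages of 3 and 2 tokens plus a 2-token question yield a 7×7 mask in which the first token of the second block cannot attend to the first block, and that block's locally cached positions [0, 1] are re-encoded to the global positions [3, 4].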