ReAttention: Training-Free Infinite Context with Finite Attention Scope
Authors: Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Qipeng Guo, Yuerong Song, Kai Lv, Hang Yan, Linlin Li, Qun Liu, Xipeng Qiu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the performance of ReAttention on LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods. Furthermore, we also apply ReAttention on mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M and even expanding the context length of LLaMA3.2-3B-chat by 128× to 4M without any further training in Needle-In-A-Haystack tests. We conduct experiments on LLaMA3-8B-8K (Meta, 2024a), LLaMA3.1-8B-128K (Dubey et al., 2024), LLaMA3.1-70B-128K (Dubey et al., 2024), LLaMA3.2-3B-128K (Dubey et al., 2024), Mistral-v0.3-7B-32K (mistralai, 2024), InternLM2.5-7B-1M (InternLM, 2024), Qwen2-7B-128K (Yang et al., 2024a), Qwen2-72B-128K (Yang et al., 2024a), Qwen2-1B-32K (Yang et al., 2024a). |
| Researcher Affiliation | Collaboration | 1School of Computer Science, Fudan University, 2Huawei Noah's Ark Lab, 3Shanghai AI Lab, 4Shanghai Innovation Institute |
| Pseudocode | Yes | The pseudocode of the whole process is detailed in Appendix A. (See Appendix A for Algorithm 1: Prefilling Phase and Algorithm 2: Decoding Phase). |
| Open Source Code | Yes | The code is available at https://github.com/OpenMOSS/ReAttention. |
| Open Datasets | Yes | We validate the performance of ReAttention on LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods. Furthermore, we also apply ReAttention on mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M and even expanding the context length of LLaMA3.2-3B-chat by 128× to 4M without any further training in Needle-In-A-Haystack tests. |
| Dataset Splits | Yes | We first evaluate all 9 LLMs on the commonly used long-context benchmarks LongBench (Bai et al., 2023) and L-Eval (An et al., 2023), with a default context length of 32K and middle truncation. We validate our method on InfiniteBench (Zhang et al., 2024c), a more challenging benchmark with a longer context length. We choose 3 commonly tested subtasks, En.MC, En.QA, and En.Sum, and evaluate models with varying context lengths. |
| Hardware Specification | Yes | We perform experiments on 8 A100 GPUs and extend the context lengths of LLMs with Re Attention to at least 1M tokens. All experiments were conducted on a system with a 48-core CPU, 256GB RAM, and an A800-80GB GPU. |
| Software Dependencies | Yes | All experiments are performed with FP16 precision and accelerated with FlashAttention-2 (Dao, 2023). We use Triton (Tillet et al., 2019), a GPU programming language, to minimize read and write overheads in top-k attention. |
| Experiment Setup | Yes | For all models, we set the length of Kglobal to 32, the length of Klocal to 4096, and the selected span size to 32. Moreover, we set k = 4, k = 127 in top-k attention. Importantly, the attention scope in each step remains within the maximum attention window. For example, for LLaMA3-8B-8K with ReAttention, the maximum attention scope size is 32 + 4096 + 127 × 32, which exactly matches the maximum supported attention window of 8192. We use OpenCompass (Contributors, 2023b) for validation. All experiments are performed with FP16 precision and accelerated with FlashAttention-2 (Dao, 2023). |
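The attention-scope budget quoted in the Experiment Setup row can be verified with a few lines of arithmetic. This is a minimal sketch using the values from that row (the variable names are illustrative, not from the paper's code):

```python
# Attention-scope budget for LLaMA3-8B-8K with ReAttention,
# using the settings quoted in the Experiment Setup row.
GLOBAL_LEN = 32   # length of Kglobal
LOCAL_LEN = 4096  # length of Klocal
TOP_K = 127       # k in top-k attention
SPAN_SIZE = 32    # selected span size

# Global sink tokens + local window + k selected spans of SPAN_SIZE tokens each.
max_scope = GLOBAL_LEN + LOCAL_LEN + TOP_K * SPAN_SIZE
print(max_scope)  # 8192, the maximum supported attention window of LLaMA3-8B-8K
```

The budget is tight by construction: with a span size of 32, k = 127 is the largest top-k value that keeps 32 + 4096 + k × 32 within the 8192-token window.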