DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Authors: Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate DuoAttention on both long-context and short-context benchmarks to demonstrate that our method preserves model performance on tasks requiring both long and short contexts while significantly improving efficiency. For long-context evaluations, we use the Needle-in-a-Haystack (NIAH) benchmark (Kamradt, 2024) and LongBench (Bai et al., 2023). For short-context evaluations, we assess performance on MMLU (Hendrycks et al., 2021), MBPP (Austin et al., 2021), and MT-Bench (Zheng et al., 2023). We employ state-of-the-art open-source models, including Llama-2-7B-chat (Touvron et al., 2023b) (and its long-context variant Llama-2-7B-32K-Instruct (Together, 2023)), Llama-3-[8,70]B-Instruct (and its long-context variant Llama-3-8B-Instruct-Gradient-1048k), and Mistral-7B-v0.2-Instruct (Jiang et al., 2023).
Researcher Affiliation Collaboration Guangxuan Xiao (MIT), Jiaming Tang (MIT), Jingwei Zuo (Tsinghua University), Junxian Guo (MIT, SJTU), Shang Yang (MIT), Haotian Tang (MIT), Yao Fu (University of Edinburgh), Song Han (MIT, NVIDIA)
Pseudocode No The paper describes the methodology with textual explanations and figures, such as Figure 2 depicting an "Overview of DuoAttention" and sections like "Optimization-based Identification," but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code Yes Code is provided at https://github.com/mit-han-lab/duo-attention
Open Datasets Yes For long-context evaluations, we use the Needle-in-a-Haystack (NIAH) benchmark (Kamradt, 2024) and LongBench (Bai et al., 2023). For short-context evaluations, we assess performance on MMLU (Hendrycks et al., 2021), MBPP (Austin et al., 2021), and MT-Bench (Zheng et al., 2023). We use one-shot prompting for MMLU and zero-shot prompting for MBPP and MT-Bench. For retrieval head identification, we use a batch size of 1, inserting ten 32-word passkeys into the BookSum (Kryściński et al., 2021) dataset.
Dataset Splits No The paper uses well-known benchmarks such as Needle-in-a-Haystack, LongBench, MMLU, MBPP, and MT-Bench, which typically have standard splits. However, beyond describing the synthetic dataset used for retrieval head identification and the evaluation protocols (one-shot/zero-shot prompting), the paper does not explicitly state train/test/validation splits. For the synthetic dataset: "Training samples are drawn from 50 intervals ranging from 1,000 tokens to the model-specific maximum length. Passkeys are randomly inserted at 1000 points within the context." For NIAH, it specifies how testing is conducted: "Insertion depth varies across 10 levels: 0%, 11%, ..., 100% of the corpus length. Context length varies across 13 sizes."
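The NIAH protocol quoted above implies a fixed evaluation grid of 10 insertion depths × 13 context lengths. A minimal sketch of enumerating that grid, assuming evenly spaced depths (the paper's "0%, 11%, ..., 100%") and using evenly spaced context lengths as illustrative placeholders (the paper does not list the 13 sizes here):

```python
# Sketch of the NIAH evaluation grid: 10 depth levels x 13 context sizes.
# Context sizes below are illustrative placeholders, not taken from the paper.
def niah_grid(num_depths=10, num_sizes=13, max_len=32_000):
    # Evenly spaced depths 0..100% -> 0, 11, 22, ..., 100
    depths = [round(i * 100 / (num_depths - 1)) for i in range(num_depths)]
    sizes = [round(max_len * (j + 1) / num_sizes) for j in range(num_sizes)]
    return [(d, s) for d in depths for s in sizes]

grid = niah_grid()
print(len(grid))        # 130 (depth %, context length) test points
print(grid[0], grid[-1])
```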
Hardware Specification Yes All training experiments in our paper can be conducted on 8 NVIDIA A100 GPU servers. We evaluate DuoAttention's decoding latency and memory usage on Llama-2-7B and Llama-3-8B models on a single NVIDIA A100 GPU. Combining quantization techniques with DuoAttention allows us to accommodate up to 3.30 million tokens on a single A100-80G GPU using the Llama-3-8B model.
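To see why the 3.30-million-token figure requires both KV-cache pruning and quantization, a back-of-envelope sizing helps. This sketch assumes the standard Llama-3-8B GQA configuration (32 layers, 8 KV heads, head dimension 128) and FP16 storage; it is illustrative arithmetic, not a calculation from the paper:

```python
# Back-of-envelope KV-cache sizing for Llama-3-8B (GQA: 32 layers,
# 8 KV heads, head dim 128), assuming FP16 (2 bytes per element).
def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return layers * kv_heads * head_dim * 2 * dtype_bytes  # x2 for K and V

per_tok = kv_cache_bytes_per_token()       # 131072 bytes = 128 KiB per token
full_cache_gib = 3_300_000 * per_tok / 2**30
print(round(full_cache_gib, 1))            # ~402.8 GiB for a plain FP16 cache
```

A full FP16 cache for 3.3M tokens would need roughly 400 GiB, far beyond one 80 GB A100, which is why dropping streaming heads' caches and quantizing the rest are both needed.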
Software Dependencies No We implement DuoAttention in PyTorch (Paszke et al., 2019) using RoPE (Su et al., 2021) and RMSNorm kernels from FlashInfer (Ye et al., 2024). We optimize gate values using the AdamW (Kingma & Ba, 2015) optimizer. The paper mentions software tools and frameworks but does not provide specific version numbers for PyTorch, FlashInfer, or other libraries, which are necessary for full reproducibility.
Experiment Setup Yes For retrieval head identification, we use a batch size of 1, inserting ten 32-word passkeys into the BookSum (Kryściński et al., 2021) dataset. The identification process uses 128 sink tokens and 256 recent tokens. Training samples are drawn from 50 intervals ranging from 1,000 tokens to the model-specific maximum length. Passkeys are randomly inserted at 1000 points within the context. We optimize gate values using the AdamW (Kingma & Ba, 2015) optimizer, starting with a learning rate of 0.02, warming up from 0.002 in the first 400 steps, and reducing back to 0.002 in the final 400 steps. All experiments run for 2,000 steps on NVIDIA A100 GPUs. The final training loss is a combination of the distillation loss and the regularization loss, weighted by a hyperparameter λ, which we set as 0.05 in our experiments. We configure DuoAttention with a 25% retrieval head ratio for Llama-2-7B-32K-Instruct and a 50% ratio for Llama-3-8B-Instruct-Gradient-1048k. We use 64 sink, 256 recent tokens, and a 32,000-token pre-filling chunk size for DuoAttention. For short-context evaluations, we configure 32 sink tokens and 128 recent tokens on MMLU, and 16 sink tokens and 64 recent tokens on MBPP and MT-Bench.
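The training recipe quoted above (linear warmup 0.002 → 0.02 over 400 steps, linear decay back to 0.002 over the final 400 of 2,000 steps, loss = distillation + λ · regularization with λ = 0.05) can be sketched as follows. This is a hedged reconstruction, not the authors' code; in particular, the L1 form of the gate regularizer is an assumption:

```python
# Sketch of the gate-optimization schedule and objective described above.
def lr_at(step, total=2000, ramp=400, base=0.002, peak=0.02):
    if step < ramp:                          # linear warmup
        return base + (peak - base) * step / ramp
    if step >= total - ramp:                 # linear decay over the last 400 steps
        return peak - (peak - base) * (step - (total - ramp)) / ramp
    return peak                              # constant peak LR in between

def combined_loss(distill_loss, gate_values, lam=0.05):
    # Distillation term matches full-attention outputs; the regularizer
    # (assumed L1 here) pushes gate values toward streaming attention.
    reg = sum(abs(g) for g in gate_values) / len(gate_values)
    return distill_loss + lam * reg
```

In practice the gate values would be optimized with AdamW using `lr_at(step)` as the per-step learning rate.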