Star Attention: Efficient LLM Inference over Long Sequences
Authors: Shantanu Acharya, Fei Jia, Boris Ginsburg
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Star Attention for Llama3.1-8B and Llama3.1-70B (Meta-AI, 2024) on several long-context benchmarks. Star Attention achieves up to 11 times faster inference while maintaining 97-100% of the baseline accuracy. [...] We empirically evaluate Star Attention using several Llama-based models across multiple long-context benchmarks with sequence lengths ranging from 16K to 1M tokens, assessing both its accuracy and inference speedup relative to established baselines. |
| Researcher Affiliation | Industry | NVIDIA. Correspondence to: Shantanu Acharya <EMAIL>, Fei Jia <EMAIL>. |
| Pseudocode | Yes | A. Star Attention Pseudo-code: Algorithm 1 (Phase 1: Context Encoding); Algorithm 2 (Phase 2: Query Encoding and Token Generation) |
| Open Source Code | Yes | Code: https://github.com/NVIDIA/Star-Attention |
| Open Datasets | Yes | We evaluate our method on three benchmarks, each testing unique aspects of long context understanding: (i) RULER (Hsieh et al., 2024): a synthetic benchmark with 13 tasks categorized into 4 domains: Needle-in-a-Haystack (Retrieval), Multi-Hop Tracing, Aggregation, and Question Answering. (ii) BABILong (Kuratov et al., 2024): a benchmark of 5 tasks requiring reasoning over multiple supporting facts encoded in the context to generate accurate answers. (iii) Infinite Bench (Zhang et al., 2024): a diverse collection of 10 real-world and synthetic tasks spanning summarization, multilingual QA, code debugging, and retrieval. |
| Dataset Splits | Yes | We evaluate our method on three benchmarks... (i) RULER (Hsieh et al., 2024)... Each task comprises 500 samples. (ii) BABILong (Kuratov et al., 2024)... each containing 1000 samples. (iii) Infinite Bench (Zhang et al., 2024)... |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs with bfloat16 precision. [...] The results indicate that both Ring and Star Attention can process sequences up to 128K tokens on 8 A100 GPUs... |
| Software Dependencies | No | We implement Star Attention in both the Hugging Face Transformers library (Wolf et al., 2020) and NVIDIA's TRT-LLM framework (NVIDIA, 2023). Optimization techniques such as Flash Attention are applied uniformly across Star and Ring Attention implementations to ensure a fair comparison. Specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | All experiments are conducted on NVIDIA A100 GPUs with bfloat16 precision. Optimization techniques such as Flash Attention are applied uniformly across Star and Ring Attention implementations to ensure a fair comparison. [...] In each setting, the context and the anchor block size are set to one-quarter of the total sequence length. [...] For sequence lengths exceeding 128K, we fix the block size at 32K tokens to prioritize inference speed. |
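The pseudocode entry above (Algorithm 2, Phase 2) refers to query encoding in which each host attends over its local KV block and the partial results are merged into an exact global softmax. A minimal single-query NumPy sketch of that merge is below; it is an illustrative reconstruction, not the paper's implementation, and all function names (`softmax_attention`, `star_attention_query`) are our own.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Local attention of query q over one KV block.

    Returns the block's attention output and its log-sum-exp (LSE),
    which is the statistic needed to merge blocks exactly later.
    """
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()                      # max-shift for numerical stability
    w = np.exp(scores - m)
    s = w.sum()
    return (w @ V) / s, m + np.log(s)     # (output, LSE)

def star_attention_query(q, kv_blocks):
    """Merge per-block partial attentions into the exact global result.

    Each block's output is weighted by exp(LSE_b - max LSE), normalized,
    so the combination equals softmax attention over the concatenated KV.
    """
    outs, lses = zip(*(softmax_attention(q, K, V) for K, V in kv_blocks))
    lses = np.array(lses)
    weights = np.exp(lses - lses.max())
    weights /= weights.sum()
    return sum(w * o for w, o in zip(weights, outs))
```

Because the LSE-weighted merge is algebraically identical to one global softmax, partitioning the KV cache across hosts changes communication cost but not the attention output for the query phase.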