Star Attention: Efficient LLM Inference over Long Sequences

Authors: Shantanu Acharya, Fei Jia, Boris Ginsburg

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Star Attention for Llama3.1-8B and Llama3.1-70B (Meta-AI, 2024) on several long-context benchmarks. Star Attention achieves up to 11 times faster inference while maintaining 97-100% of the baseline accuracy. [...] We empirically evaluate Star Attention using several Llama-based models across multiple long-context benchmarks with sequence lengths ranging from 16K to 1M tokens, assessing both its accuracy and inference speedup relative to established baselines.
Researcher Affiliation | Industry | NVIDIA. Correspondence to: Shantanu Acharya <EMAIL>, Fei Jia <EMAIL>.
Pseudocode | Yes | Appendix A, Star Attention Pseudo-code: Algorithm 1 (Star Attention Phase 1: Context Encoding); Algorithm 2 (Star Attention Phase 2: Query Encoding and Token Generation).
Open Source Code | Yes | Code: https://github.com/NVIDIA/Star-Attention
Open Datasets | Yes | We evaluate our method on three benchmarks, each testing unique aspects of long context understanding: (i) RULER (Hsieh et al., 2024): a synthetic benchmark with 13 tasks categorized into 4 domains: Needle-in-a-Haystack (Retrieval), Multi-Hop Tracing, Aggregation, and Question Answering. (ii) BABILong (Kuratov et al., 2024): a benchmark of 5 tasks requiring reasoning over multiple supporting facts encoded in the context to generate accurate answers. (iii) Infinite Bench (Zhang et al., 2024): a diverse collection of 10 real-world and synthetic tasks spanning summarization, multilingual QA, code debugging, and retrieval.
Dataset Splits | Yes | We evaluate our method on three benchmarks... (i) RULER (Hsieh et al., 2024)... Each task comprises 500 samples. (ii) BABILong (Kuratov et al., 2024)... each containing 1000 samples. (iii) Infinite Bench (Zhang et al., 2024)...
Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs with bfloat16 precision. [...] The results indicate that both Ring and Star Attention can process sequences up to 128K tokens on 8 A100 GPUs...
Software Dependencies | No | We implement Star Attention in both the Hugging Face Transformers library (Wolf et al., 2020) and NVIDIA's TRT-LLM framework (NVIDIA, 2023). Optimization techniques such as Flash Attention are applied uniformly across the Star and Ring Attention implementations to ensure a fair comparison. Specific version numbers for these software components are not provided.
Experiment Setup | Yes | All experiments are conducted on NVIDIA A100 GPUs with bfloat16 precision. Optimization techniques such as Flash Attention are applied uniformly across Star and Ring Attention implementations to ensure a fair comparison. [...] In each setting, the context and the anchor block size are set to one-quarter of the total sequence length. [...] For sequence lengths exceeding 128K, we fix the block size at 32K tokens to prioritize inference speed.
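The pseudocode row above names a two-phase scheme whose first phase encodes the context in blocks, with each block also attending to a fully visible "anchor" block (the first block of the context). The following is a minimal NumPy sketch of that Phase 1 idea under my own reading of the paper — function names, shapes, and the single-head layout are assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phase1_context_encoding(q, k, v, block_size):
    """Sketch of Star Attention Phase 1 (context encoding), assumed layout:
    the context is split into blocks, and each block attends causally to
    itself plus a fully visible copy of the first ("anchor") block,
    rather than to the entire prefix as in full causal attention."""
    n, d = q.shape
    anchor_k, anchor_v = k[:block_size], v[:block_size]
    out = np.empty_like(v)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        if start == 0:
            kk, vv = k[:end], v[:end]  # the first block is its own anchor
        else:
            kk = np.concatenate([anchor_k, k[start:end]])
            vv = np.concatenate([anchor_v, v[start:end]])
        blk = end - start
        a = kk.shape[0] - blk  # number of anchor positions
        # causal mask over the block's own positions; anchor stays visible
        mask = np.zeros((blk, kk.shape[0]), dtype=bool)
        mask[:, a:] = np.triu(np.ones((blk, blk), dtype=bool), k=1)
        scores = q[start:end] @ kk.T / np.sqrt(d)
        scores[mask] = -np.inf
        out[start:end] = softmax(scores) @ vv
    return out
```

Because each block only ever sees `2 * block_size` keys, the per-block cost is constant rather than growing with the prefix length, which is where the quoted speedup over full attention comes from; Phase 2 (query encoding and token generation, Algorithm 2) then attends globally over the cached block KVs and is not sketched here.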
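The experiment-setup row quotes two block-size rules: one quarter of the total sequence length, with the block size fixed at 32K tokens for sequences beyond 128K. A one-line helper makes the rule explicit (the function name is mine, not from the paper):

```python
def star_attention_block_size(seq_len: int, cap: int = 32_768) -> int:
    """Block-size rule quoted in the experiment setup: one quarter of the
    total sequence length, capped at 32K tokens for sequences over 128K."""
    return min(seq_len // 4, cap)
```

Note that the two rules coincide at 128K exactly (128K / 4 = 32K), so the cap only changes behavior for longer sequences.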