Star Attention: Efficient LLM Inference over Long Sequences
Authors: Shantanu Acharya, Fei Jia, Boris Ginsburg
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Star Attention for Llama3.1-8B and Llama3.1-70B (Meta-AI, 2024) on several long-context benchmarks. Star Attention achieves up to 11 times faster inference while maintaining 97-100% of the baseline accuracy. [...] We empirically evaluate Star Attention using several Llama-based models across multiple long-context benchmarks with sequence lengths ranging from 16K to 1M tokens, assessing both its accuracy and inference speedup relative to established baselines. |
| Researcher Affiliation | Industry | NVIDIA. Correspondence to: Shantanu Acharya <EMAIL>, Fei Jia <EMAIL>. |
| Pseudocode | Yes | A. Star Attention Pseudo-code: Algorithm 1 (Phase 1: Context Encoding); Algorithm 2 (Phase 2: Query Encoding and Token Generation) |
| Open Source Code | Yes | Code: https://github.com/NVIDIA/Star-Attention |
| Open Datasets | Yes | We evaluate our method on three benchmarks, each testing unique aspects of long context understanding: (i) RULER (Hsieh et al., 2024): a synthetic benchmark with 13 tasks categorized into 4 domains: Needle-in-a-Haystack (Retrieval), Multi-Hop Tracing, Aggregation, and Question Answering. (ii) BABILong (Kuratov et al., 2024): a benchmark of 5 tasks requiring reasoning over multiple supporting facts encoded in the context to generate accurate answers. (iii) Infinite Bench (Zhang et al., 2024): a diverse collection of 10 real-world and synthetic tasks spanning summarization, multilingual QA, code debugging, and retrieval. |
| Dataset Splits | Yes | We evaluate our method on three benchmarks... (i) RULER (Hsieh et al., 2024)... Each task comprises 500 samples. (ii) BABILong (Kuratov et al., 2024)... each containing 1000 samples. (iii) Infinite Bench (Zhang et al., 2024)... |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs with bfloat16 precision. [...] The results indicate that both Ring and Star Attention can process sequences up to 128K tokens on 8 A100 GPUs... |
| Software Dependencies | No | We implement Star Attention in both the Hugging Face Transformers library (Wolf et al., 2020) and NVIDIA's TRT-LLM framework (NVIDIA, 2023). Optimization techniques such as Flash Attention are applied uniformly across Star and Ring Attention implementations to ensure a fair comparison. Specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | All experiments are conducted on NVIDIA A100 GPUs with bfloat16 precision. Optimization techniques such as Flash Attention are applied uniformly across Star and Ring Attention implementations to ensure a fair comparison. [...] In each setting, the context and the anchor block size are set to one-quarter of the total sequence length. [...] For sequence lengths exceeding 128K, we fix the block size at 32K tokens to prioritize inference speed. |
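The pseudocode entry above (Algorithm 2, Phase 2) refers to query encoding in which each host attends over its local KV block and the partial results are merged into an exact global softmax. A minimal single-query NumPy sketch of that merge is below; it is an illustrative reconstruction, not the paper's implementation, and all function names (`softmax_attention`, `star_attention_query`) are our own.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Local attention of query q over one KV block.

    Returns the block's attention output and its log-sum-exp (LSE),
    which is the statistic needed to merge blocks exactly later.
    """
    scores = K @ q / np.sqrt(q.shape[-1])
    m = scores.max()                      # max-shift for numerical stability
    w = np.exp(scores - m)
    s = w.sum()
    return (w @ V) / s, m + np.log(s)     # (output, LSE)

def star_attention_query(q, kv_blocks):
    """Merge per-block partial attentions into the exact global result.

    Each block's output is weighted by exp(LSE_b - max LSE), normalized,
    so the combination equals softmax attention over the concatenated KV.
    """
    outs, lses = zip(*(softmax_attention(q, K, V) for K, V in kv_blocks))
    lses = np.array(lses)
    weights = np.exp(lses - lses.max())
    weights /= weights.sum()
    return sum(w * o for w, o in zip(weights, outs))
```

Because the LSE-weighted merge is algebraically identical to one global softmax, partitioning the KV cache across hosts changes communication cost but not the attention output for the query phase.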