XAttention: Block Sparse Attention with Antidiagonal Scoring

Authors: Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents our empirical investigation into the effectiveness of XAttention. We first detail the implementation specifics, followed by evaluation results on text and video understanding, as well as video generation benchmarks, against strong baselines. We then test the acceleration performance of XAttention. Finally, we provide analytical ablation studies to further understand the behavior of XAttention.
Researcher Affiliation | Collaboration | 1Tsinghua University, 2Massachusetts Institute of Technology, 3SJTU, 4NVIDIA. Correspondence to: Guangxuan Xiao <EMAIL>, Song Han <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Block Selection)
Open Source Code | Yes | https://github.com/mit-han-lab/x-attention
Open Datasets | Yes | For natural language tasks, we employ the RULER (Hsieh et al., 2024) dataset, a synthetic benchmark specifically designed to assess long-context abilities in LLMs. [...] We also evaluate on real-world long-context tasks from LongBench (Bai et al., 2023) to test performance in practical scenarios. For video understanding, we utilize the Video-MME (Fu et al., 2024) dataset, [...] In the video generation domain, we leverage 946 GPT-augmented text prompts from VBench (Huang et al., 2024) to generate videos.
Dataset Splits | No | We evaluate our model on a diverse set of tasks spanning natural language understanding, video understanding, and video generation. For natural language tasks, we employ the RULER (Hsieh et al., 2024) dataset... We also evaluate on real-world long-context tasks from LongBench (Bai et al., 2023)... For video understanding, we utilize the Video-MME (Fu et al., 2024) dataset... In the video generation domain, we leverage 946 GPT-augmented text prompts from VBench (Huang et al., 2024) to generate videos.
Hardware Specification | Yes | We thank NVIDIA for donating the DGX server.
Software Dependencies | No | Our primary baseline for dense attention is FlashAttention (Dao, 2023), implemented within the FlashInfer (Ye et al., 2024) framework. We also compare against MInference (Jiang et al., 2024), FlexPrefill (Lai et al., 2025), and SeerAttention (Gao et al., 2024), strictly adhering to their public implementations.
Experiment Setup | Yes | XAttention is configured with stride S = 8 and S = 16 with the precisely predicted minimum threshold. [...] We apply stride S = 16 and threshold τ = 0.9 on the Qwen2-VL-7B model. [...] We configure XAttention with a stride of S = 8 and thresholds of τ = 0.9 and τ = 0.95. [...] During this phase, we utilize full attention for the first 5 denoising steps, before switching to XAttention. [...] We set K = 8192 and Ratio = 27% for S = 8, and K = 16384 and Ratio = 31% for S = 16, targeting computational costs similar to our Threshold Block Selection. [...] Using Minimum Threshold Prediction, we start with τ = 0.9 and set M = 1000, allowing the dynamic programming (DP) algorithm to explore 1,000 optimal threshold combinations. This results in a set of more refined thresholds, with an average value of 0.8.
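The setup above refers to the paper's Algorithm 1: score each attention block along strided antidiagonals, then keep the smallest set of blocks whose normalized scores reach the threshold τ. The following is a minimal NumPy sketch of those two steps, not the paper's implementation: it materializes a dense score matrix for illustration (the real kernel avoids this), and the function names, block size, and toy dimensions are our own assumptions.

```python
import numpy as np

def antidiagonal_block_scores(attn, block=4, stride=2):
    """Score each (query-block, key-block) pair by summing the entries
    that fall on strided antidiagonals of that block of the attention map.
    `attn` is a dense [n, n] score matrix used here purely for illustration."""
    n = attn.shape[0]
    nb = n // block
    scores = np.zeros((nb, nb))
    for qb in range(nb):
        for kb in range(nb):
            tile = attn[qb * block:(qb + 1) * block,
                        kb * block:(kb + 1) * block]
            i, j = np.indices(tile.shape)
            # keep entries on antidiagonals i + j ≡ 0 (mod stride)
            scores[qb, kb] = tile[(i + j) % stride == 0].sum()
    return scores

def threshold_block_selection(scores, tau=0.9):
    """For each query-block row, select the smallest set of key blocks
    whose softmax-normalized scores accumulate to at least tau."""
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    mask = np.zeros_like(scores, dtype=bool)
    for r in range(scores.shape[0]):
        order = np.argsort(-probs[r])          # highest-scoring blocks first
        cum = np.cumsum(probs[r][order])
        k = int(np.searchsorted(cum, tau)) + 1  # first prefix reaching tau
        mask[r, order[:k]] = True
    return mask
```

A larger τ (e.g. the 0.95 used for video generation) keeps more blocks and so trades speed for accuracy, which matches how the thresholds in the setup above are varied per task.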