XAttention: Block Sparse Attention with Antidiagonal Scoring
Authors: Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section presents our empirical investigation into the effectiveness of XAttention. We first detail the implementation specifics, followed by evaluation results on text and video understanding, as well as video generation benchmarks, against strong baselines. We then test the acceleration performance of XAttention. Finally, we provide analytical ablation studies to further understand the behavior of XAttention. |
| Researcher Affiliation | Collaboration | 1Tsinghua University 2Massachusetts Institute of Technology 3SJTU 4NVIDIA. Correspondence to: Guangxuan Xiao <EMAIL>, Song Han <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Block Selection |
| Open Source Code | Yes | https://github.com/mit-han-lab/x-attention |
| Open Datasets | Yes | For natural language tasks, we employ the RULER (Hsieh et al., 2024) dataset, a synthetic benchmark specifically designed to assess long-context abilities in LLMs. [...] We also evaluate on real-world long-context tasks from LongBench (Bai et al., 2023) to test performance in practical scenarios. For video understanding, we utilize the Video-MME (Fu et al., 2024) dataset, [...] In the video generation domain, we leverage 946 GPT-augmented text prompts from VBench (Huang et al., 2024) to generate videos. |
| Dataset Splits | No | We evaluate our model on a diverse set of tasks spanning natural language understanding, video understanding, and video generation. For natural language tasks, we employ the RULER (Hsieh et al., 2024) dataset... We also evaluate on real-world long-context tasks from LongBench (Bai et al., 2023)... For video understanding, we utilize the Video-MME (Fu et al., 2024) dataset... In the video generation domain, we leverage 946 GPT-augmented text prompts from VBench (Huang et al., 2024) to generate videos. |
| Hardware Specification | Yes | We thank NVIDIA for donating the DGX server. |
| Software Dependencies | No | Our primary baseline for dense attention is FlashAttention (Dao, 2023), implemented within the FlashInfer (Ye et al., 2024) framework. We also compare against MInference (Jiang et al., 2024), FlexPrefill (Lai et al., 2025), and SeerAttention (Gao et al., 2024), strictly adhering to their public implementations. |
| Experiment Setup | Yes | XAttention is configured with Stride S = 8 and S = 16 with Precisely Predicted Minimum Threshold. [...] We apply Stride S = 16 and threshold τ = 0.9 parameters on the Qwen2-VL-7B model. [...] We configure XAttention with a stride of S = 8 and thresholds of τ = 0.9 and τ = 0.95. [...] During this phase, we utilize full attention for the first 5 denoising steps, before switching to XAttention. [...] We set K = 8192 and Ratio = 27% for S = 8, and K = 16384 and Ratio = 31% for S = 16, targeting computational costs similar to our Threshold Block Selection. [...] Using Minimum Threshold Prediction, we start with τ = 0.9 and set M = 1000, allowing the dynamic programming (DP) algorithm to explore 1,000 optimal threshold combinations. This results in a set of more refined thresholds, with an average value of 0.8. |
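The pseudocode row (Algorithm 1 Block Selection) and the stride/threshold hyperparameters in the setup row can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's implementation: it scores each attention block by summing every S-th antidiagonal (XAttention's efficient version avoids materializing the attention map by pooling Q and K along strided antidiagonals), and the helper names `antidiagonal_block_score` and `select_blocks` are hypothetical.

```python
import numpy as np

def antidiagonal_block_score(attn_block, stride):
    """Score a (B x B) attention block by summing every `stride`-th
    antidiagonal (elements where (i + j) % stride == 0). Sketch only;
    the starting phase of the antidiagonals is an arbitrary choice here."""
    B = attn_block.shape[0]
    i, j = np.indices((B, B))
    return attn_block[(i + j) % stride == 0].sum()

def select_blocks(scores, tau):
    """Threshold Block Selection sketch: keep the smallest set of blocks
    whose cumulative score reaches a fraction tau of the total score."""
    flat = scores.ravel()
    order = np.argsort(flat)[::-1]          # blocks, highest score first
    csum = np.cumsum(flat[order])
    k = int(np.searchsorted(csum, tau * csum[-1])) + 1
    mask = np.zeros(flat.size, dtype=bool)
    mask[order[:k]] = True
    return mask.reshape(scores.shape)

# Toy usage: score 2x2 grid of blocks, keep blocks covering tau = 0.9.
attn = np.abs(np.random.default_rng(0).normal(size=(16, 16)))
attn /= attn.sum()                          # pretend softmax probabilities
B = 8
scores = np.array([[antidiagonal_block_score(attn[bi:bi + B, bj:bj + B], 8)
                    for bj in range(0, 16, B)]
                   for bi in range(0, 16, B)])
keep = select_blocks(scores, tau=0.9)       # boolean block mask
```

With stride S = 1 the score degenerates to the full block sum; larger strides (S = 8 or 16 in the paper's setup) touch proportionally fewer elements while the antidiagonal pattern still crosses every row and column of the block, which is the intuition behind the scoring.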