XAttention: Block Sparse Attention with Antidiagonal Scoring

Authors: Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents our empirical investigation into the effectiveness of XAttention. We first detail the implementation specifics, followed by evaluation results on text and video understanding, as well as video generation benchmarks, against strong baselines. We then test the acceleration performance of XAttention. Finally, we provide analytical ablation studies to further understand the behavior of XAttention.
Researcher Affiliation | Collaboration | 1Tsinghua University, 2Massachusetts Institute of Technology, 3SJTU, 4NVIDIA. Correspondence to: Guangxuan Xiao <EMAIL>, Song Han <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Block Selection)
Open Source Code | Yes | https://github.com/mit-han-lab/x-attention
Open Datasets | Yes | For natural language tasks, we employ the RULER (Hsieh et al., 2024) dataset, a synthetic benchmark specifically designed to assess long-context abilities in LLMs. [...] We also evaluate on real-world long-context tasks from LongBench (Bai et al., 2023) to test performance in practical scenarios. For video understanding, we utilize the Video-MME (Fu et al., 2024) dataset, [...] In the video generation domain, we leverage 946 GPT-augmented text prompts from VBench (Huang et al., 2024) to generate videos.
Dataset Splits | No | We evaluate our model on a diverse set of tasks spanning natural language understanding, video understanding, and video generation. For natural language tasks, we employ the RULER (Hsieh et al., 2024) dataset... We also evaluate on real-world long-context tasks from LongBench (Bai et al., 2023)... For video understanding, we utilize the Video-MME (Fu et al., 2024) dataset... In the video generation domain, we leverage 946 GPT-augmented text prompts from VBench (Huang et al., 2024) to generate videos.
Hardware Specification | Yes | We thank NVIDIA for donating the DGX server.
Software Dependencies | No | Our primary baseline for dense attention is FlashAttention (Dao, 2023), implemented within the FlashInfer (Ye et al., 2024) framework. We also compare against MInference (Jiang et al., 2024), FlexPrefill (Lai et al., 2025), and SeerAttention (Gao et al., 2024), strictly adhering to their public implementations.
Experiment Setup | Yes | XAttention is configured with stride S = 8 and S = 16 with the precisely predicted minimum threshold. [...] We apply stride S = 16 and threshold τ = 0.9 on the Qwen2-VL-7B model. [...] We configure XAttention with a stride of S = 8 and thresholds of τ = 0.9 and τ = 0.95. [...] During this phase, we utilize full attention for the first 5 denoising steps, before switching to XAttention. [...] We set K = 8192 and Ratio = 27% for S = 8, and K = 16384 and Ratio = 31% for S = 16, targeting computational costs similar to our Threshold Block Selection. [...] Using Minimum Threshold Prediction, we start with τ = 0.9 and set M = 1000, allowing the dynamic programming (DP) algorithm to explore 1,000 optimal threshold combinations. This results in a set of more refined thresholds, with an average value of 0.8.
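The setup above refers to the paper's Algorithm 1: score each attention block along strided antidiagonals, then keep the smallest set of blocks whose normalized scores reach the threshold τ. The following is a minimal NumPy sketch of those two steps, not the paper's implementation: it materializes a dense score matrix for illustration (the real kernel avoids this), and the function names, block size, and toy dimensions are our own assumptions.

```python
import numpy as np

def antidiagonal_block_scores(attn, block=4, stride=2):
    """Score each (query-block, key-block) pair by summing the entries
    that fall on strided antidiagonals of that block of the attention map.
    `attn` is a dense [n, n] score matrix used here purely for illustration."""
    n = attn.shape[0]
    nb = n // block
    scores = np.zeros((nb, nb))
    for qb in range(nb):
        for kb in range(nb):
            tile = attn[qb * block:(qb + 1) * block,
                        kb * block:(kb + 1) * block]
            i, j = np.indices(tile.shape)
            # keep entries on antidiagonals i + j ≡ 0 (mod stride)
            scores[qb, kb] = tile[(i + j) % stride == 0].sum()
    return scores

def threshold_block_selection(scores, tau=0.9):
    """For each query-block row, select the smallest set of key blocks
    whose softmax-normalized scores accumulate to at least tau."""
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    mask = np.zeros_like(scores, dtype=bool)
    for r in range(scores.shape[0]):
        order = np.argsort(-probs[r])          # highest-scoring blocks first
        cum = np.cumsum(probs[r][order])
        k = int(np.searchsorted(cum, tau)) + 1  # first prefix reaching tau
        mask[r, order[:k]] = True
    return mask
```

A larger τ (e.g. the 0.95 used for video generation) keeps more blocks and so trades speed for accuracy, which matches how the thresholds in the setup above are varied per task.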