TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1×. Comprehensive evaluation with the LongChat-7b-v1.5-32k, Llama-3-8B, Llama-3-70B, and Llama-3.1-8B models on the Needle-in-the-Haystack, PG-19, and LongBench tasks demonstrates that TidalDecode can consistently achieve the best performance-efficiency trade-off compared with the best existing sparse attention methods. We conduct extensive experiments to assess both the performance and efficiency of TidalDecode. Our evaluations are performed on widely used open-source models, including Llama-2-7B (Touvron et al., 2023) and Llama-3-8/70B. In Section 4.2, we evaluate TidalDecode's performance on various tasks, including needle-in-the-haystack, language modeling on PG-19, and LongBench. |
| Researcher Affiliation | Academia | Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, and Zhihao Jia (all Carnegie Mellon University) |
| Pseudocode | Yes | Algorithm 1 TidalDecode. Input: current embedding h, KV cache C, token budget m. Output: logits. Initialize: ρ = [] (token buffer to store selected tokens). For each decoder layer i: q, k, v = f(W_qkv, h); C.append(k, v); if i is a full attention layer: o = FullAttention(q, C[:]) (dense attention with the full KV cache); else if i is a token selection layer: o = FullAttention(q, C[:]) (dense attention with the full KV cache), K = C.getKeys(), ρ = argTopK(⟨q, K⟩, m) (update token buffer); else: o = SparseAttention(q, C[ρ]) (sparse attention with the tokens in the token buffer); h = FFN(o). After the layer loop: logits = lm_head(h); return logits. |
| Open Source Code | Yes | Code is available at https://github.com/DerrickYLJ/TidalDecode. |
| Open Datasets | Yes | Comprehensive evaluation with the LongChat-7b-v1.5-32k, Llama-3-8B, Llama-3-70B, and Llama-3.1-8B models on the Needle-in-the-Haystack, PG-19, and LongBench tasks demonstrates that TidalDecode can consistently achieve the best performance-efficiency trade-off compared with the best existing sparse attention methods. We use the Llama-3-8B model and the needle-in-the-haystack test on the PG-19-mini dataset with a context length of 100K tokens. Perplexity measures the negative likelihood of how well a model predicts the next word in a sequence, with lower values indicating better performance. We evaluate TidalDecode on Llama-3-8B-Instruct-Gradient-1048k with the PG-19 dataset, which includes up to 100 books, providing a comprehensive long-context benchmark. We also evaluate TidalDecode on LongBench, a benchmark designed to test LLMs on long-context tasks across diverse NLP domains (Bai et al., 2023). |
| Dataset Splits | No | The paper mentions using well-known benchmark datasets such as PG-19 and LongBench, and conducting the Needle-in-the-Haystack test. For the Needle-in-the-Haystack test, it states: 'We arbitrarily select 100 requests from the dataset, insert needles to random depth, compute full attention, and analyze the correlation of attention score patterns between different Transformer layers.' This describes how test instances are generated, but the paper does not provide explicit train/validation/test splits (e.g., percentages, sample counts, or citations to predefined splits) for the broader experimental setup, relying instead on the implied standard evaluation protocols of the benchmarks. |
| Hardware Specification | Yes | We conduct evaluation under the configuration of Llama-2-7B on one Nvidia A100 (80 GB HBM, SXM4) with CUDA 12.2. |
| Software Dependencies | Yes | We conduct evaluation under the configuration of Llama-2-7B on one Nvidia A100 (80 GB HBM, SXM4) with CUDA 12.2. |
| Experiment Setup | Yes | Empirical evidence shows that using just two token selection layers (one at the beginning and one in the middle) is sufficient to achieve high generative performance while minimizing computation and memory overheads. We use the Llama-3-8B model and the needle-in-the-haystack test on the PG-19-mini dataset with a context length of 100K tokens. In each test, we insert a random password within the text and test whether the specific method can retrieve the password correctly. TidalDecode is compared against full-weight attention and Quest at token budgets of 1024 and 4096. For the 32-layer Llama model, TidalDecode uses 2 full attention layers + 2 token selection layers + 28 sparse attention layers, while Quest uses 2 full attention layers + 30 Quest attention layers. For the 64-layer Llama model, TidalDecode uses 2 full attention layers + 2 token selection layers + 60 sparse attention layers, while Quest uses 2 full attention layers + 62 Quest attention layers. |
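The decoding loop in Algorithm 1 can be sketched in a few lines of numpy. This is a toy, single-head, batch-free illustration, not the paper's implementation: the function and helper names (`tidal_decode_step`, `attend`), the per-layer cache layout, and the `tanh` stand-in for the FFN are all assumptions made for brevity. The key idea it demonstrates is position persistence: the token buffer ρ computed at a token selection layer is reused, unchanged, by every subsequent sparse attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, m = 8, 16, 4  # head dim, cached context tokens, token budget

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    return softmax(K @ q / np.sqrt(d)) @ V

def tidal_decode_step(h, caches, kinds, weights, m):
    """One decoding step of Algorithm 1 (toy sketch).

    kinds[i]   : "full", "select", or "sparse" for decoder layer i
    weights[i] : (Wq, Wk, Wv) projection matrices for layer i
    caches[i]  : (K, V) arrays for layer i; grow by one row per step
    """
    rho = None  # position-persistent token buffer, shared by later sparse layers
    for i, kind in enumerate(kinds):
        Wq, Wk, Wv = weights[i]
        q, k, v = Wq @ h, Wk @ h, Wv @ h
        K = np.vstack([caches[i][0], k])  # C.append(k, v)
        V = np.vstack([caches[i][1], v])
        caches[i] = (K, V)
        if kind == "full":
            o = attend(q, K, V)               # dense attention over full KV cache
        elif kind == "select":
            o = attend(q, K, V)               # dense attention over full KV cache...
            rho = np.argsort(K @ q)[-m:]      # ...then keep top-m tokens by score
        else:  # "sparse"
            o = attend(q, K[rho], V[rho])     # attend only to the buffered tokens
        h = np.tanh(o)                        # stand-in for the FFN sublayer
    return h, rho
```

With `kinds = ["full", "select", "sparse", "sparse"]`, the two trailing sparse layers attend to only m of the n_ctx + 1 cached positions, which is where the decoding speedup in the paper comes from; the selection layer still pays for a dense pass, which is why the paper uses only two of them.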