TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1×. Comprehensive evaluation with the LongChat-7b-v1.5-32k, Llama-3-8B, Llama-3-70B, and Llama-3.1-8B models on the Needle-in-the-Haystack, PG-19, and LongBench tasks demonstrates that TidalDecode can consistently achieve the best performance-efficiency trade-off compared with the best existing sparse attention methods. We conduct extensive experiments to assess both the performance and efficiency of TidalDecode. Our evaluations are performed on widely used open-source models, including Llama-2-7B (Touvron et al., 2023) and Llama-3-8/70B. In Section 4.2, we evaluate TidalDecode's performance on various tasks, including needle-in-the-haystack, language modeling on PG-19, and LongBench. |
| Researcher Affiliation | Academia | Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, and Zhihao Jia (all Carnegie Mellon University) |
| Pseudocode | Yes | Algorithm 1 TidalDecode. Input: current embedding h, KV cache C, token budget m. Output: logits. Initialize: ρ = [] (token buffer to store selected tokens). For each decoder layer i: q, k, v = f(W_qkv, h); C.append(k, v); if i is a full attention layer: o = FullAttention(q, C[:]) (dense attention with the full KV cache); else if i is a token selection layer: o = FullAttention(q, C[:]) (dense attention with the full KV cache), K = C.getKeys(), ρ = argTopK(⟨q, K⟩, m) (update token buffer); else: o = SparseAttention(q, C[ρ]) (sparse attention with the tokens in the token buffer); h = FFN(o). After the layer loop: logits = lm_head(h); return logits. |
| Open Source Code | Yes | Code is available at https://github.com/DerrickYLJ/TidalDecode. |
| Open Datasets | Yes | Comprehensive evaluation with the LongChat-7b-v1.5-32k, Llama-3-8B, Llama-3-70B, and Llama-3.1-8B models on the Needle-in-the-Haystack, PG-19, and LongBench tasks demonstrates that TidalDecode can consistently achieve the best performance-efficiency trade-off compared with the best existing sparse attention methods. We use the Llama-3-8B model and the needle-in-the-haystack test on the PG-19-mini dataset with a context length of 100K tokens. Perplexity measures the negative likelihood of how well a model predicts the next word in a sequence, with lower values indicating better performance. We evaluate TidalDecode on Llama-3-8B-Instruct-Gradient-1048k with the PG-19 dataset, which includes up to 100 books, providing a comprehensive long-context benchmark. We also evaluate TidalDecode on LongBench, a benchmark designed to test LLMs on long-context tasks across diverse NLP domains (Bai et al., 2023). |
| Dataset Splits | No | The paper mentions using well-known benchmark datasets such as PG-19 and LongBench, and conducting the Needle-in-the-Haystack test. For the Needle-in-the-Haystack test, it states: 'We arbitrarily select 100 requests from the dataset, insert needles to random depth, compute full attention, and analyze the correlation of attention score patterns between different Transformer layers.' This describes how test instances are generated, but the paper does not provide explicit train/validation/test splits (e.g., percentages, sample counts, or citations to predefined splits) for the broader experimental setup, relying instead on the implied standard evaluation protocols of the benchmarks. |
| Hardware Specification | Yes | We conduct evaluation under the configuration of Llama-2-7B on one Nvidia A100 (80 GB HBM, SXM4) with CUDA 12.2. |
| Software Dependencies | Yes | We conduct evaluation under the configuration of Llama-2-7B on one Nvidia A100 (80 GB HBM, SXM4) with CUDA 12.2. |
| Experiment Setup | Yes | Empirical evidence shows that using just two token selection layers (one at the beginning and one in the middle) is sufficient to achieve high generative performance while minimizing computation and memory overheads. We use the Llama-3-8B model and the needle-in-the-haystack test on the PG-19-mini dataset with a context length of 100K tokens. In each test, we insert a random password within the text and test whether the specific method can retrieve the password correctly. TidalDecode is compared against full-weight attention and Quest at token budgets of 1024 and 4096. For the 32-layer Llama model, TidalDecode uses 2 full attention layers + 2 token selection layers + 28 sparse attention layers, while Quest uses 2 full attention layers + 30 Quest attention layers. For the 64-layer Llama model, TidalDecode uses 2 full attention layers + 2 token selection layers + 60 sparse attention layers, while Quest uses 2 full attention layers + 62 Quest attention layers. |
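The decoding loop in Algorithm 1 can be sketched in a few lines of numpy. This is a toy, single-head, batch-free illustration, not the paper's implementation: the function and helper names (`tidal_decode_step`, `attend`), the per-layer cache layout, and the `tanh` stand-in for the FFN are all assumptions made for brevity. The key idea it demonstrates is position persistence: the token buffer ρ computed at a token selection layer is reused, unchanged, by every subsequent sparse attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, m = 8, 16, 4  # head dim, cached context tokens, token budget

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    return softmax(K @ q / np.sqrt(d)) @ V

def tidal_decode_step(h, caches, kinds, weights, m):
    """One decoding step of Algorithm 1 (toy sketch).

    kinds[i]   : "full", "select", or "sparse" for decoder layer i
    weights[i] : (Wq, Wk, Wv) projection matrices for layer i
    caches[i]  : (K, V) arrays for layer i; grow by one row per step
    """
    rho = None  # position-persistent token buffer, shared by later sparse layers
    for i, kind in enumerate(kinds):
        Wq, Wk, Wv = weights[i]
        q, k, v = Wq @ h, Wk @ h, Wv @ h
        K = np.vstack([caches[i][0], k])  # C.append(k, v)
        V = np.vstack([caches[i][1], v])
        caches[i] = (K, V)
        if kind == "full":
            o = attend(q, K, V)               # dense attention over full KV cache
        elif kind == "select":
            o = attend(q, K, V)               # dense attention over full KV cache...
            rho = np.argsort(K @ q)[-m:]      # ...then keep top-m tokens by score
        else:  # "sparse"
            o = attend(q, K[rho], V[rho])     # attend only to the buffered tokens
        h = np.tanh(o)                        # stand-in for the FFN sublayer
    return h, rho
```

With `kinds = ["full", "select", "sparse", "sparse"]`, the two trailing sparse layers attend to only m of the n_ctx + 1 cached positions, which is where the decoding speedup in the paper comes from; the selection layer still pays for a dense pass, which is why the paper uses only two of them.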