Attention-Level Speculation

Authors: Jack Cai, Ammar Vora, Randolph Zhang, Mark O’Connor, Mark C. Jeffrey

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a novel form of attention-level speculative parallelism (ALSpec) that predicts self-attention outputs to execute subsequent operations early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5× and improving end-to-end decode latency by up to 1.65×, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent's NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models. We evaluate the implementation of Algorithm 2 with the speculative flash decode kernel to answer the following questions: (i) Does dynamic execution really improve performance on real device implementations? (ii) What speculation hit rates do we see across benchmarks? (iii) Why is dynamic sometimes superior to static approximation?
Researcher Affiliation | Collaboration | 1University of Toronto, 2Tenstorrent, 3University of Waterloo. Correspondence to: Mark C. Jeffrey <EMAIL>.
Pseudocode | Yes | Algorithm 1 sketches the speculative execution procedure.
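The quoted procedure (predict the attention output, start downstream layers early, then verify against the exact result) can be sketched as follows. This is a minimal illustration in the spirit of the paper's Algorithm 1, not the authors' implementation; all function names here are hypothetical.

```python
import numpy as np

def speculative_decode_step(predict, exact_attention, downstream, verify):
    """Hypothetical sketch of one speculative decode step.

    `predict` cheaply guesses the self-attention output, letting `downstream`
    (the non-attention layers) start early; on real hardware this early work
    would overlap with `exact_attention` running on another device.
    `verify` decides whether the speculation is a hit.
    """
    guess = predict()                 # predicted attention output
    early = downstream(guess)         # speculative downstream computation
    exact = exact_attention()         # full self-attention result
    if verify(guess, exact):
        return early, True            # hit: commit the early result
    return downstream(exact), False   # miss: squash and recompute downstream
```

With a perfect predictor the early result is committed; with a wrong one the step falls back to the exact path, so correctness never depends on the prediction.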
Open Source Code | Yes | Our code is at github.com/mcj-group/alspec
Open Datasets | Yes | We evaluate correctness and measure the speculation hit rate at various choices of λ. We group the evaluations into categories of {Question Answering (QA), Information Retrieval (IR), Reasoning, Long Context, and Math}. We use the LM Evaluation Harness (Gao et al., 2024), a framework for few-shot language model evaluation, with the default prompts and few-shot templates. Table 1 shows the output quality and speculation hit rate for λ ∈ {0.05, 0.10, 0.15, 0.20, 0.25} compared to baseline. ALSpec with λ ∈ {0.05, 0.10} achieves on-par or better correctness on all evaluated tasks, with speculation hit rates ranging from 18% (HotpotQA) to 90% (IFEval) depending on the task, and most speculation hit rates exceeding 50% for λ = 0.10.
Dataset Splits | Yes | We evaluate correctness and measure the speculation hit rate at various choices of λ. We group the evaluations into categories of {Question Answering (QA), Information Retrieval (IR), Reasoning, Long Context, and Math}. We use the LM Evaluation Harness (Gao et al., 2024), a framework for few-shot language model evaluation, with the default prompts and few-shot templates.
Hardware Specification | Yes | Demonstrated on Tenstorrent's NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models. We answer the previous questions with a case study of parallelizing the Llama 3.1 8B model onto 8 Tenstorrent N150 chips. For correctness evaluations, we run the model in BF16 precision using NVIDIA A100 or H100 GPUs; we simulate speculation by running both full and approximated attention (S = 128) and choosing the approximated attention output if verification succeeds.
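The simulation described in this row (compute both the full and the S-token approximated attention, keep the approximation only when verification succeeds) can be sketched as below. The relative-error verification criterion and the function names are assumptions for illustration; the paper's actual verifier and kernel are not reproduced here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, K, V):
    # exact single-query attention over the whole KV cache
    return softmax(q @ K.T / np.sqrt(q.shape[-1])) @ V

def simulate_speculation(q, K, V, S=128, lam=0.10):
    """Sketch of the GPU correctness-simulation loop (hypothetical helper).

    Computes both the approximated attention (first and last S cache entries)
    and the full attention, then keeps the approximation only if a
    relative-error check against threshold `lam` succeeds.
    """
    keep = np.r_[0:S, K.shape[0] - S:K.shape[0]]
    approx = attention(q, K[keep], V[keep])
    full = attention(q, K, V)
    # assumed verification rule: relative L2 error no larger than lam
    hit = np.linalg.norm(full - approx) / (np.linalg.norm(full) + 1e-12) <= lam
    return (approx if hit else full), hit
```

Because the full output is always available in this simulation, a verification miss simply falls back to it, so measured task accuracy is never worse than the λ = 0 baseline path.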
Software Dependencies | Yes | We conducted our experiments on Tenstorrent N150 devices using the pre-release version of the TT-Metalium v0.55.0-rc13 software stack. We measure the Llama 3 8B decoding latency on 4 vs. 8 H100 GPUs doing TP using the SGLang (Zheng et al., 2024) serving framework and the FlashInfer (Ye et al., 2025) attention backend.
Experiment Setup | Yes | Table 1 shows the output quality and speculation hit rate for λ ∈ {0.05, 0.10, 0.15, 0.20, 0.25} compared to baseline. We use a coarser version of attention sink that takes the first and last S tokens from the KV cache as input. We choose S to be small relative to the context length (e.g., S ∈ {128, 256, 512} for a 128K context length). In Section 4.1, we choose S as the chunk size of speculative flash decode. The model is executed with the default mixed-precision configuration, with BF16 activations, BF8 KV cache, and BF{16,8,4} model weights.
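The coarse attention-sink selection described here (keep only the first S and last S entries of the KV cache, with S small relative to a 128K context) reduces to a simple slicing step. A minimal sketch, with the helper name being an assumption:

```python
import numpy as np

def sink_kv(K, V, S=128):
    """Hypothetical helper: coarse attention-sink KV selection.

    Keeps the first S cache entries (the attention sink) and the last S
    entries (the most recent tokens), e.g. S in {128, 256, 512} against a
    128K-token context.
    """
    n = K.shape[0]
    if n <= 2 * S:
        return K, V  # cache not yet longer than 2S: keep everything
    keep = np.r_[0:S, n - S:n]
    return K[keep], V[keep]
```

At a 128K context with S = 128, this retains 256 of 131,072 entries, which is the roughly 99.8% attention reduction that makes the approximated path cheap enough to speculate on.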