Attention-Level Speculation

Authors: Jack Cai, Ammar Vora, Randolph Zhang, Mark O’Connor, Mark C. Jeffrey

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a novel form of attention-level speculative parallelism (ALSpec) that predicts self-attention outputs to execute subsequent operations early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5× and improving end-to-end decode latency by up to 1.65×, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent's NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models. We evaluate the implementation of Algorithm 2 with the speculative flash decode kernel to answer the following questions: (i) Does dynamic execution really improve performance on real device implementations? (ii) What speculation hit rates do we see across benchmarks? (iii) Why is dynamic sometimes superior to static approximation?
Researcher Affiliation | Collaboration | 1University of Toronto, 2Tenstorrent, 3University of Waterloo. Correspondence to: Mark C. Jeffrey <EMAIL>.
Pseudocode | Yes | Algorithm 1 sketches the speculative execution procedure.
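The quoted procedure (predict the attention output, start downstream layers early, then verify against the exact result) can be sketched as follows. This is a minimal illustration in the spirit of the paper's Algorithm 1, not the authors' implementation; all function names here are hypothetical.

```python
import numpy as np

def speculative_decode_step(predict, exact_attention, downstream, verify):
    """Hypothetical sketch of one speculative decode step.

    `predict` cheaply guesses the self-attention output, letting `downstream`
    (the non-attention layers) start early; on real hardware this early work
    would overlap with `exact_attention` running on another device.
    `verify` decides whether the speculation is a hit.
    """
    guess = predict()                 # predicted attention output
    early = downstream(guess)         # speculative downstream computation
    exact = exact_attention()         # full self-attention result
    if verify(guess, exact):
        return early, True            # hit: commit the early result
    return downstream(exact), False   # miss: squash and recompute downstream
```

With a perfect predictor the early result is committed; with a wrong one the step falls back to the exact path, so correctness never depends on the prediction.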
Open Source Code | Yes | Our code is at github.com/mcj-group/alspec
Open Datasets | Yes | We evaluate correctness and measure the speculation hit rate at various choices of λ. We group the evaluations into categories of {Question Answering (QA), Information Retrieval (IR), Reasoning, Long Context, and Math}. We use the LM Evaluation Harness (Gao et al., 2024), a framework for few-shot language model evaluation, with the default prompts and few-shot templates. Table 1 shows the output quality and speculation hit rate for λ ∈ {0.05, 0.10, 0.15, 0.20, 0.25} compared to baseline. ALSpec with λ ∈ {0.05, 0.10} achieves on-par or better correctness on all evaluated tasks, with speculation hit rates ranging from 18% (HotpotQA) to 90% (IFEval) depending on the task, and most speculation hit rates exceeding 50% for λ = 0.10.
Dataset Splits | Yes | We evaluate correctness and measure the speculation hit rate at various choices of λ. We group the evaluations into categories of {Question Answering (QA), Information Retrieval (IR), Reasoning, Long Context, and Math}. We use the LM Evaluation Harness (Gao et al., 2024), a framework for few-shot language model evaluation, with the default prompts and few-shot templates.
Hardware Specification | Yes | Demonstrated on Tenstorrent's NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models. We answer the previous questions with a case study of parallelizing the Llama 3.1 8B model onto 8 Tenstorrent N150 chips. For correctness evaluations, we run the model in BF16 precision using NVIDIA A100 or H100 GPUs; we simulate speculation by running both full and approximated attention (S = 128) and choosing the approximated attention output if verification succeeds.
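The simulation described in this row (compute both the full and the S-token approximated attention, keep the approximation only when verification succeeds) can be sketched as below. The relative-error verification criterion and the function names are assumptions for illustration; the paper's actual verifier and kernel are not reproduced here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, K, V):
    # exact single-query attention over the whole KV cache
    return softmax(q @ K.T / np.sqrt(q.shape[-1])) @ V

def simulate_speculation(q, K, V, S=128, lam=0.10):
    """Sketch of the GPU correctness-simulation loop (hypothetical helper).

    Computes both the approximated attention (first and last S cache entries)
    and the full attention, then keeps the approximation only if a
    relative-error check against threshold `lam` succeeds.
    """
    keep = np.r_[0:S, K.shape[0] - S:K.shape[0]]
    approx = attention(q, K[keep], V[keep])
    full = attention(q, K, V)
    # assumed verification rule: relative L2 error no larger than lam
    hit = np.linalg.norm(full - approx) / (np.linalg.norm(full) + 1e-12) <= lam
    return (approx if hit else full), hit
```

Because the full output is always available in this simulation, a verification miss simply falls back to it, so measured task accuracy is never worse than the λ = 0 baseline path.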
Software Dependencies | Yes | We conducted our experiments on Tenstorrent N150 devices using the pre-release version of the TT-Metalium v0.55.0-rc13 software stack. We measure the Llama 3 8B decoding latency on 4 vs. 8 H100 GPUs doing TP using the SGLang (Zheng et al., 2024) serving framework and the FlashInfer (Ye et al., 2025) attention backend.
Experiment Setup | Yes | Table 1 shows the output quality and speculation hit rate for λ ∈ {0.05, 0.10, 0.15, 0.20, 0.25} compared to baseline. We use a coarser version of attention sink that takes the first and last S tokens from the KV cache as input. We choose S to be small relative to the context length (e.g., S ∈ {128, 256, 512} for a 128K context length). In Section 4.1, we choose S as the chunk size of speculative flash decode. The model is executed with the default mixed-precision configuration, with BF16 activations, BF8 KV cache, and BF{16,8,4} model weights.
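The coarse attention-sink selection described here (keep only the first S and last S entries of the KV cache, with S small relative to a 128K context) reduces to a simple slicing step. A minimal sketch, with the helper name being an assumption:

```python
import numpy as np

def sink_kv(K, V, S=128):
    """Hypothetical helper: coarse attention-sink KV selection.

    Keeps the first S cache entries (the attention sink) and the last S
    entries (the most recent tokens), e.g. S in {128, 256, 512} against a
    128K-token context.
    """
    n = K.shape[0]
    if n <= 2 * S:
        return K, V  # cache not yet longer than 2S: keep everything
    keep = np.r_[0:S, n - S:n]
    return K[keep], V[keep]
```

At a 128K context with S = 128, this retains 256 of 131,072 entries, which is the roughly 99.8% attention reduction that makes the approximated path cheap enough to speculate on.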