Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

Authors: Jingyu Liu, Beidi Chen, Ce Zhang

ICML 2025

Reproducibility checklist (variable: result, followed by the supporting excerpt from the paper):
Research Type: Experimental. "We evaluate SPECPREFILL with a diverse set of tasks, followed by a comprehensive benchmarking of performance improvement both in a real end-to-end setting and ablation studies."
Researcher Affiliation: Academia. "1 Department of Computer Science, The University of Chicago, Chicago, IL, USA; 2 Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Jingyu Liu <EMAIL>, Ce Zhang <EMAIL>."
Pseudocode: Yes. "In Algorithm 1, we list the high-level steps of conducting SPECPREFILL."
Open Source Code: Yes. "The code with experiment reproduction is available at https://github.com/anonymous/speculative_prefill."
Open Datasets: Yes. "We start with long context tasks using LongBench (Bai et al., 2024)... RULER (Hsieh et al., 2024)... We select tasks spanning general knowledge (Generative MMLU (Hendrycks et al., 2021) and Instruction Following Evaluation (Zhou et al., 2023)), math (GSM8K 8 Shots (Cobbe et al., 2021)), coding (HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021)), and reasoning abilities (ARC Challenge (Clark et al., 2018) and GPQA 8 Shots (Rein et al., 2023))."
Dataset Splits: No. "For standard short task evaluation in Sec. 4.6, we use LM-EVAL-HARNESS (Gao et al., 2024a) and EVAL-PLUS (Liu et al., 2023)."
Hardware Specification: Yes. "We run all experiments using a tensor parallelism of 8 for both the speculator and the base model across either 8 NVIDIA H100s or H200s (full system specification in Appendix D and guidance on reproducing results in Appendix C)."
Software Dependencies: Yes. "When running all experiments in vLLM (0.6.3.post1), we set enforce_eager=True and enable_chunked_prefill=False to avoid any unexpected behaviors."
Experiment Setup: Yes. "We run all experiments using a tensor parallelism of 8 for both the speculator and the base model... We choose Llama-3.1-8B-Instruct (Grattafiori et al., 2024) with BF16 precision as our speculator... and couple it with either Llama-3.1-70B-Instruct (BF16) or Llama-3.1-405B-Instruct-FP8 (fully quantized FP8) as the base model. In terms of token keep rate, we use a fixed percentage (i.e., the ratio of chunks kept when we do chunk selection) for a given task."
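The paper's Algorithm 1 is not reproduced in this checklist. As a rough illustration only, here is a minimal Python sketch of the overall flow: a speculator assigns importance scores to prompt tokens, and only the top-scoring fraction is kept (in original order) for the base model's prefill. The `speculator_importance` function is a hypothetical stub; in SPECPREFILL the scores come from the small speculator model itself.

```python
# Illustrative sketch of the high-level speculative-prefill flow.
# `speculator_importance` is a toy stand-in for scores that the
# speculator model would actually produce.

def speculator_importance(token_ids):
    """Stand-in per-token importance scores (toy heuristic)."""
    # Pretend later tokens matter slightly more, just for illustration.
    return [i / max(1, len(token_ids) - 1) for i in range(len(token_ids))]

def speculative_prefill(token_ids, keep_rate):
    """Keep the top `keep_rate` fraction of tokens, preserving prompt order."""
    scores = speculator_importance(token_ids)
    n_keep = max(1, int(len(token_ids) * keep_rate))
    ranked = sorted(range(len(token_ids)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore original order
    return [token_ids[i] for i in kept]

# Example: keep 50% of an 8-token prompt.
print(speculative_prefill(list(range(100, 108)), keep_rate=0.5))
# → [104, 105, 106, 107]
```

The reduced token list would then be handed to the base model, which prefills a much shorter sequence, improving time-to-first-token.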
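The vLLM settings quoted above (`enforce_eager=True`, `enable_chunked_prefill=False`, tensor parallelism of 8) map directly onto vLLM's Python entry point. This is a configuration sketch only, not runnable without a multi-GPU node; the model identifier is illustrative.

```python
# Configuration sketch matching the reported settings (requires 8 GPUs;
# model name shown for illustration).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,        # TP=8 for both speculator and base model
    enforce_eager=True,            # disable CUDA graph capture
    enable_chunked_prefill=False,  # avoid interactions with prefill changes
)
```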
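The fixed-percentage token keep rate described in the experiment setup operates at the chunk level: the prompt is split into fixed-size chunks, each chunk is scored, and a fixed ratio of chunks survives. A minimal sketch, with a caller-supplied `chunk_score` standing in for the speculator-derived scoring (the toy example below scores a chunk by its maximum token id, purely for illustration):

```python
# Sketch of fixed-ratio chunk selection: tokens are grouped into
# fixed-size chunks, each chunk receives a score, and a fixed
# percentage of chunks is kept in original prompt order.

def select_chunks(tokens, chunk_size, keep_ratio, chunk_score):
    """Keep `keep_ratio` of the chunks with the highest `chunk_score`."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    n_keep = max(1, round(len(chunks) * keep_ratio))
    ranked = sorted(range(len(chunks)), key=lambda i: chunk_score(chunks[i]), reverse=True)
    kept = sorted(ranked[:n_keep])  # restore prompt order
    return [tok for i in kept for tok in chunks[i]]

# Toy example: 4 chunks of 2 tokens, keep 50% of chunks, score = max token id.
tokens = [3, 1, 9, 2, 8, 4, 0, 5]
print(select_chunks(tokens, chunk_size=2, keep_ratio=0.5, chunk_score=max))
# → [9, 2, 8, 4]
```

Because the ratio is fixed per task, the base model's prefill cost shrinks by roughly the same factor regardless of prompt length.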