Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
Authors: Jingyu Liu, Beidi Chen, Ce Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SPECPREFILL with a diverse set of tasks, followed by a comprehensive benchmarking of performance improvement both in a real end-to-end setting and ablation studies. |
| Researcher Affiliation | Academia | 1Department of Computer Science, The University of Chicago, Chicago, IL, USA 2Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Jingyu Liu <EMAIL>, Ce Zhang <EMAIL>. |
| Pseudocode | Yes | In Algorithm 1, we list the high-level steps of conducting SPECPREFILL. |
| Open Source Code | Yes | The code with experiment reproduction is available at https://github.com/anonymous/speculative_prefill. |
| Open Datasets | Yes | We start with long context tasks using LongBench (Bai et al., 2024)... RULER (Hsieh et al., 2024)... We select tasks spanning general knowledge (Generative MMLU (Hendrycks et al., 2021) and Instruction Following Evaluation (Zhou et al., 2023)), math (GSM8K 8 Shots (Cobbe et al., 2021)), coding (HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021)), and reasoning abilities (ARC Challenge (Clark et al., 2018) and GPQA 8 Shots (Rein et al., 2023)). |
| Dataset Splits | No | For standard short task evaluation in Sec 4.6, we use LM-EVAL-HARNESS (Gao et al., 2024a) and EVAL-PLUS (Liu et al., 2023). |
| Hardware Specification | Yes | We run all of our experiments using a tensor parallelism of 8 for both the speculator and the base model across either 8 NVIDIA H100s or H200s (full system specification in Appendix D and guidance on reproducing results in Appendix C). |
| Software Dependencies | Yes | When running all experiments in vLLM (0.6.3.post1), we set enforce_eager=True and enable_chunked_prefill=False to avoid any unexpected behaviors. |
| Experiment Setup | Yes | We run all of our experiments using a tensor parallelism of 8 for both the speculator and the base model... We choose Llama-3.1-8B-Instruct (Grattafiori et al., 2024) with BF16 precision as our speculator... and couple it with either Llama-3.1-70B-Instruct (BF16) or Llama-3.1-405B-Instruct-FP8 (fully quantized FP8) as the base model. In terms of token keep rate, we use a fixed percentage (i.e. the ratio of chunks when we do chunk selection) for a given task. |
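The fixed-keep-rate chunk selection described in the Experiment Setup row can be sketched as follows. This is a hypothetical minimal illustration, not the authors' implementation: the function name `select_chunks`, the mean pooling of token scores into chunk scores, and the tie-breaking by score are all assumptions made for clarity.

```python
# Hypothetical sketch (not the paper's code) of fixed-rate chunk selection:
# per-token importance scores are pooled into fixed-size chunks, and a fixed
# fraction of chunks (the "token keep rate") is retained in original order.

def select_chunks(scores, chunk_size, keep_rate):
    """Return sorted token indices belonging to the kept chunks.

    scores: per-token importance estimates (e.g. produced by a speculator)
    chunk_size: number of consecutive tokens per chunk
    keep_rate: fraction of chunks to keep (fixed per task)
    """
    # Group consecutive tokens into chunks; mean-pool scores per chunk
    # (the pooling choice is an assumption for this sketch).
    chunks = [scores[i:i + chunk_size] for i in range(0, len(scores), chunk_size)]
    chunk_scores = [sum(c) / len(c) for c in chunks]

    # Keep the highest-scoring fraction of chunks (at least one).
    n_keep = max(1, round(keep_rate * len(chunks)))
    kept = sorted(range(len(chunks)), key=lambda i: chunk_scores[i], reverse=True)[:n_keep]

    # Expand kept chunks back to token indices, preserving original order.
    return [i * chunk_size + j
            for i in sorted(kept)
            for j in range(len(chunks[i]))]
```

For example, with a keep rate of 0.5 over three chunks, the highest-scoring half of the chunks is retained and the surviving tokens are passed, in their original order, to the base model's prefill.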