Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

Authors: Jingyu Liu, Beidi Chen, Ce Zhang

ICML 2025

Reproducibility checklist (variable: result, followed by the supporting excerpt from the paper):
Research Type: Experimental. "We evaluate SPECPREFILL with a diverse set of tasks, followed by a comprehensive benchmarking of performance improvement both in a real end-to-end setting and ablation studies."
Researcher Affiliation: Academia. "1 Department of Computer Science, The University of Chicago, Chicago, IL, USA; 2 Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Jingyu Liu <EMAIL>, Ce Zhang <EMAIL>."
Pseudocode: Yes. "In Algorithm 1, we list the high-level steps of conducting SPECPREFILL."
Open Source Code: Yes. "The code with experiment reproduction is available at https://github.com/anonymous/speculative_prefill."
Open Datasets: Yes. "We start with long context tasks using LongBench (Bai et al., 2024)... RULER (Hsieh et al., 2024)... We select tasks spanning general knowledge (Generative MMLU (Hendrycks et al., 2021) and Instruction Following Evaluation (Zhou et al., 2023)), math (GSM8K 8 Shots (Cobbe et al., 2021)), coding (HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021)), and reasoning abilities (ARC Challenge (Clark et al., 2018) and GPQA 8 Shots (Rein et al., 2023))."
Dataset Splits: No. "For standard short task evaluation in Sec. 4.6, we use LM-EVAL-HARNESS (Gao et al., 2024a) and EVAL-PLUS (Liu et al., 2023)."
Hardware Specification: Yes. "We run all experiments using a tensor parallelism of 8 for both the speculator and the base model across either 8 NVIDIA H100s or H200s (full system specification in Appendix D and guidance on reproducing results in Appendix C)."
Software Dependencies: Yes. "When running all experiments in vLLM (0.6.3.post1), we set enforce_eager=True and enable_chunked_prefill=False to avoid any unexpected behaviors."
Experiment Setup: Yes. "We run all experiments using a tensor parallelism of 8 for both the speculator and the base model... We choose Llama-3.1-8B-Instruct (Grattafiori et al., 2024) with BF16 precision as our speculator... and couple it with either Llama-3.1-70B-Instruct (BF16) or Llama-3.1-405B-Instruct-FP8 (fully quantized FP8) as the base model. In terms of token keep rate, we use a fixed percentage (i.e., the ratio of chunks kept when we do chunk selection) for a given task."
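The paper's Algorithm 1 is not reproduced in this checklist. As a rough illustration only, here is a minimal Python sketch of the overall flow: a speculator assigns importance scores to prompt tokens, and only the top-scoring fraction is kept (in original order) for the base model's prefill. The `speculator_importance` function is a hypothetical stub; in SPECPREFILL the scores come from the small speculator model itself.

```python
# Illustrative sketch of the high-level speculative-prefill flow.
# `speculator_importance` is a toy stand-in for scores that the
# speculator model would actually produce.

def speculator_importance(token_ids):
    """Stand-in per-token importance scores (toy heuristic)."""
    # Pretend later tokens matter slightly more, just for illustration.
    return [i / max(1, len(token_ids) - 1) for i in range(len(token_ids))]

def speculative_prefill(token_ids, keep_rate):
    """Keep the top `keep_rate` fraction of tokens, preserving prompt order."""
    scores = speculator_importance(token_ids)
    n_keep = max(1, int(len(token_ids) * keep_rate))
    ranked = sorted(range(len(token_ids)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore original order
    return [token_ids[i] for i in kept]

# Example: keep 50% of an 8-token prompt.
print(speculative_prefill(list(range(100, 108)), keep_rate=0.5))
# → [104, 105, 106, 107]
```

The reduced token list would then be handed to the base model, which prefills a much shorter sequence, improving time-to-first-token.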
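The vLLM settings quoted above (`enforce_eager=True`, `enable_chunked_prefill=False`, tensor parallelism of 8) map directly onto vLLM's Python entry point. This is a configuration sketch only, not runnable without a multi-GPU node; the model identifier is illustrative.

```python
# Configuration sketch matching the reported settings (requires 8 GPUs;
# model name shown for illustration).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,        # TP=8 for both speculator and base model
    enforce_eager=True,            # disable CUDA graph capture
    enable_chunked_prefill=False,  # avoid interactions with prefill changes
)
```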
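The fixed-percentage token keep rate described in the experiment setup operates at the chunk level: the prompt is split into fixed-size chunks, each chunk is scored, and a fixed ratio of chunks survives. A minimal sketch, with a caller-supplied `chunk_score` standing in for the speculator-derived scoring (the toy example below scores a chunk by its maximum token id, purely for illustration):

```python
# Sketch of fixed-ratio chunk selection: tokens are grouped into
# fixed-size chunks, each chunk receives a score, and a fixed
# percentage of chunks is kept in original prompt order.

def select_chunks(tokens, chunk_size, keep_ratio, chunk_score):
    """Keep `keep_ratio` of the chunks with the highest `chunk_score`."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    n_keep = max(1, round(len(chunks) * keep_ratio))
    ranked = sorted(range(len(chunks)), key=lambda i: chunk_score(chunks[i]), reverse=True)
    kept = sorted(ranked[:n_keep])  # restore prompt order
    return [tok for i in kept for tok in chunks[i]]

# Toy example: 4 chunks of 2 tokens, keep 50% of chunks, score = max token id.
tokens = [3, 1, 9, 2, 8, 4, 0, 5]
print(select_chunks(tokens, chunk_size=2, keep_ratio=0.5, chunk_score=max))
# → [9, 2, 8, 4]
```

Because the ratio is fixed per task, the base model's prefill cost shrinks by roughly the same factor regardless of prompt length.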