RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Authors: Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on ∞Bench for LLaMA-3.1-8B) with more than 2× speedups for long-context inference.
Researcher Affiliation Collaboration 1 National University of Singapore; 2 DAMO Academy, Alibaba Group; 3 Hupan Lab, 310023, Hangzhou, China. Correspondence to: Guanzheng Chen <EMAIL>, Michael Qizhe Shieh <EMAIL>.
Pseudocode Yes Algorithm 1 Retrieval-Augmented Speculative Decoding
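The report only cites the existence of Algorithm 1; the algorithm itself is not reproduced here. As background, a minimal draft-then-verify speculative decoding loop with a drafter proposing γ tokens per step can be sketched as follows. This is a generic greedy-acceptance sketch, not RAPID's Algorithm 1; `target_next` and `draft_next` are hypothetical stand-ins for the target LLM and the RAG drafter, and a real implementation would verify all γ draft tokens in one batched target forward pass rather than one call per token.

```python
def speculative_decode(target_next, draft_next, prompt, gamma=10, max_new=50):
    """Generic draft-then-verify loop with greedy acceptance.

    target_next / draft_next: callables mapping a token sequence to the
    next token (stand-ins for the target LLM and the RAG drafter).
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Drafter proposes gamma tokens autoregressively.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(out + draft))
        # Target verifies: accept the longest prefix it agrees with...
        for tok in draft:
            if target_next(out) == tok:
                out.append(tok)
            else:
                break
        # ...then always emits one token of its own (the "bonus" token).
        out.append(target_next(out))
    return out[len(prompt):]
```

When the drafter mostly agrees with the target, each iteration emits up to γ+1 tokens for the cost of one target verification pass, which is the source of the wall-clock speedup.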
Open Source Code Yes Code: https://github.com/NUS-TRAIL/RAPID
Open Datasets Yes Our RAPID can serve as a drop-in decoding method during long-context inference. We conduct experiments on LLaMA-3.1 (8B, 70B) (Dubey et al., 2024) and Qwen2.5 (7B, 72B) (Yang et al., 2024) series on ∞Bench (Zhang et al.) and LongBench v2 (Bai et al., 2024b).
Dataset Splits Yes We apply middle truncation following benchmark setup to ensure the context length within 128K tokens. ... We conduct efficiency evaluations using the LongBench v2 (Long, CoT) subset, where each example involves 120K (tokens) context length after truncation and 1K maximum generation tokens.
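"Middle truncation" here means dropping tokens from the center of an over-long context so that its beginning and end survive, a common convention in long-context benchmarks. A minimal sketch, assuming tokens are held in a plain list (the function name and signature are illustrative, not from the paper):

```python
def middle_truncate(tokens, max_len=128_000):
    """Keep the head and tail of the context, dropping the middle.

    If the sequence already fits within max_len, it is returned unchanged;
    otherwise the first half and last half of the budget are concatenated.
    """
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[len(tokens) - (max_len - half):]
```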
Hardware Specification Yes For base-scale models (LLaMA-3.1-8B and Qwen2.5-7B), we evaluate RAPID's self-speculation capabilities against multiple baselines including naive Speculative Decoding, MagicDec, Long Context (LC), and RAG implementations, using a single NVIDIA A800 80GB GPU. For large-scale models (LLaMA-3.1-70B and Qwen2.5-72B), self-speculation experiments are conducted using a distributed setup with 8 A800 80GB GPUs. In upward-speculation settings, we employ a hybrid configuration where the target models (LLaMA-3.1-8B/Qwen2.5-7B) operate on a single A800 80GB GPU, while leveraging an additional 7 A800 80GB GPUs to accommodate the larger RAG drafter.
Software Dependencies No The paper mentions models like LLaMA-3.1 and Qwen2.5, and tools like BGE-M3 for embedding, but does not provide specific version numbers for underlying software frameworks (e.g., PyTorch, TensorFlow) or programming languages.
Experiment Setup Yes The RAG drafter generates γ = 10 tokens per step for target LLM verification. We search η in Eq. (6) among {5, 10, 20} for self-speculation and {40, 50} for upward-speculation, which is further investigated in Section 4.5. ... The long context is segmented into 512-token chunks and embedded using BGE-M3 (Chen et al., 2024b). We retrieve top-k segments based on cosine similarity with the query embedding, filtering out segments below a 0.3 similarity threshold. The retrieval context length is bounded between 4096 tokens and 1/24 of the input length. ... We use temperature values of 1.0 and 0.1 for ∞Bench and LongBench v2, respectively.
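The retrieval side of this setup (512-token chunks, cosine-similarity ranking, a 0.3 floor, and a token budget tied to the input length) can be sketched as below. This is a schematic reading, not the paper's code: embeddings are plain vectors rather than BGE-M3 outputs, `retrieve` is a hypothetical name, and the budget line reads "bounded between 4096 tokens and 1/24 of the input length" as a lower bound of 4096 with the 1/24 fraction dominating for long inputs; the paper's exact clamping may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(chunk_embs, query_emb, input_len,
             chunk_tokens=512, sim_threshold=0.3, min_ctx=4096, frac=24):
    """Return indices of the top-k chunks for the RAG drafter's context.

    Budget: at least min_ctx tokens, scaling as input_len / frac
    (assumed interpretation of the paper's bound).
    """
    budget = max(min_ctx, input_len // frac)
    k = max(1, budget // chunk_tokens)
    # Score every chunk, drop those below the similarity floor.
    scored = [(cosine(e, query_emb), i) for i, e in enumerate(chunk_embs)]
    scored = [(s, i) for s, i in scored if s >= sim_threshold]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

For a 120K-token input this yields a budget of 5000 tokens, i.e. roughly nine or ten 512-token chunks, which matches the efficiency-evaluation regime described above.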