RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Authors: Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on ∞Bench for LLaMA-3.1-8B) with more than 2× speedups for long-context inference.
Researcher Affiliation Collaboration 1 National University of Singapore; 2 DAMO Academy, Alibaba Group; 3 Hupan Lab, 310023, Hangzhou, China. Correspondence to: Guanzheng Chen <EMAIL>, Michael Qizhe Shieh <EMAIL>.
Pseudocode Yes Algorithm 1 Retrieval-Augmented Speculative Decoding
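The report only cites the existence of Algorithm 1; the algorithm itself is not reproduced here. As background, a minimal draft-then-verify speculative decoding loop with a drafter proposing γ tokens per step can be sketched as follows. This is a generic greedy-acceptance sketch, not RAPID's Algorithm 1; `target_next` and `draft_next` are hypothetical stand-ins for the target LLM and the RAG drafter, and a real implementation would verify all γ draft tokens in one batched target forward pass rather than one call per token.

```python
def speculative_decode(target_next, draft_next, prompt, gamma=10, max_new=50):
    """Generic draft-then-verify loop with greedy acceptance.

    target_next / draft_next: callables mapping a token sequence to the
    next token (stand-ins for the target LLM and the RAG drafter).
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Drafter proposes gamma tokens autoregressively.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(out + draft))
        # Target verifies: accept the longest prefix it agrees with...
        for tok in draft:
            if target_next(out) == tok:
                out.append(tok)
            else:
                break
        # ...then always emits one token of its own (the "bonus" token).
        out.append(target_next(out))
    return out[len(prompt):]
```

When the drafter mostly agrees with the target, each iteration emits up to γ+1 tokens for the cost of one target verification pass, which is the source of the wall-clock speedup.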
Open Source Code Yes Code: https://github.com/NUS-TRAIL/RAPID
Open Datasets Yes Our RAPID can serve as a drop-in decoding method during long-context inference. We conduct experiments on LLaMA-3.1 (8B, 70B) (Dubey et al., 2024) and Qwen2.5 (7B, 72B) (Yang et al., 2024) series on ∞Bench (Zhang et al.) and LongBench v2 (Bai et al., 2024b).
Dataset Splits Yes We apply middle truncation following benchmark setup to ensure the context length within 128K tokens. ... We conduct efficiency evaluations using the LongBench v2 (Long, CoT) subset, where each example involves 120K (tokens) context length after truncation and 1K maximum generation tokens.
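"Middle truncation" here means dropping tokens from the center of an over-long context so that its beginning and end survive, a common convention in long-context benchmarks. A minimal sketch, assuming tokens are held in a plain list (the function name and signature are illustrative, not from the paper):

```python
def middle_truncate(tokens, max_len=128_000):
    """Keep the head and tail of the context, dropping the middle.

    If the sequence already fits within max_len, it is returned unchanged;
    otherwise the first half and last half of the budget are concatenated.
    """
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[len(tokens) - (max_len - half):]
```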
Hardware Specification Yes For base-scale models (LLaMA-3.1-8B and Qwen2.5-7B), we evaluate RAPID's self-speculation capabilities against multiple baselines including naive Speculative Decoding, MagicDec, Long Context (LC), and RAG implementations, using a single NVIDIA A800 80GB GPU. For large-scale models (LLaMA-3.1-70B and Qwen2.5-72B), self-speculation experiments are conducted using a distributed setup with 8 A800 80GB GPUs. In upward-speculation settings, we employ a hybrid configuration where the target models (LLaMA-3.1-8B/Qwen2.5-7B) operate on a single A800 80GB GPU, while leveraging an additional 7 A800 80GB GPUs to accommodate the larger RAG drafter.
Software Dependencies No The paper mentions models like LLaMA-3.1 and Qwen2.5, and tools like BGE-M3 for embedding, but does not provide specific version numbers for underlying software frameworks (e.g., PyTorch, TensorFlow) or programming languages.
Experiment Setup Yes The RAG drafter generates γ = 10 tokens per step for target LLM verification. We search η in Eq. (6) among {5, 10, 20} for self-speculation and {40, 50} for upward-speculation, which is further investigated in Section 4.5. ... The long context is segmented into 512-token chunks and embedded using BGE-M3 (Chen et al., 2024b). We retrieve top-k segments based on cosine similarity with the query embedding, filtering out segments below a 0.3 similarity threshold. The retrieval context length is bounded between 4096 tokens and 1/24 of the input length. ... We use temperature values of 1.0 and 0.1 for ∞Bench and LongBench v2, respectively.
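The retrieval side of this setup (512-token chunks, cosine-similarity ranking, a 0.3 floor, and a token budget tied to the input length) can be sketched as below. This is a schematic reading, not the paper's code: embeddings are plain vectors rather than BGE-M3 outputs, `retrieve` is a hypothetical name, and the budget line reads "bounded between 4096 tokens and 1/24 of the input length" as a lower bound of 4096 with the 1/24 fraction dominating for long inputs; the paper's exact clamping may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def retrieve(chunk_embs, query_emb, input_len,
             chunk_tokens=512, sim_threshold=0.3, min_ctx=4096, frac=24):
    """Return indices of the top-k chunks for the RAG drafter's context.

    Budget: at least min_ctx tokens, scaling as input_len / frac
    (assumed interpretation of the paper's bound).
    """
    budget = max(min_ctx, input_len // frac)
    k = max(1, budget // chunk_tokens)
    # Score every chunk, drop those below the similarity floor.
    scored = [(cosine(e, query_emb), i) for i, e in enumerate(chunk_embs)]
    scored = [(s, i) for s, i in scored if s >= sim_threshold]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

For a 120K-token input this yields a budget of 5000 tokens, i.e. roughly nine or ten 512-token chunks, which matches the efficiency-evaluation regime described above.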