Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits

Authors: Ashish Khisti, MohammadReza Ebrahimi, Hassan Dbouk, Arash Behboodi, Roland Memisevic, Christos Louizos

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.
Researcher Affiliation Collaboration Ashish Khisti (1,2), M.Reza Ebrahimi (1), Hassan Dbouk (1), Arash Behboodi (1), Roland Memisevic (1), Christos Louizos (1). Affiliations: 1. Qualcomm AI Research; 2. University of Toronto. Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Pseudocode Yes Algorithm 1 Speculative Sampling Algorithm 2 Truncated LP
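For context on the pseudocode the paper provides, the acceptance/resampling rule underlying speculative sampling can be illustrated in the single-draft case. This is a minimal sketch of that standard rule, not the paper's multi-draft Algorithm 1 or its Truncated LP; the function name and arguments are illustrative:

```python
import numpy as np

def speculative_sample(p, q, rng):
    """One step of single-draft speculative sampling.

    p: target-model distribution over the vocabulary (sums to 1)
    q: draft-model distribution over the vocabulary (sums to 1)
    Returns a token whose marginal distribution is exactly p.
    """
    # Draft proposes a token x ~ q; accept it with probability min(1, p[x]/q[x]).
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the normalized residual max(p - q, 0),
    # which restores the target distribution p overall.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)
```

The key property is that accepted and resampled tokens together are distributed exactly according to the target model, regardless of how poor the draft distribution is.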
Open Source Code No The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets Yes We use the OPT models (Zhang et al., 2022), where the draft model has 125 million parameters and the target model has 13B parameters. For evaluation purposes we consider the datasets associated with the XSum (Narayan et al., 2018), Databricks-Dolly-15k (Conover et al., 2023) and the WMT18 (Bojar et al., 2018) tasks.
Dataset Splits Yes For evaluation purposes we consider the datasets associated with the XSum (Narayan et al., 2018), Databricks-Dolly-15k (Conover et al., 2023) and the WMT18 (Bojar et al., 2018) tasks.
Hardware Specification Yes We conduct experiments using an instance of A100 GPU with 80GB memory.
Software Dependencies No The paper mentions software like 'OPT models' but does not specify any software libraries or frameworks with their version numbers.
Experiment Setup Yes We set the temperature of the target model to 1.0 and that of one draft model to 1.2, while varying the temperature of the other draft model over the range 1.0 to 2.4. In all our experiments we generate 5 tokens per call of the draft model. In the IS scheme we employ both the truncated LP (with s = 5 as the truncation parameter) and a truncated alphabet (of size 40 tokens), as discussed in Section 4.
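The setup above involves two standard operations: temperature scaling of model logits and truncating the distribution to its most probable tokens before renormalizing. A minimal sketch of both, with function names and parameters chosen for illustration (the paper's own truncation uses size 40):

```python
import numpy as np

def temperature_softmax(logits, temperature=1.0):
    # Scale logits by 1/T before the softmax; T > 1 flattens the
    # distribution, T = 1 recovers the raw softmax.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def truncate_alphabet(probs, k=40):
    # Keep only the k most probable tokens, zero out the rest,
    # and renormalize so the result is again a distribution.
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()
```

Restricting the support to a small truncated alphabet keeps the token-level selection problem (e.g. the LP in Algorithm 2) tractable at the cost of ignoring low-probability tokens.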