Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits
Authors: Ashish Khisti, MohammadReza Ebrahimi, Hassan Dbouk, Arash Behboodi, Roland Memisevic, Christos Louizos
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios. |
| Researcher Affiliation | Collaboration | Ashish Khisti (1,2), M.Reza Ebrahimi (1), Hassan Dbouk (1), Arash Behboodi (1), Roland Memisevic (1), Christos Louizos (1). (1) Qualcomm AI Research, (2) University of Toronto. Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc. |
| Pseudocode | Yes | Algorithm 1 Speculative Sampling Algorithm 2 Truncated LP |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use the OPT models (Zhang et al., 2022), where the draft model has 125 million parameters and the target model has 13B parameters. For evaluation purposes we consider the datasets associated with the XSum (Narayan et al., 2018), Databricks-Dolly-15k (Conover et al., 2023) and the WMT18 (Bojar et al., 2018) tasks. |
| Dataset Splits | Yes | For evaluation purposes we consider the datasets associated with the XSum (Narayan et al., 2018), Databricks-Dolly-15k (Conover et al., 2023) and the WMT18 (Bojar et al., 2018) tasks. |
| Hardware Specification | Yes | We conduct experiments using an instance of A100 GPU with 80GB memory. |
| Software Dependencies | No | The paper mentions software like 'OPT models' but does not specify any software libraries or frameworks with their version numbers. |
| Experiment Setup | Yes | We set the temperature of the target model to 1.0 and the temperature of one draft model to 1.2, while varying the temperature of the other draft model over the range 1.0 to 2.4. In all our experiments we generate 5 tokens per call of the draft model. In the IS scheme we employ both truncated LP (with s = 5 as the truncation parameter) and a truncated alphabet (of size 40 tokens) as discussed in section 4. |
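For context on the baseline the paper improves upon (the "Speculative Sampling" pseudocode noted in the table), the token-level accept/reject rule of single-draft speculative sampling can be sketched as below. This is a minimal illustration, not the paper's multi-draft or importance-sampling scheme; all function names are ours, and the acceptance rule min(1, p(x)/q(x)) with residual resampling follows the standard speculative sampling construction.

```python
import random

def speculative_step(p, q, rng=None):
    """One token-level accept/reject step of (single-draft) speculative sampling.

    p: target-model distribution over the vocabulary (list of probabilities)
    q: draft-model distribution over the vocabulary (list of probabilities)
    Returns a token index whose marginal distribution is exactly p.
    """
    rng = rng or random.Random(0)
    # Draw a candidate token from the draft distribution q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p(x) / q(x)); q[x] > 0 since x ~ q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    return rng.choices(range(len(p)), weights=[r / z for r in residual])[0]

def exact_marginal(p, q):
    """Closed-form output distribution of speculative_step, for verification."""
    # Probability of drawing and accepting token i is min(p_i, q_i).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    return [a + reject_mass * (r / z if z > 0 else 0.0)
            for a, r in zip(accept, residual)]
```

A quick check with any pair of distributions confirms the losslessness property that the paper's multi-draft analysis generalizes: `exact_marginal(p, q)` reproduces the target distribution `p` regardless of the draft distribution `q`.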