Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
Authors: Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schoenfeld, Ali Thabet, Jonas Kohler
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase our strategy on the Llama-3.1 family, where our 8B/405B-Judge achieves a speedup of 9× over Llama-405B, while maintaining its quality on a large range of benchmarks. These benefits remain present even in optimized inference frameworks, where our method reaches up to 141 tokens/s for 8B/70B-Judge and 129 tokens/s for 8B/405B on 2 and 8 H100s respectively. Our results are summarized in Table 1. |
| Researcher Affiliation | Collaboration | Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, Jonas Kohler. Work done during an internship at Meta Gen AI. Meta Gen AI. ETH Zürich. Correspondence to: EMAIL |
| Pseudocode | No | The paper describes the verification process and the judge decoding method using prose and mathematical equations (e.g., Equation 4), and illustrates concepts with figures. It does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | The paper references the 'gpt-fast' framework (Pytorch-Team, 2023) and provides a GitHub link for it: 'https://github.com/pytorch-labs/gpt-fast'. However, this is a third-party framework used for benchmarking, not the authors' own implementation code for the 'Judge Decoding' methodology described in the paper. There is no explicit statement or link provided for the source code of the authors' work. |
| Open Datasets | Yes | We conduct our experiments on several benchmarks including GSM8K (Cobbe et al., 2021), Human Eval (Chen et al., 2021) and MT-Bench (Zheng et al., 2023)... we further include multiple-choice benchmarks ARC (Clark et al., 2018) and MMLU (Hendrycks et al., 2021)... Using a subset of the wikipedia-summary dataset (Scheepers, 2017)... The set of input prompts are a mixture of newly-created questions and two public datasets that we heavily filtered (Alpaca (Taori et al., 2023) and ARC (Clark et al., 2018)). |
| Dataset Splits | No | To give a more complete picture, we further include multiple-choice benchmarks ARC (Clark et al., 2018) and MMLU (Hendrycks et al., 2021), which are atypical tasks for standard SD as only a few tokens need to be produced, but further serves as a check that our verification scheme does not degrade performance. We use the prompting templates from Dubey et al. (2024). We tune all hyperparameters on a small test split. In total we collected 500 high-quality question, correct answer, wrong answer tuples. |
| Hardware Specification | Yes | All 70B (405B) models run on 2 (8) H100 GPUs... We run all of our experiments on a single node of H100-SXM5 GPUs. For Llama-405B we use 8 GPUs and 8-bit quantization to ensure that the model fits on a single node. For Llama-70B, we use again 8-bit quantization but only 2 GPUs. |
| Software Dependencies | No | The paper mentions running benchmarks in 'Hugging Face (Wolf et al., 2020)' and the 'gpt-fast framework (Pytorch-Team, 2023)'. It also states 'We train our linear heads using the AdamW optimizer (Loshchilov & Hutter, 2019)'. However, it does not provide specific version numbers for these software components or any other libraries used for the implementation of their method. |
| Experiment Setup | Yes | We train our linear heads using the AdamW optimizer (Loshchilov & Hutter, 2019) with learning rate η = 0.0001, weight decay 0.1 and batch size 128. We tune all hyperparameters on a small test split. We experiment with embeddings from several layers... For simplicity, we thus stick to using the last embedding of the target before the RMS normalization (Zhang & Sennrich, 2019) and the language modelling (LM) head. ... we leave it at the natural value δ = 0.5. |
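The experiment-setup row can be made concrete with a toy, dependency-free sketch: a linear judge head trained with AdamW to classify token embeddings as accept/reject, then thresholded at δ. Only the hyperparameters (AdamW, η = 0.0001, weight decay 0.1, batch size 128, δ = 0.5) come from the paper; the Gaussian "embeddings", the dimension, and the step count are illustrative stand-ins, not the authors' data or code.

```python
import math
import random

random.seed(0)

DIM = 8                      # toy embedding size; the paper uses the target model's hidden size
LR, WD, BATCH = 1e-4, 0.1, 128   # AdamW hyperparameters quoted from the paper
B1, B2, EPS = 0.9, 0.999, 1e-8   # standard AdamW moment/epsilon defaults (assumption)
DELTA = 0.5                  # acceptance threshold δ from the paper

# Synthetic stand-in for target-model embeddings of "good" (1) vs "bad" (0) tokens.
def sample(label):
    shift = 1.0 if label else -1.0
    return [random.gauss(shift, 1.0) for _ in range(DIM)], label

data = [sample(i % 2) for i in range(256)]

w, b = [0.0] * DIM, 0.0
m_w, v_w = [0.0] * DIM, [0.0] * DIM
m_b = v_b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for step in range(1, 2001):
    batch = random.sample(data, BATCH)
    grad_w, grad_b = [0.0] * DIM, 0.0
    for x, y in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        err = (p - y) / len(batch)       # d(BCE)/d(logit), averaged over the batch
        for j in range(DIM):
            grad_w[j] += err * x[j]
        grad_b += err
    # AdamW update: decoupled weight decay is applied directly to the weights.
    for j in range(DIM):
        m_w[j] = B1 * m_w[j] + (1 - B1) * grad_w[j]
        v_w[j] = B2 * v_w[j] + (1 - B2) * grad_w[j] ** 2
        mhat = m_w[j] / (1 - B1 ** step)
        vhat = v_w[j] / (1 - B2 ** step)
        w[j] -= LR * (mhat / (math.sqrt(vhat) + EPS) + WD * w[j])
    m_b = B1 * m_b + (1 - B1) * grad_b
    v_b = B2 * v_b + (1 - B2) * grad_b ** 2
    b -= LR * (m_b / (1 - B1 ** step)) / (math.sqrt(v_b / (1 - B2 ** step)) + EPS)

# A draft token is accepted when the judge's score clears the threshold δ.
def accept(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= DELTA

acc = sum(accept(x) == bool(y) for x, y in data) / len(data)
print(f"judge accuracy on toy data: {acc:.2f}")
```

The point of the sketch is how small the trainable component is: a single linear layer on top of frozen target-model embeddings, which is consistent with the report's finding that the paper provides a full hyperparameter recipe but no released implementation.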