Faster Cascades via Speculative Decoding
Authors: Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with Gemma and T5 models on a range of language benchmarks show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines. |
| Researcher Affiliation | Industry | 1Google Research, 2Google DeepMind, 3Mistral AI |
| Pseudocode | Yes | Algorithm 1 SpecDecode... Algorithm 2 TokenCascade... Algorithm 3 OracleCascade... Algorithm 4 GenSpecSample... Algorithm 5 SpecCascade... Algorithm 6 TokenSpecCascade |
| Open Source Code | Yes | Illustrative colab notebook with Gemma models available at: https://github.com/google-research/google-research/tree/master/speculative_cascades. |
| Open Datasets | Yes | We construct these cascades using Chow's rule in (2), and the Diff rule in (5)... As a concrete example, we consider token-level cascades of T5 models (Raffel et al., 2020) of two different sizes finetuned on a WMT EN→DE translation task (Bojar et al., 2014) and an extreme summarization (XSum) task (Narayan et al., 2018). ...To evaluate the Gemma model cascades, we use few-shot prompting with 8 language benchmarks: WMT, CNN/DM, GSM8K, MBPP, SQuAD 2.0, Web Questions, Natural QA and Trivia QA; many of these feature in the Spec-Bench suite (Xia et al., 2024). |
| Dataset Splits | No | For the WMT English-to-German translation task (Bojar et al., 2014), we use a validation sample of size 3,000 provided with the dataset. We set the maximum input length to 80 and the maximum output length to 80. For the Extreme Summarization (XSum) task (Narayan et al., 2018), we use a validation sample of size 11,305, and set the maximum input length to 1,024 and the maximum output length to 64. For the CNN/Daily Mail summarization task (Hermann et al., 2015), we use a validation sample of size 13,368, and set the maximum input length to 2,048 and the maximum output length to 128. ...In each case, we sample 1,000 prompts for evaluation. |
| Hardware Specification | Yes | All methods are run on the same TPUv4 device. |
| Software Dependencies | No | The paper mentions using T5 v1.1 family of encoder-decoder models and Gemma v2 decoder-only models, but does not provide specific version numbers for software dependencies like programming languages or libraries (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | For the T5 experiments, unless otherwise specified, we set the block-size γ to 5 for all methods that use speculative execution. For the token-level cascades, we allow the small model to predict for a maximum of 10 tokens (similar to (Kim et al., 2023)), before invoking the large model. ... We use temperatures T = 0, 0.1, 0.5, 1.0, and block sizes γ = 3, 5, 7 (full results in Appendix F). Following the protocol in Leviathan et al. (2023); Zhou et al. (2024), to measure latency, we evaluate the wall-clock decoding time with batch size 1. ... We employ few-shot inference, and set the maximum output length to 80 for WMT, to 128 for CNN/DM, to 320 for GSM8K and MBPP, and to 5 for all the question-answering datasets. ...we initialize with the public checkpoints, pre-train them further for 100K steps, and supervised-finetune the pre-trained models on the three respective tasks. We finetune them for a maximum of 250K steps on WMT, a maximum of 100K steps on XSum, and a maximum of 200K steps on CNN/DM. ... When implementing speculative cascades and speculative decoding with Gemma models, we use block-size γ = 1. ... We pick β using a grid-search over 1,000 values between α and 10. |
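The pseudocode row above lists a speculative-decoding routine (Algorithm 1). For context, the standard speculative-sampling accept/reject step from Leviathan et al. (2023), which such routines build on, can be sketched as follows; all names here are illustrative and not taken from the paper's released code:

```python
import numpy as np

def speculative_verify(draft_probs, target_probs, draft_tokens, rng):
    """Accept or reject draft tokens using the standard speculative
    sampling rule: accept token x with probability
    min(1, p_target(x) / p_draft(x)); on the first rejection, resample
    one token from the residual distribution max(0, p_target - p_draft)
    and stop."""
    accepted = []
    for t, (q, p) in enumerate(zip(draft_probs, target_probs)):
        x = draft_tokens[t]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)  # draft token matches target closely enough
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break  # remaining draft tokens are discarded
    return accepted
```

When the draft and target distributions coincide, every draft token is accepted, which is what makes the scheme lossless with respect to the target model's distribution.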
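The datasets row quotes the paper's use of Chow's rule (its equation 2) for token-level cascades, and the setup row notes that the small model predicts at most 10 tokens before the large model is invoked. A minimal sketch of that deferral loop, assuming greedy decoding and illustrative `small_step`/`large_step` interfaces that map a token sequence to a next-token distribution (not the paper's API):

```python
import numpy as np

def token_cascade(small_step, large_step, prompt, threshold,
                  max_small_tokens=10, max_len=64):
    """Token-level cascade sketch: keep the small model's token while its
    top-token probability clears the confidence threshold (Chow's rule);
    after a low-confidence step, or after `max_small_tokens` consecutive
    small-model tokens, take one token from the large model instead."""
    seq = list(prompt)
    small_run = 0  # consecutive tokens taken from the small model
    while len(seq) - len(prompt) < max_len:
        probs = small_step(seq)
        if probs.max() >= threshold and small_run < max_small_tokens:
            seq.append(int(probs.argmax()))
            small_run += 1
        else:
            probs = large_step(seq)  # defer this token to the large model
            seq.append(int(probs.argmax()))
            small_run = 0
    return seq[len(prompt):]
```

The threshold here plays the role of the deferral rule's operating point; sweeping it traces out the cost-quality curve the paper evaluates.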
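The setup row also states that β is picked via a grid search over 1,000 values between α and 10. A hedged sketch of that selection, where `score_fn` is a hypothetical stand-in for whatever validation objective (e.g., a cost-quality score) the search maximizes:

```python
import numpy as np

def pick_beta(alpha, score_fn, num_values=1000, upper=10.0):
    """Grid-search sketch for the deferral parameter beta: evaluate the
    supplied score on `num_values` evenly spaced candidates between
    `alpha` and `upper`, and return the best-scoring candidate."""
    grid = np.linspace(alpha, upper, num_values)
    scores = [score_fn(b) for b in grid]
    return float(grid[int(np.argmax(scores))])
```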