Faster Cascades via Speculative Decoding
Authors: Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with Gemma and T5 models on a range of language benchmarks show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines. |
| Researcher Affiliation | Industry | 1Google Research, 2Google DeepMind, 3Mistral AI |
| Pseudocode | Yes | Algorithm 1 SpecDecode... Algorithm 2 TokenCascade... Algorithm 3 OracleCascade... Algorithm 4 GenSpecSample... Algorithm 5 SpecCascade... Algorithm 6 TokenSpecCascade |
| Open Source Code | Yes | Illustrative colab notebook with Gemma models available at: https://github.com/google-research/google-research/tree/master/speculative_cascades. |
| Open Datasets | Yes | We construct these cascades using Chow's rule in (2), and the Diff rule in (5)... As a concrete example, we consider token-level cascades of T5 models (Raffel et al., 2020) of two different sizes finetuned on a WMT EN→DE translation task (Bojar et al., 2014) and an extreme summarization (XSum) task (Narayan et al., 2018). ...To evaluate the Gemma model cascades, we use few-shot prompting with 8 language benchmarks: WMT, CNN/DM, GSM8K, MBPP, SQuAD 2.0, Web Questions, Natural QA and Trivia QA; many of these feature in the Spec-Bench suite (Xia et al., 2024). |
| Dataset Splits | No | For the WMT English-to-German translation task (Bojar et al., 2014), we use a validation sample of size 3,000 provided with the dataset. We set the maximum input length to 80 and the maximum output length to 80. For the Extreme Summarization (XSum) task (Narayan et al., 2018), we use a validation sample of size 11,305, and set the maximum input length to 1,024 and the maximum output length to 64. For the CNN/Daily Mail summarization task (Hermann et al., 2015), we use a validation sample of size 13,368, and set the maximum input length to 2,048 and the maximum output length to 128. ...In each case, we sample 1,000 prompts for evaluation. |
| Hardware Specification | Yes | All methods are run on the same TPUv4 device. |
| Software Dependencies | No | The paper mentions using T5 v1.1 family of encoder-decoder models and Gemma v2 decoder-only models, but does not provide specific version numbers for software dependencies like programming languages or libraries (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | For the T5 experiments, unless otherwise specified, we set the block-size γ to 5 for all methods that use speculative execution. For the token-level cascades, we allow the small model to predict for a maximum of 10 tokens (similar to (Kim et al., 2023)), before invoking the large model. ... We use temperatures T = 0, 0.1, 0.5, 1.0, and block sizes γ = 3, 5, 7 (full results in Appendix F). Following the protocol in Leviathan et al. (2023); Zhou et al. (2024), to measure latency, we evaluate the wall-clock decoding time with batch size 1. ... We employ few-shot inference, and set the maximum output length to 80 for WMT, to 128 for CNN/DM, to 320 for GSM8K and MBPP, and to 5 for all the question-answering datasets. ...we initialize with the public checkpoints, pre-train them further for 100K steps, and supervised-finetune the pre-trained models on the three respective tasks. We finetune them for a maximum of 250K steps on WMT, a maximum of 100K steps on XSum, and a maximum of 200K steps on CNN/DM. ... When implementing speculative cascades and speculative decoding with Gemma models, we use block-size γ = 1. ... We pick β using a grid-search over 1,000 values between α and 10. |
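The pseudocode row above lists a speculative-decoding routine (Algorithm 1). For context, the standard speculative-sampling accept/reject step from Leviathan et al. (2023), which such routines build on, can be sketched as follows; all names here are illustrative and not taken from the paper's released code:

```python
import numpy as np

def speculative_verify(draft_probs, target_probs, draft_tokens, rng):
    """Accept or reject draft tokens using the standard speculative
    sampling rule: accept token x with probability
    min(1, p_target(x) / p_draft(x)); on the first rejection, resample
    one token from the residual distribution max(0, p_target - p_draft)
    and stop."""
    accepted = []
    for t, (q, p) in enumerate(zip(draft_probs, target_probs)):
        x = draft_tokens[t]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)  # draft token matches target closely enough
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break  # remaining draft tokens are discarded
    return accepted
```

When the draft and target distributions coincide, every draft token is accepted, which is what makes the scheme lossless with respect to the target model's distribution.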
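The datasets row quotes the paper's use of Chow's rule (its equation 2) for token-level cascades, and the setup row notes that the small model predicts at most 10 tokens before the large model is invoked. A minimal sketch of that deferral loop, assuming greedy decoding and illustrative `small_step`/`large_step` interfaces that map a token sequence to a next-token distribution (not the paper's API):

```python
import numpy as np

def token_cascade(small_step, large_step, prompt, threshold,
                  max_small_tokens=10, max_len=64):
    """Token-level cascade sketch: keep the small model's token while its
    top-token probability clears the confidence threshold (Chow's rule);
    after a low-confidence step, or after `max_small_tokens` consecutive
    small-model tokens, take one token from the large model instead."""
    seq = list(prompt)
    small_run = 0  # consecutive tokens taken from the small model
    while len(seq) - len(prompt) < max_len:
        probs = small_step(seq)
        if probs.max() >= threshold and small_run < max_small_tokens:
            seq.append(int(probs.argmax()))
            small_run += 1
        else:
            probs = large_step(seq)  # defer this token to the large model
            seq.append(int(probs.argmax()))
            small_run = 0
    return seq[len(prompt):]
```

The threshold here plays the role of the deferral rule's operating point; sweeping it traces out the cost-quality curve the paper evaluates.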
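The setup row also states that β is picked via a grid search over 1,000 values between α and 10. A hedged sketch of that selection, where `score_fn` is a hypothetical stand-in for whatever validation objective (e.g., a cost-quality score) the search maximizes:

```python
import numpy as np

def pick_beta(alpha, score_fn, num_values=1000, upper=10.0):
    """Grid-search sketch for the deferral parameter beta: evaluate the
    supplied score on `num_values` evenly spaced candidates between
    `alpha` and `upper`, and return the best-scoring candidate."""
    grid = np.linspace(alpha, upper, num_values)
    scores = [score_fn(b) for b in grid]
    return float(grid[int(np.argmax(scores))])
```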