SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Authors: Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT can achieve over a 1.3×–1.6× speedup while preserving the original distribution of the generated text."
Researcher Affiliation | Collaboration | Heming Xia(1), Yongqi Li(1), Jun Zhang(2), Cunxiao Du(3), Wenjie Li(1) — (1) Department of Computing, The Hong Kong Polytechnic University; (2) College of Computer Science and Technology, Zhejiang University; (3) Sea AI Lab
Pseudocode | No | The paper describes the proposed method, SWIFT, through textual descriptions and illustrative figures (e.g., Figure 3, Figure 4, Figure 5) but does not include a dedicated pseudocode or algorithm block.
Open Source Code | Yes | "We release our code in https://github.com/hemingkx/SWIFT."
Open Datasets | Yes | "We conducted experiments using LLaMA-2-13B across the CNN/Daily Mail (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), and TinyStories (Eldan & Li, 2023) datasets. ... The evaluation datasets include CNN/Daily Mail (CNN/DM) (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), TinyStories (Eldan & Li, 2023), and HumanEval (Chen et al., 2021)."
Dataset Splits | Yes | "We randomly sample 1000 instances from the test set for each dataset except HumanEval. The maximum generation lengths for HumanEval and all analyses are set to 512. ... We perform 1-shot evaluation for CNN/DM and TinyStories, and 5-shot evaluation for GSM8K."
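The sampling step described above (1000 random instances from each test split) can be sketched as follows. This is a minimal illustration, not the authors' evaluation harness; the function name, seed, and dataset representation are assumptions, and only the sample size of 1000 comes from the paper.

```python
import random


def sample_eval_instances(test_set, n=1000, seed=0):
    """Randomly sample n instances from a dataset's test split.

    Mirrors the paper's setup of evaluating on 1000 randomly sampled
    test instances per dataset (except HumanEval, which is used whole).
    A fixed seed is assumed here for repeatability; the paper does not
    specify one.
    """
    rng = random.Random(seed)
    items = list(test_set)
    if len(items) <= n:
        # Small benchmarks (e.g. HumanEval) are used in full.
        return items
    return rng.sample(items, n)
```

With a fixed seed, repeated runs select the same evaluation subset, which keeps speedup comparisons across methods on identical inputs.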
Hardware Specification | Yes | "All experiments were conducted using PyTorch 2.1.0 on 4 NVIDIA RTX A6000 GPU (40GB) with CUDA 12.1, and an Intel(R) Xeon(R) Platinum 8370C CPU with 32 cores."
Software Dependencies | Yes | "All experiments were conducted using PyTorch 2.1.0 ... with CUDA 12.1." "Inference for our method and all baselines was performed using the Huggingface transformers package."
Experiment Setup | Yes | "The context window γ is set to 32. The maximum draft length ND is set to 25. For random sampling in code generation tasks, we apply a temperature of 0.6 and top p = 0.95. The maximum number of layer set optimization steps S is set to 1000, with Bayesian optimization performed every β = 25 steps. The optimization phase is set to be early stopped if the matchness score does not improve after 300 steps or exceeds 0.95. The layer skip ratio r is fixed at 0.45 for the 13B model and 0.5 for the 34B and 70B models."
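The hyperparameters quoted above can be collected into a single configuration sketch. The numeric values (γ = 32, N_D = 25, temperature 0.6, top-p 0.95, S = 1000, β = 25, patience 300, matchness threshold 0.95, skip ratios 0.45/0.5) are taken from the paper; the field names, the `SwiftConfig` class itself, and the layer-count arithmetic are illustrative assumptions, not the authors' actual code.

```python
from dataclasses import dataclass


@dataclass
class SwiftConfig:
    """Hypothetical container for SWIFT's reported hyperparameters."""
    context_window: int = 32          # gamma
    max_draft_length: int = 25        # N_D
    temperature: float = 0.6          # random sampling, code generation
    top_p: float = 0.95
    max_opt_steps: int = 1000         # S
    bayes_opt_interval: int = 25      # beta
    early_stop_patience: int = 300    # steps without matchness improvement
    matchness_threshold: float = 0.95 # early-stop ceiling

    def skip_ratio(self, model_size_b: int) -> float:
        # r = 0.45 for the 13B model, 0.5 for 34B and 70B (from the paper).
        return 0.45 if model_size_b <= 13 else 0.5

    def num_skipped_layers(self, num_layers: int, model_size_b: int) -> int:
        # Illustrative arithmetic: LLaMA-2-13B has 40 decoder layers,
        # so r = 0.45 implies 0.45 * 40 = 18 layers skipped when drafting.
        return round(self.skip_ratio(model_size_b) * num_layers)
```

For example, `SwiftConfig().num_skipped_layers(40, 13)` yields 18 skipped layers for the 13B model, consistent with the fixed ratio r = 0.45.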