SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
Authors: Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT can achieve over a 1.3×–1.6× speedup while preserving the original distribution of the generated text. |
| Researcher Affiliation | Collaboration | Heming Xia1, Yongqi Li1, Jun Zhang2, Cunxiao Du3, Wenjie Li1; 1Department of Computing, The Hong Kong Polytechnic University; 2College of Computer Science and Technology, Zhejiang University; 3Sea AI Lab |
| Pseudocode | No | The paper describes the proposed method, SWIFT, through textual descriptions and illustrative figures (e.g., Figure 3, Figure 4, Figure 5) but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | We release our code in https://github.com/hemingkx/SWIFT. |
| Open Datasets | Yes | We conducted experiments using LLaMA-2-13B across the CNN/Daily Mail (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), and TinyStories (Eldan & Li, 2023) datasets. ... The evaluation datasets include CNN/Daily Mail (CNN/DM) (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), TinyStories (Eldan & Li, 2023), and HumanEval (Chen et al., 2021). |
| Dataset Splits | Yes | We randomly sample 1000 instances from the test set for each dataset except HumanEval. The maximum generation lengths for HumanEval and all analyses are set to 512. ... We perform 1-shot evaluation for CNN/DM and TinyStories, and 5-shot evaluation for GSM8K. |
| Hardware Specification | Yes | All experiments were conducted using PyTorch 2.1.0 on 4 NVIDIA RTX A6000 GPUs (40GB) with CUDA 12.1, and an Intel(R) Xeon(R) Platinum 8370C CPU with 32 cores. |
| Software Dependencies | Yes | All experiments were conducted using PyTorch 2.1.0 on 4 NVIDIA RTX A6000 GPUs (40GB) with CUDA 12.1, and an Intel(R) Xeon(R) Platinum 8370C CPU with 32 cores. Inference for our method and all baselines was performed using the Huggingface transformers package. |
| Experiment Setup | Yes | The context window γ is set to 32. The maximum draft length ND is set to 25. For random sampling in code generation tasks, we apply a temperature of 0.6 and top-p = 0.95. The maximum number of layer-set optimization steps S is set to 1000, with Bayesian optimization performed every β = 25 steps. The optimization phase is early-stopped if the matchness score does not improve within 300 steps or exceeds 0.95. The layer skip ratio r is fixed at 0.45 for the 13B model and 0.5 for the 34B and 70B models. |
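The hyperparameters quoted above can be collected into a small configuration sketch. This is only an illustration of the reported setup, not the authors' implementation: the `SwiftConfig` name and the `skipped_layers` helper are hypothetical, and the even-spacing layer selection below is a stand-in — SWIFT itself selects the skipped-layer set on the fly via Bayesian optimization of the matchness score.

```python
from dataclasses import dataclass


@dataclass
class SwiftConfig:
    """Hyperparameters as reported in the SWIFT experiment setup.

    Field names are our own; values are from the paper's quoted text.
    """
    context_window: int = 32          # gamma
    max_draft_len: int = 25           # N_D
    temperature: float = 0.6          # random sampling, code generation tasks
    top_p: float = 0.95
    max_opt_steps: int = 1000         # S, layer-set optimization budget
    bayes_interval: int = 25          # beta, Bayesian optimization every 25 steps
    early_stop_patience: int = 300    # stop if matchness does not improve
    early_stop_matchness: float = 0.95
    skip_ratio_13b: float = 0.45      # r for the 13B model
    skip_ratio_large: float = 0.50    # r for the 34B and 70B models


def skipped_layers(num_layers: int, skip_ratio: float) -> list[int]:
    """Toy placeholder: pick an evenly spaced set of layers to skip.

    The real method searches this set with Bayesian optimization; this
    helper only shows what a skip-ratio of r means in layer counts.
    """
    n_skip = round(num_layers * skip_ratio)
    if n_skip == 0:
        return []
    step = num_layers / n_skip
    return sorted({min(num_layers - 1, int(i * step)) for i in range(n_skip)})


cfg = SwiftConfig()
# LLaMA-2-13B has 40 transformer layers; r = 0.45 skips 18 of them.
skip_set = skipped_layers(40, cfg.skip_ratio_13b)
print(len(skip_set))
```

At r = 0.45 on a 40-layer model, 18 layers are skipped during drafting, which is where the reported speedup comes from: the draft forward pass runs roughly half the depth, and the full model only verifies.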