SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Authors: Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments across a wide range of models and downstream tasks demonstrate that SWIFT can achieve over a 1.3×–1.6× speedup while preserving the original distribution of the generated text."
Researcher Affiliation | Collaboration | Heming Xia(1), Yongqi Li(1), Jun Zhang(2), Cunxiao Du(3), Wenjie Li(1) — (1) Department of Computing, The Hong Kong Polytechnic University; (2) College of Computer Science and Technology, Zhejiang University; (3) Sea AI Lab
Pseudocode | No | The paper describes the proposed method, SWIFT, through textual descriptions and illustrative figures (e.g., Figure 3, Figure 4, Figure 5) but does not include a dedicated pseudocode or algorithm block.
Open Source Code | Yes | "We release our code in https://github.com/hemingkx/SWIFT."
Open Datasets | Yes | "We conducted experiments using LLaMA-2-13B across the CNN/Daily Mail (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), and TinyStories (Eldan & Li, 2023) datasets. ... The evaluation datasets include CNN/Daily Mail (CNN/DM) (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), TinyStories (Eldan & Li, 2023), and HumanEval (Chen et al., 2021)."
Dataset Splits | Yes | "We randomly sample 1000 instances from the test set for each dataset except HumanEval. The maximum generation lengths for HumanEval and all analyses are set to 512. ... We perform 1-shot evaluation for CNN/DM and TinyStories, and 5-shot evaluation for GSM8K."
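The sampling step described above (1000 random instances from each test split) can be sketched as follows. This is a minimal illustration, not the authors' evaluation harness; the function name, seed, and dataset representation are assumptions, and only the sample size of 1000 comes from the paper.

```python
import random


def sample_eval_instances(test_set, n=1000, seed=0):
    """Randomly sample n instances from a dataset's test split.

    Mirrors the paper's setup of evaluating on 1000 randomly sampled
    test instances per dataset (except HumanEval, which is used whole).
    A fixed seed is assumed here for repeatability; the paper does not
    specify one.
    """
    rng = random.Random(seed)
    items = list(test_set)
    if len(items) <= n:
        # Small benchmarks (e.g. HumanEval) are used in full.
        return items
    return rng.sample(items, n)
```

With a fixed seed, repeated runs select the same evaluation subset, which keeps speedup comparisons across methods on identical inputs.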
Hardware Specification | Yes | "All experiments were conducted using PyTorch 2.1.0 on 4 NVIDIA RTX A6000 GPU (40GB) with CUDA 12.1, and an Intel(R) Xeon(R) Platinum 8370C CPU with 32 cores."
Software Dependencies | Yes | "All experiments were conducted using PyTorch 2.1.0 ... with CUDA 12.1." "Inference for our method and all baselines was performed using the Huggingface transformers package."
Experiment Setup | Yes | "The context window γ is set to 32. The maximum draft length ND is set to 25. For random sampling in code generation tasks, we apply a temperature of 0.6 and top p = 0.95. The maximum number of layer set optimization steps S is set to 1000, with Bayesian optimization performed every β = 25 steps. The optimization phase is set to be early stopped if the matchness score does not improve after 300 steps or exceeds 0.95. The layer skip ratio r is fixed at 0.45 for the 13B model and 0.5 for the 34B and 70B models."
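The hyperparameters quoted above can be collected into a single configuration sketch. The numeric values (γ = 32, N_D = 25, temperature 0.6, top-p 0.95, S = 1000, β = 25, patience 300, matchness threshold 0.95, skip ratios 0.45/0.5) are taken from the paper; the field names, the `SwiftConfig` class itself, and the layer-count arithmetic are illustrative assumptions, not the authors' actual code.

```python
from dataclasses import dataclass


@dataclass
class SwiftConfig:
    """Hypothetical container for SWIFT's reported hyperparameters."""
    context_window: int = 32          # gamma
    max_draft_length: int = 25        # N_D
    temperature: float = 0.6          # random sampling, code generation
    top_p: float = 0.95
    max_opt_steps: int = 1000         # S
    bayes_opt_interval: int = 25      # beta
    early_stop_patience: int = 300    # steps without matchness improvement
    matchness_threshold: float = 0.95 # early-stop ceiling

    def skip_ratio(self, model_size_b: int) -> float:
        # r = 0.45 for the 13B model, 0.5 for 34B and 70B (from the paper).
        return 0.45 if model_size_b <= 13 else 0.5

    def num_skipped_layers(self, num_layers: int, model_size_b: int) -> int:
        # Illustrative arithmetic: LLaMA-2-13B has 40 decoder layers,
        # so r = 0.45 implies 0.45 * 40 = 18 layers skipped when drafting.
        return round(self.skip_ratio(model_size_b) * num_layers)
```

For example, `SwiftConfig().num_skipped_layers(40, 13)` yields 18 skipped layers for the 13B model, consistent with the fixed ratio r = 0.45.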