TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

Authors: Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that TOKENSWIFT achieves over 3× speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths.
Researcher Affiliation | Collaboration | (1) State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China; (2) LUMIA Lab, Shanghai Jiao Tong University. Correspondence to: Zilong Zheng <EMAIL>. The affiliations include a 'State Key Laboratory' and 'BIGAI', along with 'Shanghai Jiao Tong University', a known academic institution. This combination of a state-funded research lab and a university indicates a collaborative affiliation.
Pseudocode | Yes | "In summary, the overall flow of our framework is presented in Algorithm 1."
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the code for the methodology described in this paper is open-source or publicly available.
Open Datasets | Yes | "The inference experiments are performed on the test set of PG-19 (Rae et al., 2020). ... We train the model on Wikipedia (20231101.en) and part of C4-en for 1 epoch." Dataset links: https://huggingface.co/datasets/wikimedia/wikipedia and https://huggingface.co/datasets/allenai/c4
Dataset Splits | No | "The inference experiments are performed on the test set of PG-19 (Rae et al., 2020). ... We train linear layers in Section 3.2 using the first 8K tokens of training data, for datasets longer than 8K tokens, from PG-19 (Rae et al., 2020)." While the paper mentions using a 'test set' and 'training data' from PG-19, it does not specify the exact split percentages, sample counts, or the methodology (e.g., random seed, stratified split) used to create these splits for reproducibility.
Hardware Specification | Yes | "Inference is performed on a single NVIDIA A100-SXM4-80GB. ... The model was trained on an NVIDIA A100-SXM4-80GB GPU."
Software Dependencies | No | "optimizer AdamW" (Table 10). While an optimizer is mentioned, no version numbers for this or any other software libraries or frameworks are provided.
Experiment Setup | Yes | "The number of extra decoding heads is set to 3 across all models." Table 10 (additional training details): optimizer AdamW; betas (0.9, 0.999); weight decay 0.1; warmup steps 50; learning rate scheduler cosine; num. GPUs 4; gradient accumulation steps 10. Table 11 (k is the maximum number of retrieved n-grams in token reutilization): LLaMA3.1-8b — k 20, temp. 1.0, top-p 0.9, min-p 0.05, penalty 1.2, penalty len. 1024.
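For a reader attempting reproduction, the hyperparameters scattered across Tables 10 and 11 can be gathered into a single configuration. This is a hedged sketch only: the paper's code is not public, so the field names below are illustrative rather than the authors' actual identifiers.

```python
# Hypothetical configuration mirroring Tables 10 and 11 of the paper.
# All key names are this report's own; only the values come from the paper.

training_config = {
    "optimizer": "AdamW",
    "betas": (0.9, 0.999),
    "weight_decay": 0.1,
    "warmup_steps": 50,
    "lr_scheduler": "cosine",
    "num_gpus": 4,
    "gradient_accumulation_steps": 10,
    "num_extra_decoding_heads": 3,  # "set to 3 across all models"
}

# Per-model inference settings (Table 11); "k" caps the number of
# retrieved n-grams during token reutilization.
sampling_config = {
    "LLaMA3.1-8b": {
        "k": 20,
        "temperature": 1.0,
        "top_p": 0.9,
        "min_p": 0.05,
        "repetition_penalty": 1.2,
        "penalty_length": 1024,
    }
}

# The effective batch size would scale with num_gpus ×
# gradient_accumulation_steps × the per-device batch size, but the
# per-device batch size is not stated in this excerpt.
effective_batch_multiplier = (
    training_config["num_gpus"]
    * training_config["gradient_accumulation_steps"]
)
```

Note that the per-device batch size, learning rate, and library versions are absent from the quoted material, which is consistent with the "Software Dependencies: No" finding above.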