AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

Authors: Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of AdaSkip is demonstrated through extensive experiments on various long-context benchmarks and models, showing superior inference performance over existing baselines. The paper includes tables such as "Evaluation of different skipping strategies" (Tables 3 and 4), which present quantitative results for Doc QA, Few-shot Learning, and Text Summarization tasks, including F1, ACC, and Rouge-L scores, along with speedup metrics.
Researcher Affiliation | Collaboration | The authors are affiliated with "1Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University" (academia), "2Huawei Cloud" (industry), and "3School of Computing, National University of Singapore" (academia), indicating a collaboration between academic institutions and an industry entity.
Pseudocode | No | The paper describes the methodology, including "Sublayer Skipping during Prefilling with Offline Importance Learning" and "Extra FFN Sublayer Skipping during Decoding with Online Importance Learning," in descriptive text only. It does not contain any structured pseudocode blocks or explicitly labeled algorithms.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | The paper explicitly names and cites several well-known datasets used for experiments: "MultiFieldQA (Bai et al. 2023), TriviaQA (Joshi et al. 2017), and TREC (Li and Roth 2002) as prefilling tasks" and "GovReport (Huang et al. 2021) and MultiNews (Fabbri et al. 2019) as decoding tasks."
Dataset Splits | No | The paper categorizes benchmarks into prefilling and decoding tasks with specified average input/output lengths, and describes online learning over the initial few tokens (window sizes) of each sequence. However, it does not provide training, validation, or test splits (e.g., percentages, sample counts, or references to standard splits) for the datasets used in the main experiments.
Hardware Specification | Yes | The paper explicitly states the hardware used for experiments: "A single L20 GPU with CUDA version 12.1 is used as the testbed."
Software Dependencies | Yes | The paper specifies one software dependency with a version number: "CUDA version 12.1."
Experiment Setup | Yes | The paper provides specific experimental setup details, including: "an acceleration ratio, α, as a knob to control this trade-off," the calculation for "the number of sublayers to be skipped, m," defining "the first P decoded tokens as online learning windows," deriving "a threshold β," and specifying "Three of the latest and widely adopted long-context LLMs are tested: LLaMA3.1-8B-128k, InternLM-7B-8k, and Vicuna-v1.5-7B-16k." It also mentions "output lengths capped at 32" and "output lengths limited to 512."
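The skipping setup described in the row above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the mapping from the acceleration ratio α to the number of skipped sublayers m, and the importance scores themselves, are assumptions made here for clarity, since this report does not quote the paper's exact formulas.

```python
import math

def num_sublayers_to_skip(alpha, total_sublayers):
    """Assumed mapping (illustrative): skip a fraction alpha of all sublayers."""
    return math.floor(alpha * total_sublayers)

def forward_with_skipping(x, sublayers, importance, num_to_skip):
    """Apply residual sublayers in order, bypassing the least important ones.

    `sublayers` stands in for a model's attention/FFN sublayers; `importance`
    is a per-sublayer score (offline- or online-learned in the paper).
    """
    # Rank sublayer indices by importance; the lowest-scoring ones are skipped.
    skip_set = set(sorted(range(len(sublayers)),
                          key=lambda i: importance[i])[:num_to_skip])
    for i, f in enumerate(sublayers):
        if i in skip_set:
            continue  # identity shortcut: hidden state passes through unchanged
        x = x + f(x)  # residual connection around the kept sublayer
    return x
```

For example, with α = 0.25 and 64 sublayers, this assumed mapping would skip 16 of them; an online variant could instead compare each score against the threshold β estimated over the first P decoded tokens.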