AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference

Authors: Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of AdaSkip is demonstrated through extensive experiments on various long-context benchmarks and models, showing superior inference performance over existing baselines. The paper includes tables such as "Evaluation of different skipping strategies" (Tables 3 and 4), which present quantitative results for Doc QA, Few-shot Learning, and Text Summarization tasks, including F1, ACC, and Rouge-L scores, along with speedup metrics.
Researcher Affiliation | Collaboration | The authors are affiliated with "1Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University" (academia), "2Huawei Cloud" (industry), and "3School of Computing, National University of Singapore" (academia), indicating a collaboration between academic institutions and an industry entity.
Pseudocode | No | The paper describes the methodology, including "Sublayer Skipping during Prefilling with Offline Importance Learning" and "Extra FFN Sublayer Skipping during Decoding with Online Importance Learning," in descriptive text only. It does not contain any structured pseudocode blocks or explicitly labeled algorithms.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | The paper explicitly names and cites several well-known datasets used for experiments: "MultiFieldQA (Bai et al. 2023), TriviaQA (Joshi et al. 2017), and TREC (Li and Roth 2002) as prefilling tasks" and "GovReport (Huang et al. 2021) and MultiNews (Fabbri et al. 2019) as decoding tasks."
Dataset Splits | No | The paper categorizes benchmarks into prefilling and decoding tasks with specified average input/output lengths, and describes online learning over the initial few tokens (window sizes) of each sequence. However, it does not provide training, validation, or test splits (e.g., percentages, sample counts, or references to standard splits) for the datasets used in the main experiments.
Hardware Specification | Yes | The paper explicitly states the hardware used for experiments: "A single L20 GPU with CUDA version 12.1 is used as the testbed."
Software Dependencies | Yes | The paper specifies one software dependency with a version number: "CUDA version 12.1."
Experiment Setup | Yes | The paper provides specific experimental setup details, including: "an acceleration ratio, α, as a knob to control this trade-off," the calculation for "the number of sublayers to be skipped, m," defining "the first P decoded tokens as online learning windows," deriving "a threshold β," and specifying "Three of the latest and widely adopted long-context LLMs are tested: LLaMA3.1-8B-128k, InternLM-7B-8k, and Vicuna-v1.5-7B-16k." It also mentions "output lengths capped at 32" and "output lengths limited to 512."
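The skipping setup described in the row above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the mapping from the acceleration ratio α to the number of skipped sublayers m, and the importance scores themselves, are assumptions made here for clarity, since this report does not quote the paper's exact formulas.

```python
import math

def num_sublayers_to_skip(alpha, total_sublayers):
    """Assumed mapping (illustrative): skip a fraction alpha of all sublayers."""
    return math.floor(alpha * total_sublayers)

def forward_with_skipping(x, sublayers, importance, num_to_skip):
    """Apply residual sublayers in order, bypassing the least important ones.

    `sublayers` stands in for a model's attention/FFN sublayers; `importance`
    is a per-sublayer score (offline- or online-learned in the paper).
    """
    # Rank sublayer indices by importance; the lowest-scoring ones are skipped.
    skip_set = set(sorted(range(len(sublayers)),
                          key=lambda i: importance[i])[:num_to_skip])
    for i, f in enumerate(sublayers):
        if i in skip_set:
            continue  # identity shortcut: hidden state passes through unchanged
        x = x + f(x)  # residual connection around the kept sublayer
    return x
```

For example, with α = 0.25 and 64 sublayers, this assumed mapping would skip 16 of them; an online variant could instead compare each score against the threshold β estimated over the first P decoded tokens.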