SkipGPT: Each Token is One of a Kind

Authors: Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Jing Xiong, Zhiwei Fei, Hui Su, Xiaoyu Shen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that SkipGPT reduces over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks.
Researcher Affiliation | Collaboration | 1) Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo; 2) Southwest Jiaotong University; 3) Tencent Inc.; 4) Shanghai Jiao Tong University; 5) The University of Hong Kong; 6) Nanjing University; 7) Meituan Inc.
Pseudocode | Yes | Algorithm 1: Training Process of SkipGPT
Open Source Code | Yes | Our code is publicly available at: https://github.com/EIT-NLP/SkipGPT.
Open Datasets | Yes | During both router and LoRA tuning, we use the RedPajama-Data-1T-Sample dataset (Computer, 2023), which contains 850,000 samples (1 billion tokens) truncated to 4096 tokens each. This dataset serves two roles: (1) as a calibration set (100 random samples) to compute block-level significance for pruning redundant layers (static methods), and (2) as a training set for dynamic methods and for recovering static method performance (the specific details of static and dynamic methods will be introduced later in Section 4.2). Additionally, we report zero-shot PPL scores on WikiText2 (WT2) (Merity et al., 2016) and PTB (Marcus et al., 1993).
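The data handling described above (truncation to 4096 tokens, plus a 100-sample random calibration subset for scoring block significance) can be sketched as follows. This is a minimal illustration with toy token-id lists, not the paper's actual pipeline; the function names and the toy corpus are assumptions.

```python
import random

MAX_LEN = 4096       # truncation length reported in the paper
CALIB_SAMPLES = 100  # calibration subset size used for static pruning

def truncate(token_ids, max_len=MAX_LEN):
    """Clip a tokenized sample to at most max_len tokens."""
    return token_ids[:max_len]

def draw_calibration_set(dataset, k=CALIB_SAMPLES, seed=0):
    """Randomly pick k samples to estimate block-level significance on."""
    rng = random.Random(seed)
    return rng.sample(dataset, k)

# Toy corpus standing in for tokenized RedPajama samples of varying length.
corpus = [list(range(n)) for n in (10, 5000, 4096, 8192)]
truncated = [truncate(x) for x in corpus]
print([len(x) for x in truncated])  # short samples pass through; long ones are clipped
```

In the paper the same dataset serves both roles: the small calibration subset feeds the static pruning criterion, while the full sample set is used for router/LoRA training.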
Dataset Splits | No | The paper uses the RedPajama-Data-1T-Sample dataset for training and tuning and mentions a "calibration set (100 random samples)", but it does not explicitly provide train/validation/test splits for this dataset or state how the evaluation benchmarks (WT2, PTB) were split, beyond implicitly using their standard zero-shot evaluation sets.
Hardware Specification | Yes | In practice, the tuning process requires only a single A800 (80GB) GPU and completes within four hours.
Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python, PyTorch, or CUDA versions) used for the experiments.
Experiment Setup | Yes | Each model is trained for 10,000 steps with next-token prediction, using a batch size of 16 in both router and LoRA tuning. In the router tuning stage, we use a constant learning rate of 2e-3. Additionally, the softmax temperature τ of the Gumbel-Softmax is linearly annealed from 5 to 1. In the LoRA tuning stage, the learning rate is set to 2e-4 with a warmup ratio of 0.1 and a cosine learning rate scheduler. The AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9 and β2 = 0.95 is used for gradient backpropagation. The hyperparameter α controls the strength of the sparsity penalty in the overall loss function... a value of 8 works well.
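The schedules in this setup row can be made concrete with a short sketch: linear annealing of the Gumbel-Softmax temperature from 5 to 1 over the 10,000 training steps, the cosine learning-rate schedule with a 0.1 warmup ratio for LoRA tuning, and a relaxed one-hot sample from router logits. This is an illustrative reconstruction under stated assumptions (the function names and the plain-Python Gumbel sampling are my own), not the authors' implementation.

```python
import math
import random

TOTAL_STEPS = 10_000
TAU_START, TAU_END = 5.0, 1.0       # Gumbel-Softmax temperature endpoints
LR_PEAK, WARMUP_RATIO = 2e-4, 0.1   # LoRA-tuning schedule from the paper

def tau_at(step):
    """Linearly anneal the Gumbel-Softmax temperature from 5 down to 1."""
    frac = min(step / TOTAL_STEPS, 1.0)
    return TAU_START + frac * (TAU_END - TAU_START)

def cosine_lr(step):
    """Cosine decay with linear warmup over the first 10% of steps."""
    warmup = int(WARMUP_RATIO * TOTAL_STEPS)
    if step < warmup:
        return LR_PEAK * step / warmup
    progress = (step - warmup) / (TOTAL_STEPS - warmup)
    return 0.5 * LR_PEAK * (1.0 + math.cos(math.pi * progress))

def gumbel_softmax(logits, tau, rng=random):
    """Sample a relaxed one-hot vector from router logits at temperature tau."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)            # avoid log(0)
        g = -math.log(-math.log(u))             # standard Gumbel noise
        noisy.append((logit + g) / tau)
    m = max(noisy)                              # stabilized softmax
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

As the temperature falls toward 1, the sampled vectors concentrate mass on a single entry, which is what lets the router's soft skip decisions harden into discrete ones by the end of training.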