SkipGPT: Each Token is One of a Kind
Authors: Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Jing Xiong, Zhiwei Fei, Hui Su, Xiaoyu Shen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that SkipGPT reduces over 40% model parameters while matching or exceeding the performance of the original dense model across benchmarks. |
| Researcher Affiliation | Collaboration | 1 Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo; 2 Southwest Jiaotong University; 3 Tencent Inc.; 4 Shanghai Jiao Tong University; 5 The University of Hong Kong; 6 Nanjing University; 7 Meituan Inc. |
| Pseudocode | Yes | Algorithm 1 Training Process of SkipGPT |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/EIT-NLP/SkipGPT. |
| Open Datasets | Yes | During both router and LoRA tuning, we use the RedPajama-Data-1T-Sample dataset (Computer, 2023), which contains 850,000 samples (1 billion tokens) truncated to 4096 tokens each. This dataset serves two roles: (1) as a calibration set (100 random samples) to compute block-level significance for pruning redundant layers (static methods), and (2) as a training set for dynamic methods and for recovering static method performance (the specific details of static and dynamic methods will be introduced later in Section 4.2). Additionally, we report zero-shot PPL scores on WikiText2 (WT2) (Merity et al., 2016) and PTB (Marcus et al., 1993). |
| Dataset Splits | No | The paper uses the RedPajama-Data-1T-Sample dataset for training and tuning and mentions a 'calibration set (100 random samples)'. It does not explicitly provide full train/validation/test splits for this dataset, nor state how the evaluation benchmarks (WT2, PTB) were split, beyond their standard zero-shot evaluation sets. |
| Hardware Specification | Yes | In practice, the tuning process requires only a single A800 (80GB) GPU and completes within four hours. |
| Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python, PyTorch, CUDA versions) used for the experiments. |
| Experiment Setup | Yes | Each model is trained for 10,000 steps with next-token prediction, using a batch size of 16 in both router and LoRA tuning. In the router tuning stage, we use a constant learning rate of 2e-3. Additionally, the softmax temperature τ of the Gumbel-Softmax is linearly annealed from 5 to 1. In the LoRA tuning stage, the learning rate is set to 2e-4 with a warmup ratio of 0.1 and a cosine learning rate scheduler. The AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9 and β2 = 0.95 is used for gradient backpropagation. The hyperparameter α controls the strength of the sparsity penalty in the overall loss function... a value of 8 works well. |
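The router-tuning recipe quoted above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the two-way keep/skip logits, the router shape, and the helper names (`tau_at`, `route`) are assumptions; only the schedule itself (τ linearly annealed from 5 to 1 over 10,000 steps, differentiable decisions via Gumbel-Softmax) comes from the paper.

```python
import torch
import torch.nn.functional as F

TOTAL_STEPS = 10_000          # training steps, per the paper
TAU_START, TAU_END = 5.0, 1.0 # linear temperature anneal, per the paper

def tau_at(step: int) -> float:
    """Linearly annealed Gumbel-Softmax temperature (hypothetical helper)."""
    frac = min(step / TOTAL_STEPS, 1.0)
    return TAU_START + frac * (TAU_END - TAU_START)

def route(logits: torch.Tensor, step: int, hard: bool = True) -> torch.Tensor:
    """Sample a differentiable per-token routing decision.

    `logits` has shape (..., 2): an assumed keep-vs-skip score pair.
    With hard=True the forward pass is one-hot while gradients flow
    through the soft sample (straight-through estimator).
    """
    return F.gumbel_softmax(logits, tau=tau_at(step), hard=hard)

# Example: router logits for a batch of 4 tokens.
logits = torch.randn(4, 2)
early = route(logits, step=0)      # tau = 5: softer, more exploratory samples
late = route(logits, step=9_999)   # tau ≈ 1: near-discrete keep/skip choices
```

Annealing from a high temperature lets the router explore skip patterns early in training before the decisions sharpen toward discrete choices.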