SkipGPT: Each Token is One of a Kind

Authors: Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Jing Xiong, Zhiwei Fei, Hui Su, Xiaoyu Shen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that SkipGPT reduces over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks.
Researcher Affiliation | Collaboration | 1) Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo; 2) Southwest Jiaotong University; 3) Tencent Inc.; 4) Shanghai Jiao Tong University; 5) The University of Hong Kong; 6) Nanjing University; 7) Meituan Inc.
Pseudocode | Yes | Algorithm 1: Training Process of SkipGPT
Open Source Code | Yes | Our code is publicly available at: https://github.com/EIT-NLP/SkipGPT.
Open Datasets | Yes | During both router and LoRA tuning, we use the RedPajama-Data-1T-Sample dataset (Computer, 2023), which contains 850,000 samples (1 billion tokens) truncated to 4096 tokens each. This dataset serves two roles: (1) as a calibration set (100 random samples) to compute block-level significance for pruning redundant layers (static methods), and (2) as a training set for dynamic methods and for recovering static method performance (the specific details of static and dynamic methods will be introduced later in Section 4.2). Additionally, we report zero-shot PPL scores on WikiText2 (WT2) (Merity et al., 2016) and PTB (Marcus et al., 1993).
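The data handling described above (truncation to 4096 tokens, plus a 100-sample random calibration subset for scoring block significance) can be sketched as follows. This is a minimal illustration with toy token-id lists, not the paper's actual pipeline; the function names and the toy corpus are assumptions.

```python
import random

MAX_LEN = 4096       # truncation length reported in the paper
CALIB_SAMPLES = 100  # calibration subset size used for static pruning

def truncate(token_ids, max_len=MAX_LEN):
    """Clip a tokenized sample to at most max_len tokens."""
    return token_ids[:max_len]

def draw_calibration_set(dataset, k=CALIB_SAMPLES, seed=0):
    """Randomly pick k samples to estimate block-level significance on."""
    rng = random.Random(seed)
    return rng.sample(dataset, k)

# Toy corpus standing in for tokenized RedPajama samples of varying length.
corpus = [list(range(n)) for n in (10, 5000, 4096, 8192)]
truncated = [truncate(x) for x in corpus]
print([len(x) for x in truncated])  # short samples pass through; long ones are clipped
```

In the paper the same dataset serves both roles: the small calibration subset feeds the static pruning criterion, while the full sample set is used for router/LoRA training.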
Dataset Splits | No | The paper uses the RedPajama-Data-1T-Sample dataset for training and tuning and mentions a "calibration set (100 random samples)", but it does not explicitly provide train/validation/test splits for this dataset or state how the evaluation benchmarks (WT2, PTB) were split, beyond implicitly using their standard zero-shot evaluation sets.
Hardware Specification | Yes | In practice, the tuning process requires only a single A800 (80GB) GPU and completes within four hours.
Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python, PyTorch, or CUDA versions) used for the experiments.
Experiment Setup | Yes | Each model is trained for 10,000 steps with next-token prediction, using a batch size of 16 in both router and LoRA tuning. In the router tuning stage, we use a constant learning rate of 2e-3. Additionally, the softmax temperature τ of the Gumbel-Softmax is linearly annealed from 5 to 1. In the LoRA tuning stage, the learning rate is set to 2e-4 with a warmup ratio of 0.1 and a cosine learning rate scheduler. The AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9 and β2 = 0.95 is used for gradient backpropagation. The hyperparameter α controls the strength of the sparsity penalty in the overall loss function... a value of 8 works well.
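The schedules in this setup row can be made concrete with a short sketch: linear annealing of the Gumbel-Softmax temperature from 5 to 1 over the 10,000 training steps, the cosine learning-rate schedule with a 0.1 warmup ratio for LoRA tuning, and a relaxed one-hot sample from router logits. This is an illustrative reconstruction under stated assumptions (the function names and the plain-Python Gumbel sampling are my own), not the authors' implementation.

```python
import math
import random

TOTAL_STEPS = 10_000
TAU_START, TAU_END = 5.0, 1.0       # Gumbel-Softmax temperature endpoints
LR_PEAK, WARMUP_RATIO = 2e-4, 0.1   # LoRA-tuning schedule from the paper

def tau_at(step):
    """Linearly anneal the Gumbel-Softmax temperature from 5 down to 1."""
    frac = min(step / TOTAL_STEPS, 1.0)
    return TAU_START + frac * (TAU_END - TAU_START)

def cosine_lr(step):
    """Cosine decay with linear warmup over the first 10% of steps."""
    warmup = int(WARMUP_RATIO * TOTAL_STEPS)
    if step < warmup:
        return LR_PEAK * step / warmup
    progress = (step - warmup) / (TOTAL_STEPS - warmup)
    return 0.5 * LR_PEAK * (1.0 + math.cos(math.pi * progress))

def gumbel_softmax(logits, tau, rng=random):
    """Sample a relaxed one-hot vector from router logits at temperature tau."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)            # avoid log(0)
        g = -math.log(-math.log(u))             # standard Gumbel noise
        noisy.append((logit + g) / tau)
    m = max(noisy)                              # stabilized softmax
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

As the temperature falls toward 1, the sampled vectors concentrate mass on a single entry, which is what lets the router's soft skip decisions harden into discrete ones by the end of training.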