Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

Authors: Yong Guo, Shulian Zhang, Haolin Pan, Jing Liu, Yulun Zhang, Jian Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, GPD significantly outperforms existing distillation methods on top of both CNNs and transformers, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to scenarios without a pre-trained teacher, including training from scratch and fine-tuning, yielding large improvements of 1.80% and 0.89% on ResNet18, respectively. The code is available at https://github.com/guoyongcs/GPD. 4 EXPERIMENTS. Table 1: Comparison of the performance of various distillation methods across different architectures. "–" denotes a result that is not reported. A → B indicates a teacher model A distilling knowledge to a student model B. GPD consistently enhances the performance of standard distillation methods across diverse architectures.
Researcher Affiliation | Academia | ¹Max Planck Institute for Informatics, ²South China University of Technology, ³Monash University, ⁴Shanghai Jiao Tong University
Pseudocode | Yes | Algorithm 1: Training process of Gap Preserving Distillation (GPD). Input: student S, static teacher Ts, epochs N, step size η, model parameters W, weight of standard knowledge distillation loss λ, knowledge function ψ(·), training data (x, y)
Open Source Code | Yes | The code is available at https://github.com/guoyongcs/GPD.
Open Datasets | Yes | Comprehensive experiments on the ImageNet dataset validate the effectiveness of GPD in boosting the performance of standard knowledge distillation methods across various backbone architectures.
Dataset Splits | Yes | For convolutional neural networks, we strictly follow the settings from Zhao et al. (2022a); Chen et al. (2021b). E.1 DISTILLATION WITH A STATIC TEACHER: In this experiment, we adopt the standard data pre-processing pipeline, including random cropping, resizing to 224×224, random horizontal flipping, and normalization. E.2 TRAIN FROM SCRATCH: In this experiment, we train ResNet18 and MobileNet for 100 epochs, and RVT-Ti for 300 epochs. E.3 MODEL FINE-TUNING: In this part, we begin with pre-trained student models and aim to further improve their performance through our proposed approach. The pre-trained models are fine-tuned for 50 epochs, with the initial learning rate set to 0.1× the initial learning rate used during the pretraining stage.
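The fine-tuning rule quoted above (initial learning rate set to 0.1× the pretraining learning rate, for 50 epochs) can be sketched as a minimal stdlib helper; the function name is illustrative, not from the paper's code.

```python
def finetune_initial_lr(pretrain_lr: float) -> float:
    """Initial learning rate for the 50-epoch fine-tuning stage (Sec. E.3):
    0.1x the initial learning rate used during pretraining."""
    return 0.1 * pretrain_lr
```

For example, with the paper's default pretraining learning rate of 0.1, fine-tuning would start at 0.01.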
Hardware Specification | Yes | For convolutional neural networks, the batch size is set to 256 on 4 Nvidia Tesla V100 GPUs, while for vision transformers (ViTs), the batch size is set to 256 on 8 Nvidia Tesla V100 GPUs. Table 12: Comparison of computation cost on the experiment of ResNet50 → MobileNet distillation. We measure the training time on 4 A100 GPUs with a batch size of 512 on ImageNet.
Software Dependencies | No | The paper mentions employing the SGD optimizer but does not specify software libraries like PyTorch or TensorFlow, nor their version numbers.
Experiment Setup | Yes | By default, we employ the SGD optimizer with an initial learning rate of 0.1 and a momentum of 0.9. For convolutional neural networks, the batch size is set to 256 on 4 Nvidia Tesla V100 GPUs, while for vision transformers (ViTs), the batch size is set to 256 on 8 Nvidia Tesla V100 GPUs. The models are trained for 100 epochs with a learning rate decay factor of 0.1 applied every 30 epochs. The weight decay is set to 1e-4, and the weight for the KD loss between the student and the dynamic teacher is set to 3.0, while the weights for the other KD losses and the cross-entropy loss are both set to 1.0.
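The quoted setup fully determines the step-decay learning-rate schedule and the loss weighting. A minimal stdlib sketch of both follows; since the paper names no framework, these are framework-agnostic helpers with illustrative names, not code from the paper.

```python
def learning_rate(epoch: int, base_lr: float = 0.1,
                  decay: float = 0.1, step: int = 30) -> float:
    """SGD learning rate under the reported schedule: start at 0.1 and
    multiply by 0.1 every 30 epochs (100 epochs total)."""
    return base_lr * decay ** (epoch // step)


def total_loss(ce: float, kd_dynamic: float, kd_static: float,
               w_ce: float = 1.0, w_dynamic: float = 3.0,
               w_static: float = 1.0) -> float:
    """Weighted training objective per the quoted setup: weight 3.0 on the
    student/dynamic-teacher KD loss, 1.0 on the other KD losses and on
    the cross-entropy loss."""
    return w_ce * ce + w_dynamic * kd_dynamic + w_static * kd_static
```

For instance, `learning_rate(45)` falls in the second decay interval and evaluates to 0.01, and `total_loss(1.0, 1.0, 1.0)` evaluates to 5.0.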