Enhancing Logits Distillation with Plug&Play Kendall’s $\tau$ Ranking Loss

Authors: Yuchen Guan, Runxi Cheng, Kang Liu, Chun Yuan

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines." |
| Researcher Affiliation | Academia | "Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China. Correspondence to: Kang Liu <EMAIL>, Chun Yuan <EMAIL>." |
| Pseudocode | Yes | "A.4. Algorithm — Algorithm 1: Plug-and-Play Ranking Loss for Logit Distillation" |
| Open Source Code | Yes | "Code is available at https://github.com/OvernighTea/RankingLoss-KD" |
| Open Datasets | Yes | "1) CIFAR-100 (Krizhevsky et al., 2009) is a significant dataset for image classification, comprising 100 categories, with 50,000 training images and 10,000 test images. 2) ImageNet (Russakovsky et al., 2015) is a large-scale dataset utilized for image classification, comprising 1,000 categories, with approximately 1.28 million training images and 50,000 test images. 3) MS-COCO (Lin et al., 2014) is a mainstream dataset for object detection comprising 80 categories, with 118,000 training images and 5,000 test images." |
| Dataset Splits | Yes | Same passage as above: CIFAR-100 uses 50,000 training / 10,000 test images; ImageNet uses approximately 1.28 million training / 50,000 test images; MS-COCO uses 118,000 training / 5,000 test images. |
| Hardware Specification | Yes | "We utilize 1 NVIDIA GeForce RTX 4090 to train models on CIFAR-100 and 4 NVIDIA GeForce RTX 4090 for training on ImageNet." |
| Software Dependencies | No | "We employ SGD (Sutskever et al., 2013) as the optimizer... We use the AdamW optimizer..." |
| Experiment Setup | Yes | "We set the batch size to 64 for CIFAR-100, 512 for ImageNet and 8 for COCO. We employ SGD (Sutskever et al., 2013) as the optimizer, with the number of epochs and learning rate settings consistent with the comparative baselines. The hyper-parameters α, β in Eq. 6 are set to be the same as the compared baselines to maintain fairness, and γ is set equal to α. ... We use the AdamW optimizer and train for 300 epochs with an initial learning rate of 5e-4 and a weight decay of 0.05. The minimum learning rate is 5e-6, and the patch size is 16. We set α = 1, β = 1, γ = 0.5, and batch size is 128." |
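The paper's Algorithm 1 is not reproduced in this report, but the general idea of a plug-and-play Kendall's τ ranking loss can be sketched. The snippet below is an illustrative assumption, not the authors' exact formulation: it replaces the non-differentiable sign of each pairwise logit difference with `tanh`, so that the student is penalized when its class ranking disagrees with the teacher's. The function name, `temperature` parameter, and `tanh` surrogate are all hypothetical choices for this sketch.

```python
import torch


def soft_kendall_tau_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """Differentiable surrogate for Kendall's tau rank correlation.

    For every ordered pair of classes (i, j), the sign of the logit
    difference encodes their relative ranking; tanh serves as a smooth
    stand-in for sign(). The loss is 1 - (soft concordance), so it is
    minimized when the student orders classes like the teacher.

    Both inputs have shape (batch, num_classes).
    """
    # Pairwise logit differences, shape (batch, C, C):
    # s_diff[b, i, j] = student_logits[b, i] - student_logits[b, j]
    s_diff = student_logits.unsqueeze(2) - student_logits.unsqueeze(1)
    t_diff = teacher_logits.unsqueeze(2) - teacher_logits.unsqueeze(1)

    # Soft agreement in [-1, 1] for each pair; the diagonal (i == j)
    # contributes exactly 0 because tanh(0) = 0.
    concordance = torch.tanh(s_diff / temperature) * torch.tanh(t_diff / temperature)

    num_classes = student_logits.size(1)
    pair_count = num_classes * (num_classes - 1)  # off-diagonal pairs
    tau = concordance.sum(dim=(1, 2)) / pair_count  # per-sample soft tau in (-1, 1)
    return (1.0 - tau).mean()
```

Per the experiment-setup row, such a term would be weighted by γ and added to a baseline distillation objective whose other terms carry weights α and β (Eq. 6 in the paper); because it only consumes logits, it can be attached to any logit-based distillation baseline without architectural changes.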