Enabling Visual Foundation Models to Teach Compact Students via Mixture of Distillation

Authors: Xinye Yang, Shang Wang, Li Luking, Yipeng Chen

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on various classification, detection, and medical segmentation tasks validate the effectiveness of our approach with other models."
Researcher Affiliation | Academia | "Xinye Yang (1), Shang Wang (2), Li Luking and Yipeng Chen (3). (1) Newcastle University, (2) Independent Researcher, (3) University of Science and Technology Beijing"
Pseudocode | No | The paper describes the methodology in prose and with a diagram in Figure 1, but does not include a structured pseudocode or algorithm block.
Open Source Code | No | The paper neither states that source code will be released nor links to a code repository.
Open Datasets | Yes | "We assess our framework on fine-grained classification datasets (e.g., Stanford Cars [Krause et al., 2013], Oxford Pets [Parkhi et al., 2012], CIFAR-100 [Alex, 2009] and Food-101 [Bossard et al., 2014]), large-scale dataset ImageNet-1K [Deng et al., 2009]. For detection on MS-COCO dataset [Caesar et al., 2018]... This ongoing evaluation includes three 2D medical datasets [Fang et al., 2022], ISIC2018, CVC-ClinicDB, and Kvasir, and two 3D medical datasets [Wang et al., 2021], Synapse and ACDC."
Dataset Splits | Yes | "The results on the ImageNet-1K [Deng et al., 2009] dataset, as shown in Table ??. In the annotation-free setting... Table 3 and Table 4 demonstrate the remarkable gains achieved by our method with Grounding DINO-L as teacher across various object detection settings... on COCO val set."
Hardware Specification | No | The paper does not provide specific hardware details (GPU models, CPU types, or memory amounts) used for its experiments. It mentions that "Detailed implementations are in the Appendix," but the appendix is not included.
Software Dependencies | No | The paper does not list software dependencies with version numbers. It states that "Detailed implementations are in the Appendix," but this information is not available in the main body of the paper.
Experiment Setup | Yes | Table 8 presents an ablation study on KD configurations: (1) the ViT-L/14 teacher achieves the highest accuracy at 79.82%, followed by ViT-B/32 at 79.47% and ResNet50 at 78.95%; (2) a loss weight of 1 produces the best result at 79.82%, while lower or higher weights slightly reduce performance; (3) a temperature of 4 yields the highest accuracy of 79.82%, indicating this level of logit scaling is most effective for knowledge distillation.
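The paper's exact loss is not reproduced here, but the ablated hyperparameters (a loss weight and a distillation temperature applied to logits) map onto standard Hinton-style logit distillation. A minimal pure-Python sketch, assuming that formulation; the function names and the specific logit values are illustrative only:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over temperature-scaled logits (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0, weight=1.0):
    """Temperature-scaled KL divergence KL(teacher || student),
    multiplied by T^2 (to keep gradient magnitudes comparable across
    temperatures) and by the ablated loss weight."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return weight * temperature ** 2 * kl

# Identical logits give zero distillation loss; mismatched logits are penalized.
print(kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ~0.0
print(kd_loss([3.0, 0.0, 0.0], [0.0, 0.0, 3.0]))  # > 0
```

Under this reading, the ablation in Table 8 sweeps `temperature` (best at 4) and `weight` (best at 1) while this KD term is added to the usual cross-entropy objective on ground-truth labels.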