Unified Wisdom: Harnessing Collaborative Learning to Improve Efficacy of Knowledge Distillation

Authors: Atharva Abhijit Tambat, Durga S, Ganesh Ramakrishnan, Pradeep Shenoy

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate MC-Distil across diverse student-teacher architectures and model sizes (Section 4.1.3), demonstrating consistent gains over SOTA benchmarks. MC-Distil improves the performance of each student, highlighting the interesting finding that smaller student models can help improve larger students in collaborative distillation. Finally, MC-Distil yields improved generalization and robustness (Section 4.4). Numerous tables (e.g., Table 1) also report evaluation on the datasets and comparisons with baselines.
Researcher Affiliation | Collaboration | Atharva Abhijit Tambat (Department of Computer Science and Engineering, Indian Institute of Technology Bombay) and Pradeep Shenoy (Google DeepMind)
Pseudocode | Yes | Algorithm 1, the MC-Distil approach: learning students S1, ..., Sk; training data D; validation data V; teacher T; and C-Net gϕ.
Open Source Code | Yes | We have released the code at the following URL: https://github.com/AtharvaTambat/MC-Distil
Open Datasets | Yes | For image classification, we evaluate across diverse datasets: CIFAR-100 (Krizhevsky, 2009), Tiny ImageNet (Le & Yang, 2015), and ImageNet-1K (Russakovsky et al., 2015). In addition, we consider iWildCam (Beery et al., 2020), Tiny ImageNet-C (Hendrycks & Dietterich, 2019), and Clothing1M (Xiao et al., 2015) as part of our further analysis. ... For object detection, we evaluate on the MS COCO benchmark (Lin et al., 2014) ... We illustrate this experiment on two datasets, namely: Instance CIFAR-100 (Xia et al., 2020) and Clothing1M (Xiao et al., 2015).
Dataset Splits | Yes | CIFAR-100 (Krizhevsky, 2009). The dataset consists of a total of 60K examples... the training set encompasses 50,000 examples, while the remaining 10K serve as the testing set. In our experimental setup, approximately 5K examples are allocated for use as a validation set... ImageNet-1K (Russakovsky et al., 2015). The dataset comprises over 1.2 million training images and 50,000 validation images... A subset of 50K images from the training set is held out as a validation set... Tiny ImageNet (Le & Yang, 2015). ...we have set aside an independent validation set comprising 10K examples... iWildCam-2020 (Beery et al., 2020). ...we trained the models using approx 100K training, 12K validation, and 12K test images... MS COCO (Lin et al., 2014). ...we use the standard training-validation split, selecting a subset of 5K images from the training set as a validation set...
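The splits quoted above all follow the same pattern: hold a fixed subset of the official training set out as a validation set. A minimal sketch of that pattern, assuming a random shuffle with a fixed seed (the helper name `holdout_split` and the seed are illustrative, not from the paper):

```python
import random

def holdout_split(n_total, n_val, seed=0):
    """Hold out n_val of n_total example indices as a validation set.

    Mirrors the splits described above, e.g. 5K of CIFAR-100's 50K
    training examples. Seed and shuffle strategy are assumptions.
    """
    rng = random.Random(seed)
    idx = list(range(n_total))
    rng.shuffle(idx)
    return idx[n_val:], idx[:n_val]  # (train indices, validation indices)

# CIFAR-100: 50K training examples, ~5K held out for validation.
train_idx, val_idx = holdout_split(50_000, 5_000)
```

The resulting index lists can then be passed to a dataset wrapper (e.g. a `Subset`) to materialize the two splits.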
Hardware Specification | Yes | We run our experiments on a mixture of GPUs, viz. A100s and RTX 2080s, as our experiments don't need any sophisticated modern GPUs.
Software Dependencies | No | For all the image classification datasets, we apply data augmentation using torchvision's transforms module: Random Crop, Random Resized Crop, Random Sized Crop, Random Horizontal Flip, Normalize, and Color Jitter. ... For all the KD baselines listed in Section 4.1.2 and MC-Distil, we use temperature τ = 2, employ the SGD optimizer to train the student models, and the Adam optimizer (Kingma & Ba, 2014) to train C-Net. ... We use cosine annealing (Loshchilov & Hutter, 2017) as the learning rate schedule for training the student models. The text names software components ('torchvision', 'SGD optimizer', 'Adam optimizer', and 'cosine annealing') but does not provide specific version numbers for any of them.
Experiment Setup | Yes | For all the KD baselines listed in Section 4.1.2 and MC-Distil, we use temperature τ = 2, employ the SGD optimizer to train the student models, and the Adam optimizer (Kingma & Ba, 2014) to train C-Net. We use a batch size of 400 for all datasets. We train the student models for 300 epochs on the CIFAR-100, Instance CIFAR-100, Clothing1M, Tiny ImageNet, and Tiny ImageNet-C datasets, and for 100 epochs on the iWildCam, ImageNet, and MS COCO datasets, while updating C-Net every 20 epochs (i.e., L = 20). We use cosine annealing (Loshchilov & Hutter, 2017) as the learning rate schedule for training the student models. We warm start each student model by first training it using the cross-entropy loss without the teacher model, for all KD baselines and MC-Distil. For all datasets, we perform a grid search over {0.1, 0.05} for the learning rate, {1e-5, 5e-5, 1e-4} for the weight decay, and {0.65, 0.75, 0.85, 0.95} for the momentum. For C-Net training we use a learning rate of 1e-3 and set the weight decay to 1e-4.
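The learning-rate schedule and hyperparameter grid quoted above can be sketched in a few lines. The closed-form cosine schedule below follows Loshchilov & Hutter (2017); the function name and the `min_lr=0` floor are assumptions, as the paper does not state a minimum learning rate:

```python
import math
from itertools import product

def cosine_annealing_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine-annealed learning rate: decays base_lr toward min_lr
    over total_epochs along a half cosine (Loshchilov & Hutter, 2017)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )

# The grid searched above: (learning rate, weight decay, momentum) triples.
grid = list(product([0.1, 0.05],
                    [1e-5, 5e-5, 1e-4],
                    [0.65, 0.75, 0.85, 0.95]))
```

The schedule starts at `base_lr` at epoch 0 and reaches `min_lr` at the final epoch; the grid enumerates 2 x 3 x 4 = 24 hyperparameter combinations per dataset.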