Simplifying Knowledge Transfer in Pretrained Models
Authors: Siddharth Jain, Shyamgopal Karthik, Vineet Gandhi
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across various tasks demonstrate the effectiveness of our proposed approach. In image classification, we improved the performance of ViT-B by approximately 1.4% through bidirectional knowledge transfer with ViT-T. For semantic segmentation, our method boosted all evaluation metrics by enabling knowledge transfer both within and across backbone architectures. In video saliency prediction, our approach achieved a new state-of-the-art. |
| Researcher Affiliation | Academia | Siddharth Jain EMAIL Center for Visual Information Technology International Institute of Information Technology, Hyderabad Shyamgopal Karthik EMAIL University of Tübingen Vineet Gandhi EMAIL Center for Visual Information Technology International Institute of Information Technology, Hyderabad |
| Pseudocode | Yes | Algorithm 1: Bi-KD Input: Training set X, label set Y, learning rate η, epochs Tmax, iterations Nmax, models f1 and f2 parameterized by θ1 and θ2 respectively |
| Open Source Code | Yes | The code is available at: https://github.com/Syd-J/Bi-KD |
| Open Datasets | Yes | ImageNet (Deng et al., 2009) consists of 1.2 million images for training and 50,000 images for validation. We report the results of our knowledge transfer between two or multiple models on the validation set. ADE20K (Zhou et al., 2017) provides 150 object and stuff categories, with 20,210 images in the training set and 2,000 images in the validation set. We use the validation set to evaluate our approach for knowledge transfer on semantic segmentation. DHF1K (Wang et al., 2018) is a benchmark dataset for video saliency prediction, comprising 600 videos in the training set and 100 videos in the validation set. We use the validation set for our evaluation. Hollywood-2 (Mathe & Sminchisescu, 2014) is the largest dataset for video saliency prediction in terms of the number of videos, containing 1,707 clips sourced from 69 Hollywood movies. |
| Dataset Splits | Yes | ImageNet (Deng et al., 2009) consists of 1.2 million images for training and 50,000 images for validation. ADE20K (Zhou et al., 2017) provides 150 object and stuff categories, with 20,210 images in the training set and 2,000 images in the validation set. DHF1K (Wang et al., 2018) is a benchmark dataset for video saliency prediction, comprising 600 videos in the training set and 100 videos in the validation set. Hollywood-2 (Mathe & Sminchisescu, 2014)... we use the predefined split of 823 videos for training and the remaining 884 videos for testing. |
| Hardware Specification | Yes | We implement all the networks and training procedures in Pytorch (Paszke et al., 2019), and conduct all experiments on a single NVIDIA RTX A6000. |
| Software Dependencies | No | We implement all the networks and training procedures in Pytorch (Paszke et al., 2019), and conduct all experiments on a single NVIDIA RTX A6000. ... The only exceptions are ViTs, for which we employ the default data augmentations and cosine scheduler provided by the timm (Wightman et al., 2019) library. Explanation: The paper mentions 'Pytorch' and 'timm library' with citations, but does not specify exact version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We use the Adam optimizer for image classification and video saliency prediction, while experiments on semantic segmentation utilize the AdamW optimizer. For all experiments, the learning rate and weight decay are set to 1e-6 and 1e-5 respectively, with the temperature parameter set to 1 in Equation 1. All models are trained in full precision for 20 epochs with a batch size of 128. We do not apply any data augmentations, learning rate schedulers, or layer-wise learning rate decay. |
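The pseudocode excerpt above (Algorithm 1: Bi-KD, two models f1 and f2, temperature set to 1 in the paper's Equation 1) describes a symmetric, bidirectional distillation objective. A minimal NumPy sketch of such a loss, assuming the standard temperature-softened KL formulation of knowledge distillation; the function names and the symmetric-sum form are illustrative, not the authors' exact implementation:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax along the last axis (numerically stable)."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=1.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in the standard distillation formulation."""
    p = softmax(teacher_logits, T)               # teacher soft targets
    log_q = np.log(softmax(student_logits, T))   # student log-probabilities
    return (T ** 2) * np.mean(np.sum(p * (np.log(p) - log_q), axis=-1))

def bi_kd_loss(logits1, logits2, T=1.0):
    """Bidirectional transfer: each model distills from the other.
    In an actual training loop each term's teacher side would be
    detached so gradients flow only into the current student."""
    return kd_loss(logits1, logits2, T) + kd_loss(logits2, logits1, T)

# Example: logits from two models on one sample (hypothetical values).
z1 = np.array([[2.0, 0.5, -1.0]])
z2 = np.array([[1.5, 1.0, -0.5]])
loss = bi_kd_loss(z1, z2, T=1.0)
```

By construction the objective is symmetric in the two models and vanishes when their predictive distributions coincide, consistent with the report's note that transfer runs in both directions (e.g. ViT-B with ViT-T).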