Preserving Angles Improves Feature Distillation

Authors: Evelyn Mannix, Liam Hodgkinson, Howard Bondell

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | It is shown that distillation with CosPress on a variety of datasets, including ImageNet, produces more accurate models with greater performance on generalisability, robustness and OOD detection benchmarks, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets. ... Section 4 Experiments: Feature distillation ... Section C Ablation Studies
Researcher Affiliation | Academia | Evelyn J. Mannix, School of Mathematics and Statistics, University of Melbourne; Liam Hodgkinson, School of Mathematics and Statistics, University of Melbourne; Howard Bondell, School of Mathematics and Statistics, University of Melbourne
Pseudocode | Yes | Algorithm 1: Algorithm for feature distillation with CosPress.
Open Source Code | Yes | Code is available at github.com/emannix/cospress.
Open Datasets | Yes | Student networks are distilled ... on the ImageNet-1K (Russakovsky et al., 2015) training dataset, as well as nine fine-grained classification benchmarks (Oxford Pets (Parkhi et al., 2012), FGVC Aircraft (Maji et al., 2013), Describable Textures (Cimpoi et al., 2014), Stanford Cars (Krause et al., 2013), CUB200 (Wah et al., 2011), CIFAR-10/100 (Krizhevsky et al., 2009), Flowers-102 (Nilsback & Zisserman, 2008) and Food-101 (Bossard et al., 2014)) and the Pascal VOC 2012 segmentation task (Everingham et al., 2012). ... We additionally consider the OpenOOD benchmarks (Yang et al., 2022a).
Dataset Splits | Yes | Vision Transformer (Dosovitskiy et al., 2021) models are distilled using larger DINOv2 teachers on the ImageNet-1K (Russakovsky et al., 2015) training dataset, comprising 1000 categories across more than 1.2 million training images. ... evaluations are undertaken on the ImageNet validation set.
Hardware Specification | Yes | Table 9: Training time. Comparison of training time on ImageNet for 300 epochs with a batch size of 1024 using Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions using an "AdamW optimizer (Loshchilov & Hutter, 2017b)", "cosine learning rate decay (Loshchilov & Hutter, 2017a)", a "repeated augmentation sampler (Fort et al., 2021)", and "RandAugment (Cubuk et al., 2020) image augmentations (Wightman, 2019)", but does not provide specific version numbers for these or other software libraries (e.g., PyTorch, Python, CUDA).
Experiment Setup | Yes | Following Proteus (Zhang et al., 2025), student networks are distilled for 300 epochs using a batch size of 1024, cosine learning rate decay with five warmup epochs (Loshchilov & Hutter, 2017a), an AdamW optimizer (Loshchilov & Hutter, 2017b), a repeated augmentation sampler with three views per image (Fort et al., 2021), and RandAugment (Cubuk et al., 2020) image augmentations (Wightman, 2019). An ablation study on the hyperparameters introduced by CosPress is provided in Section C of the supporting information.
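The paper's Algorithm 1 gives the exact CosPress procedure; as a rough illustration of the angle-preserving idea behind it, a feature-distillation loss can penalise mismatches between the pairwise cosine similarities of student and teacher features, so the student reproduces the teacher's angular structure rather than its raw feature values. This is a minimal sketch, not the paper's loss: the function name and the pairwise formulation are assumptions.

```python
import math

def angle_preserving_loss(student_feats, teacher_feats):
    """Illustrative angle-preserving distillation loss (assumed form,
    not the paper's Algorithm 1): mean squared difference between the
    pairwise cosine similarities of student and teacher features."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    n = len(student_feats)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = cos(student_feats[i], student_feats[j]) \
                 - cos(teacher_feats[i], teacher_feats[j])
            loss += diff * diff
            pairs += 1
    return loss / pairs
```

Because cosine similarity is invariant to positive per-vector rescaling, a student whose features are scaled copies of the teacher's incurs zero loss under this formulation.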
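The training schedule described in the setup row (300 epochs, cosine learning rate decay with five warmup epochs) can be sketched as a per-epoch learning-rate function. The base and minimum learning rates below are assumed placeholder values, not figures from the paper:

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, total_epochs=300,
                warmup_epochs=5, min_lr=0.0):
    """Cosine learning-rate decay with linear warmup (sketch).
    base_lr and min_lr are illustrative assumptions."""
    if epoch < warmup_epochs:
        # Linear warmup over the first `warmup_epochs` epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice such a schedule would typically be handed to the optimizer via a framework scheduler (e.g. a PyTorch `LambdaLR`); the standalone function just makes the warmup/decay shape explicit.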