Preserving Angles Improves Feature Distillation
Authors: Evelyn Mannix, Liam Hodgkinson, Howard Bondell
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | It is shown that distillation with CosPress on a variety of datasets, including ImageNet, produces more accurate models with greater performance on generalisability, robustness and OOD detection benchmarks, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets. ... Section 4 Experiments: Feature distillation ... Section C Ablation Studies |
| Researcher Affiliation | Academia | Evelyn J. Mannix (EMAIL), School of Mathematics and Statistics, University of Melbourne; Liam Hodgkinson (EMAIL), School of Mathematics and Statistics, University of Melbourne; Howard Bondell (EMAIL), School of Mathematics and Statistics, University of Melbourne |
| Pseudocode | Yes | Algorithm 1: Algorithm for feature distillation with CosPress. |
| Open Source Code | Yes | Code is available at github.com/emannix/cospress. |
| Open Datasets | Yes | student networks are distilled ... on the ImageNet-1K (Russakovsky et al., 2015) training dataset... as well as nine fine-grained classification benchmarks (Oxford Pets (Parkhi et al., 2012), FGVC Aircraft (Maji et al., 2013), Describable Textures (Cimpoi et al., 2014), Stanford Cars (Krause et al., 2013), CUB200 (Wah et al., 2011), CIFAR-10/100 (Krizhevsky et al., 2009), Flowers-102 (Nilsback & Zisserman, 2008) and Food-101 (Bossard et al., 2014)) and the Pascal VOC 2012 segmentation task (Everingham et al., 2012). ... We additionally consider the OpenOOD benchmarks (Yang et al., 2022a). |
| Dataset Splits | Yes | Vision Transformer (Dosovitskiy et al., 2021) models are distilled using larger DINOv2 teachers on the ImageNet-1K (Russakovsky et al., 2015) training dataset, comprising 1000 categories across more than 1.2 million training images. ... evaluations are undertaken on the ImageNet validation set |
| Hardware Specification | Yes | Table 9: Training time. Comparison of training time on ImageNet for 300 epochs with a batch size of 1024 using Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions using an "AdamW optimizer (Loshchilov & Hutter, 2017b)", "cosine learning rate decay (Loshchilov & Hutter, 2017a)", "repeated augmentation sampler (Fort et al., 2021)", and "RandAugment (Cubuk et al., 2020) image augmentations (Wightman, 2019)", but does not provide specific version numbers for these or other software libraries (e.g., PyTorch, Python, CUDA). |
| Experiment Setup | Yes | Following Proteus (Zhang et al., 2025), student networks are distilled for 300 epochs using a batch size of 1024, cosine learning rate decay with five warmup epochs (Loshchilov & Hutter, 2017a), an AdamW optimizer (Loshchilov & Hutter, 2017b), a repeated augmentation sampler with three views per image (Fort et al., 2021), and RandAugment (Cubuk et al., 2020) image augmentations (Wightman, 2019). An ablation study on the hyperparameters introduced by CosPress is provided in Section C of the supporting information. |
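The paper's Algorithm 1 is not reproduced in this summary. As an illustrative sketch only, assuming the angle-preserving idea in the title means matching the pairwise cosine-similarity (angular) structure of teacher and student features within a batch, one could write the loss as below. The function names and the mean-squared penalty are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cosine_similarity_matrix(feats):
    """Row-normalise (batch, dim) features and return the
    pairwise cosine-similarity matrix of shape (batch, batch)."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def angle_preserving_loss(student_feats, teacher_feats):
    """Mean squared difference between the student's and teacher's
    within-batch cosine-similarity (angle) structure."""
    s = cosine_similarity_matrix(student_feats)
    t = cosine_similarity_matrix(teacher_feats)
    return float(np.mean((s - t) ** 2))
```

The loss is zero whenever the student reproduces the teacher's angular geometry exactly, regardless of feature scale, which is the sense in which angles (rather than raw feature values) are preserved.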
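The reported schedule (cosine learning rate decay with five warmup epochs over 300 epochs) can be sketched as a standalone function. The base learning rate of 1e-3 here is a placeholder, not a value stated in the quoted setup.

```python
import math

def lr_schedule(epoch, total_epochs=300, warmup_epochs=5, base_lr=1e-3):
    """Linear warmup over the first `warmup_epochs`, then cosine
    decay toward zero for the remaining epochs."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this would drive the per-step learning rate of the AdamW optimizer mentioned in the setup; here it is expressed per epoch for brevity.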