Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation
Authors: Sheng-Feng Yu, Jia-Jiun Yao, Wei-Chen Chiu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on various datasets validate the superiority of our approach in terms of distillation efficiency, cross-architecture generalization, and transfer learning performance. |
| Researcher Affiliation | Collaboration | National Yang Ming Chiao Tung University; Macronix International Co., Ltd. |
| Pseudocode | Yes | Here we provide the pseudocode (i.e. Algorithm 1) together with a detailed but compact explanation to emphasize the systematic approach of our proposed method for self-supervised dataset distillation, which begins with initializing the framework, proceeds through a bilevel optimization process, and ends with training approximation networks to capture representation shifts due to the augmentations (e.g. rotations). |
| Open Source Code | No | The paper mentions using 'solo-learn library (da Costa et al., 2022)' for training the teacher model, but it does not explicitly state that the authors' own implementation code for the methodology described in the paper is open-source or provide a link to it. |
| Open Datasets | Yes | Datasets. CIFAR100 (Krizhevsky, 2009), Tiny ImageNet (Le & Yang, 2015), and ImageNet (Deng et al., 2009) are taken as our source datasets for performing self-supervised DD, while the distilled dataset is evaluated upon the target datasets (which include the source datasets themselves, CIFAR10 (Krizhevsky, 2009), CUB2011 (Wah et al., 2011), and Stanford Dogs (Khosla et al., 2011)). |
| Dataset Splits | Yes | The distilled dataset is evaluated upon the target datasets (which include the source datasets themselves, CIFAR10 (Krizhevsky, 2009), CUB2011 (Wah et al., 2011), and Stanford Dogs (Khosla et al., 2011) for the classification). ... The goal of our distilled dataset (...) is for further use of training a new model (...) to mimic the characteristics of the self-supervisedly pretrained teacher model gϕ, its evaluation follows the typical linear evaluation scheme of self-supervised learning works: the new model (...) learnt from (...) is frozen and coupled with a linear classifier, where the linear classifier is trained upon the supervised dataset of a downstream task. |
| Hardware Specification | Yes | Computational cost of distilling CIFAR-100 with storage buffer N = 100 using a single Nvidia RTX 4090 GPU card. |
| Software Dependencies | No | The inner model adopted in our approach utilizes convolutional layers that include batch normalization (Ioffe & Szegedy, 2015), ReLU activation, and average pooling. ... To optimize our distilled dataset, we employ the AdamW optimizer (Loshchilov & Hutter, 2019)... The ResNet18 model (He et al., 2016) serves as a self-supervised teacher gϕ and is trained with the Barlow Twins objective (Zbontar et al., 2021) (where the training is based on the solo-learn library (da Costa et al., 2022)). The paper mentions software components and libraries used but does not specify their version numbers. |
| Experiment Setup | Yes | The model pool for inner models (...) consists of 10 models, which are initialized and updated via full-batch gradient descent, with learning rate and momentum set to 0.1 and 0.9, respectively. The update steps Z are 1,000. To optimize our distilled dataset, we employ the AdamW optimizer (...), starting with a learning rate of 0.001 that is linearly decayed. This distillation process involves 30,000 outer iterations for CIFAR100 and 20,000 for Tiny ImageNet and ImageNet. ... Upon completion of the distillation process and stepping forward to evaluation, we pretrain a model (...) on the distilled dataset for 1,000 epochs. This pretraining employs a stochastic gradient descent (SGD) optimizer with a mini-batch size of 256, where the learning rate and momentum are maintained at 0.1 and 0.9, respectively. The weight decay parameters used during pretraining of the feature extractor are listed in Table 5; we set them depending on the size of the distilled dataset. For training the linear classifier to conduct linear evaluation, we standardize the experimental settings to utilize the SGD optimizer with a momentum of 0.9, excluding weight decay, and initiate the learning rate of the task-specific head to 0.2 with cosine scheduling. |
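The linear-evaluation settings quoted in the Experiment Setup row (SGD with momentum 0.9, no weight decay, base learning rate 0.2 with cosine scheduling) can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation; the function names `cosine_lr` and `sgd_momentum_step` are ours, and only the hyperparameter values come from the paper.

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 0.2) -> float:
    """Cosine-annealed learning rate, decaying from base_lr to 0.

    base_lr = 0.2 matches the task-specific head setting quoted above.
    """
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """One SGD update with momentum and no weight decay (per the quoted setting)."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

if __name__ == "__main__":
    total = 100
    print(cosine_lr(0, total))      # base lr 0.2 at the first step
    print(cosine_lr(total, total))  # decays to ~0 at the last step
```

In a real run these would correspond to `torch.optim.SGD(..., lr=0.2, momentum=0.9, weight_decay=0.0)` combined with a cosine learning-rate scheduler.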