Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Self-Distillation for Further Pre-training of Transformers
Authors: Seanie Lee, Minki Kang, Juho Lee, Sung Ju Hwang, Kenji Kawaguchi
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the superiority of self-distillation over relevant baselines on various benchmark datasets for image and text classification tasks. |
| Researcher Affiliation | Collaboration | KAIST, AITRICS, National University of Singapore |
| Pseudocode | Yes | Algorithm 1 Self-Distillation; Algorithm 2 Further Pretrain |
| Open Source Code | No | The paper states 'We use Pytorch (Paszke et al., 2019) and transformers library (Wolf et al., 2020) from Huggingface to implement all the baselines and our proposed method in the experiments.' but does not provide a link to their own implementation or explicitly state that their code is open-source. |
| Open Datasets | Yes | For image classification problem, we use six datasets FGVC Aircraft (Aircraft) (Maji et al., 2013), Caltech UCSD Birds 200 (CUB) (Wah et al., 2011), Chest X-ray (Kermany et al., 2018), Describable Textures Dataset (DTD) (Cimpoi et al., 2014), Stanford Dogs (Khosla et al., 2011), and Oxford 102 Flower (Nilsback & Zisserman, 2008). For text classification problem, we use four datasets Chemprot (Kringelum et al., 2016), ACL-ARC (Jurgens et al., 2018), SCIERC (Luan et al., 2018), and Twitter-Emotion (Mohammad et al., 2018). |
| Dataset Splits | No | The paper mentions training and test sets but does not specify clear training/validation/test splits (e.g., percentages or exact counts) for the datasets used in the main experiments, beyond mentioning '50,000 training pairs' for CIFAR-100 without explicit split details. |
| Hardware Specification | Yes | We train a Vision Transformer (Dosovitskiy et al., 2021) on CUB dataset with an RTX 3090 GPU and Intel(R) Xeon(R) Silver 4210R CPU. |
| Software Dependencies | No | The paper states 'We use Pytorch (Paszke et al., 2019) and transformers library (Wolf et al., 2020) from Huggingface' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For the image classification problem, we use Vision Transformer... fine-tune it on the downstream task with AdamW optimizer... for 10,000 steps with batch size 32. Regarding further pre-training and self-distillation, we continue to pre-train the model for 20,000 steps with batch size 64. For text classification... fine-tune it on the target labeled dataset with AdamW optimizer for 10 epochs with batch size 32. In terms of further pre-training and self-distillation, we further pre-train RoBERTa for 100 epochs with batch size 128. Appendix E (Table 9) further specifies hyperparameters like learning rate, weight decay coefficient, and rounds of self-distillation. |
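The table above references the paper's Algorithm 1 (Self-Distillation) but the authors' implementation is not public. The following is a minimal PyTorch sketch of the general idea under stated assumptions: a teacher stands in for the further pre-trained model, the student starts from the same weights, and each step combines a masked-language-modeling loss with an MSE penalty pulling student hidden states toward the teacher's. The tiny encoder, the loss weighting `alpha`, and all hyperparameters here are illustrative, not the paper's exact formulation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, MASK_ID = 100, 32, 0  # toy vocabulary; id 0 serves as [MASK]

class TinyEncoder(nn.Module):
    """Stand-in for a pre-trained Transformer encoder with an MLM head."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(
            DIM, nhead=4, dim_feedforward=64, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, ids):
        h = self.enc(self.emb(ids))          # hidden states (B, T, DIM)
        return h, self.lm_head(h)            # plus MLM logits (B, T, VOCAB)

def self_distill_step(student, teacher, opt, ids, mask, alpha=1.0):
    """One hypothetical step: MLM loss on masked tokens + MSE to teacher states."""
    corrupted = ids.masked_fill(mask, MASK_ID)
    h_s, logits = student(corrupted)
    with torch.no_grad():                    # teacher is frozen
        h_t, _ = teacher(corrupted)
    mlm_loss = F.cross_entropy(logits[mask], ids[mask])
    distill_loss = F.mse_loss(h_s, h_t)
    loss = mlm_loss + alpha * distill_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

teacher = TinyEncoder()                       # plays the further pre-trained model
student = copy.deepcopy(teacher)              # student initialized from same weights
opt = torch.optim.AdamW(student.parameters(), lr=1e-3, weight_decay=0.01)

ids = torch.randint(1, VOCAB, (8, 16))        # fake token batch
mask = torch.zeros(8, 16, dtype=torch.bool)
mask[:, ::4] = True                           # deterministic ~25% masking
loss = self_distill_step(student, teacher, opt, ids, mask)
```

In the paper's setting this kind of step would be repeated for the stated budget (e.g., 100 further pre-training epochs for RoBERTa at batch size 128), possibly over several rounds of self-distillation with the teacher refreshed between rounds; see Algorithms 1-2 and Appendix E of the paper for the authors' actual procedure.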