Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning

Authors: Amin Karimi Monsefi, Mengxi Zhou, Nastaran Monsefi, Ser-Nam Lim, Wei-Lun Chao, Rajiv Ramnath

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate the effectiveness of FOLK in achieving competitive performance to many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation. (Abstract) ... Through extensive experimentation, we demonstrate the efficacy of FOLK. Our findings indicate that FOLK performs on par or better than many state-of-the-art MIM and MFM techniques in various downstream tasks, including image classification, few-shot learning, and semantic segmentation. (Section 1 - Contributions) ... In this section, we detail the experimental setup and evaluate our proposed FOLK framework on classification tasks using both full fine-tuning and few-shot learning approaches. Additional experiments, including semantic segmentation as a downstream task and ablation studies, are provided in Appendix B for further insights and comprehensive results. (Section 4 - Experiments)
Researcher Affiliation | Academia | Amin Karimi Monsefi, Mengxi Zhou, Nastaran Karimi Monsefi, Ser-Nam Lim, Wei-Lun Chao, Rajiv Ramnath; EMAIL, EMAIL; The Ohio State University, Hamedan University of Technology, University of Central Florida
Pseudocode | Yes | The generation of Com and RCom filters is illustrated in Fig. 2, with the pseudocode available in Appendix A.4. (Section 3.2.1) ... Algorithm 1 presents the pseudocode for our proposed Com and RCom filters (denoted as M in Eq. 2 and Eq. 3). (Appendix A.4)
Open Source Code | Yes | https://github.com/aminK8/FOLK (Abstract)
Open Datasets | Yes | We adopt the ImageNet-1K training dataset (Deng et al., 2009) without labels for pre-training our self-supervised learning. (Section 4) ... For image classification, we continue to leverage the ImageNet-1K dataset (Deng et al., 2009) to assess the generalizability and effectiveness of the learned features. In contrast, for semantic segmentation, we utilize the ADE20K dataset (Zhou et al., 2017), a standard benchmark in scene parsing and segmentation tasks. (Section 4) ... Threshold(s) CIFAR-10 CIFAR-100 ImageNet-1K (Table 8, Appendix B.5.1)
Dataset Splits | Yes | In this experiment, we aim to highlight FOLK's superior adaptability and efficiency by fine-tuning pre-trained models using only 10% of the ImageNet-1K dataset over 200 epochs. (Section 4.2.2) ... Table 7 presents an extended evaluation of few-shot learning performance using a smaller set of labeled data. Various pre-trained models were fine-tuned using only 1% of the ImageNet-1K dataset over 1000 epochs. (Appendix B.4) ... We ran 200 epochs for fine-tuning the pre-trained model (i.e. ViT-S/16 or ViT-B/16) on ImageNet-1K for image classification... (Appendix A.2.1) ... The full fine-tuning ViT-S/16 model for semantic segmentation task with ADE20K dataset. (Table 3, Appendix B.1)
Hardware Specification | Yes | Our computational infrastructure supports these extensive experiments, consisting of four nodes, each of which has four NVIDIA A100 80GB GPUs, in total 16 GPUs. (Section 4)
Software Dependencies | No | We used the PyTorch library (Paszke et al., 2019) for our code development. (Appendix A.1) The paper mentions the PyTorch library but does not specify its version number or any other software dependencies with version numbers.
Experiment Setup | Yes | We employ the AdamW optimizer (Loshchilov & Hutter, 2019), with a pre-training duration set to 300 or 800 epochs, a batch size of 2048 (128 per GPU), and a peak learning rate of 1.2 × 10⁻³. Additional parameters include a cosine decay learning rate schedule, 20 warmup epochs, and a specific setting for optimizer momentum, (β1, β2) = (0.9, 0.95) (Chen et al., 2020a), with a weight decay of 0.05. Also, we used a value of 3.0 for gradient clipping to prevent the exploding gradient problem. (Appendix A.1 - Pre-train Stage) ... We ran 200 epochs for fine-tuning the pre-trained model (i.e. ViT-S/16 or ViT-B/16) on ImageNet-1K for image classification, employing the AdamW optimizer across all configurations with a weight decay of 0.05 and the optimizer momentum (β1, β2) = (0.9, 0.999). Moreover, the approach includes a cosine decay learning rate schedule (Li & Arora, 2020), with a layer-wise learning rate decay equal to 0.8 (Bao et al., 2021; Clark et al., 2020). We also utilized advanced augmentation techniques such as Mixup (Zhang et al., 2018) and CutMix (Yun et al., 2019), as well as label smoothing and random augmentation to further improve model robustness and generalization capability (Szegedy et al., 2016; Cubuk et al., 2020). The batch size is maintained at 2048, with a peak learning rate set at 8 × 10⁻³. (Appendix A.2.1 - Classification Task) ... L_tot = α L_dis + L_MFM (Eq. 6), where the hyperparameter α controls the weight between the two loss terms and is set to 1 in our experiments, unless stated otherwise. (Section 3.2.3)
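The Com and RCom filters quoted in the Pseudocode row operate in the frequency domain; the paper's actual pseudocode lives in its Appendix A.4 and is not reproduced here. As a rough illustration only, the sketch below builds a generic radial low-pass/high-pass binary mask of the kind frequency-guided masking schemes apply after a 2D FFT. The function name, the threshold semantics, and the radial-distance criterion are all assumptions for illustration, not the paper's Com/RCom definition.

```python
import math

def radial_frequency_mask(h, w, threshold, keep_low=True):
    """Binary (h, w) frequency-domain mask for an unshifted 2D FFT grid.

    Frequencies whose normalized radial distance from the DC component
    is at most `threshold` are kept when keep_low=True (low-pass);
    the complementary set is kept otherwise (high-pass).
    NOTE: hypothetical helper, not the paper's Com/RCom filter.
    """
    mask = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            # Signed frequency coordinates in [-0.5, 0.5), matching the
            # layout of numpy.fft.fftfreq (DC at index 0, no fftshift).
            fu = (u - h) / h if u >= (h + 1) // 2 else u / h
            fv = (v - w) / w if v >= (w + 1) // 2 else v / w
            r = math.sqrt(fu * fu + fv * fv)
            inside = r <= threshold
            mask[u][v] = 1.0 if inside == keep_low else 0.0
    return mask
```

By construction the low-pass mask and its high-pass complement partition the spectrum, which mirrors how a "Com" filter and its reverse ("RCom") would mask complementary frequency bands.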
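The pre-training schedule quoted in the Experiment Setup row (peak learning rate 1.2 × 10⁻³, 20 warmup epochs, cosine decay over a 300-epoch run) can be written out as a small standalone function. This is a minimal sketch of a standard linear-warmup plus cosine-decay schedule using the quoted numbers as defaults; it is not the authors' code, and the function name and `min_lr` floor are assumptions.

```python
import math

def lr_at_epoch(epoch, peak_lr=1.2e-3, warmup_epochs=20,
                total_epochs=300, min_lr=0.0):
    """Learning rate at a given epoch: linear warmup to peak_lr over
    warmup_epochs, then cosine decay toward min_lr (hypothetical sketch
    of the schedule described in the paper's Appendix A.1)."""
    if epoch < warmup_epochs:
        # Linear ramp: reaches peak_lr at the last warmup epoch.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For the fine-tuning stage the same shape would apply with a peak of 8 × 10⁻³ and 200 epochs, with the quoted layer-wise decay of 0.8 then scaling this base rate per transformer block.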