Beyond Instance Consistency: Investigating View Diversity in Self-supervised Learning
Authors: Huaiyuan Qin, Muli Yang, Siyuan Hu, Peng Hu, Yu Zhang, Chen Gong, Hongyuan Zhu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive ablation studies, we demonstrate that SSL can still learn meaningful representations even when positive pairs lack strict instance consistency. Furthermore, our analysis reveals that increasing view diversity, by enforcing zero overlap or using smaller crop scales, can enhance downstream performance on classification and dense prediction tasks. We validate our findings across a range of settings, highlighting their robustness and applicability to diverse data sources. |
| Researcher Affiliation | Academia | Huaiyuan Qin EMAIL Institute for Infocomm Research (I2R), A*STAR, Singapore Muli Yang EMAIL Institute for Infocomm Research (I2R), A*STAR, Singapore Siyuan Hu EMAIL National University of Singapore, Singapore Peng Hu EMAIL Sichuan University, China Yu Zhang EMAIL Southeast University, China Chen Gong EMAIL Shanghai Jiaotong University, China Hongyuan Zhu EMAIL Institute for Infocomm Research (I2R), A*STAR, Singapore |
| Pseudocode | Yes | Algorithm A: Procedure for EMD-based Similarity Score |
| Open Source Code | No | The paper does not provide an explicit statement about the release of their source code or a link to a code repository for the methodology described in this paper. It mentions using existing toolboxes like MMDetection, MMRotate, Monocular-Depth-Estimation-Toolbox, and MMSegmentation for evaluation, but not their own implementation code. |
| Open Datasets | Yes | We conduct SSL pre-training on two datasets: COCO for non-iconic data and ImageNet-100 for object-centric data. COCO (Lin et al., 2014) is a large non-iconic dataset... ImageNet-100 is a subset of the object-centric dataset ImageNet-1K (Deng et al., 2009)... We evaluate the pre-trained models on a broad range of downstream evaluation tasks including classification, object detection, instance segmentation and depth prediction. For object detection, we use PASCAL VOC-0712 (Everingham et al., 2010) for general object detection, and DOTA-v1.0 (Xia et al., 2018) for aerial object detection. For classification, we utilize five small-scale classification datasets: CIFAR-10 (Krizhevsky et al., 2009a), CIFAR-100 (Krizhevsky et al., 2009b), DTD (Cimpoi et al., 2014), Oxford Pets (Parkhi et al., 2012), and STL-10 (Coates et al., 2011). Additionally, COCO (Lin et al., 2014) is included for the in-distribution evaluation on object detection and instance segmentation tasks. We also include depth prediction on NYUd (Silberman et al., 2012)... We provide additional validation results on the medical imaging domain... NIH Chest X-ray dataset (Wang et al., 2017). |
| Dataset Splits | Yes | For object detection, we use PASCAL VOC-0712 (Everingham et al., 2010) for general object detection, and DOTA-v1.0 (Xia et al., 2018) for aerial object detection... For MoCo-v2, we follow the evaluation protocol in Peng et al. (2022)... For DINO, we follow Caron et al. (2021)... To maintain consistency with standard classification evaluation, we filter out samples with multiple or missing labels, and report the Top-1 classification accuracy. |
| Hardware Specification | Yes | All pre-training and downstream experiments are conducted on NVIDIA RTX A6000 GPUs. Time is measured with a batch size of 256 on 8 NVIDIA RTX A6000 GPUs and an AMD EPYC 7543 32-core CPU, with the EMD solver from OpenCV. |
| Software Dependencies | No | The paper mentions several software components like PyTorch, MMDetection, MMRotate, Monocular-Depth-Estimation-Toolbox, MMSegmentation, and OpenCV, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | All models are pre-trained from scratch for 100 epochs. For the backbone, we use ResNet-50 (He et al., 2016) in MoCo-v2 (Chen et al., 2020b), and ViT-S (Dosovitskiy et al., 2021) with a patch size of 16 in DINO (Caron et al., 2021). Specifically, for MoCo-v2, we set the batch size as 256 and the learning rate as 0.3 with the SGD optimizer. For DINO, we set the batch size as 256 and the learning rate as 0.0005 with the AdamW optimizer. All other training hyper-parameters follow the original settings in their respective implementations. |
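The pseudocode row refers to the paper's Algorithm A, an EMD-based similarity score (computed with OpenCV's EMD solver per the hardware row). As a minimal sketch of the underlying idea only, and not the paper's implementation, the following pure-Python function computes the 1-D earth mover's distance between two equal-mass histograms, using the identity that 1-D EMD equals the summed absolute difference of the cumulative distributions; the derived similarity transform is likewise illustrative.

```python
def emd_1d(p, q):
    """1-D earth mover's distance between two histograms of equal total mass.

    For 1-D distributions on a grid with unit bin spacing, the EMD equals
    the sum of absolute differences of the cumulative distributions.
    """
    assert len(p) == len(q), "histograms must have the same number of bins"
    cum_diff = 0.0  # running difference of the two CDFs
    emd = 0.0
    for pi, qi in zip(p, q):
        cum_diff += pi - qi
        emd += abs(cum_diff)
    return emd


def emd_similarity(p, q):
    """Turn the distance into a similarity score in (0, 1]; illustrative only."""
    return 1.0 / (1.0 + emd_1d(p, q))
```

For the high-dimensional feature distributions used in the paper, a general linear-programming EMD solver (such as OpenCV's `cv2.EMD`) is required; this closed form applies only to the 1-D case.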
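The reported pre-training settings can be collected into a single configuration sketch. The dictionary layout and key names below are illustrative choices, not the authors' code; the values are the ones stated in the experiment-setup row.

```python
# Pre-training settings as reported in the paper (100 epochs, trained from scratch).
# The structure is a hypothetical summary, not the authors' configuration format.
PRETRAIN_CONFIGS = {
    "mocov2": {
        "backbone": "ResNet-50",
        "optimizer": "SGD",
        "batch_size": 256,
        "learning_rate": 0.3,
        "epochs": 100,
    },
    "dino": {
        "backbone": "ViT-S/16",  # ViT-S with a patch size of 16
        "optimizer": "AdamW",
        "batch_size": 256,
        "learning_rate": 5e-4,
        "epochs": 100,
    },
}
```

All other hyper-parameters follow the respective original implementations, so a faithful reproduction would merge this sketch over the upstream MoCo-v2 and DINO defaults rather than treat it as complete.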