LeOCLR: Leveraging Original Images for Contrastive Learning of Visual Representations

Authors: Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

TMLR 2024

Reproducibility Variable — Result — LLM Response
Research Type: Experimental
LLM Response: Our experiments demonstrate that LeOCLR consistently improves representation learning across various datasets, outperforming baseline models. For instance, LeOCLR surpasses MoCo-v2 by 5.1% on ImageNet-1K in linear evaluation and outperforms several other methods on transfer learning and object detection tasks. (Section 4, Experiments and Results)
Researcher Affiliation: Academia
LLM Response: Mohammad Alkhalefi (EMAIL), Department of Computing Science, University of Aberdeen; Georgios Leontidis (EMAIL), Department of Computing Science & Interdisciplinary Institute, University of Aberdeen; Mingjun Zhong (EMAIL), Department of Computing Science, University of Aberdeen.
Pseudocode: Yes
LLM Response: Algorithm 1 (Proposed Approach):

    for X in dataloader:
        X1, X2 = rc(X)                  # random-crop the first and second views
        X, X1, X2 = augment(X, X1, X2)  # apply random augmentation to all views
        X  = fq(X)                      # encode the original image
        X1 = fk(X1)                     # encode the first view with the momentum encoder
        X2 = fk(X2)                     # encode the second view with the momentum encoder
        loss1 = ℓ(X, X1)                # computed as shown in Eq. 1
        loss2 = ℓ(X, X2)                # computed as shown in Eq. 1
        lt = loss1 + loss2              # total loss, computed as shown in Eq. 2

    def rc(x):
        # T is the transforms module from torchvision
        return T.RandomResizedCrop((224, 224))(x)
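The loss computation in Algorithm 1 can be sketched numerically. This is a minimal NumPy illustration, not the authors' implementation: it assumes Eq. 1 is a MoCo-style InfoNCE loss with a queue of negatives, and the embeddings `z`, `z1`, `z2`, the queue, and the temperature `tau` are all placeholder values for illustration.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    # Project embeddings onto the unit sphere, as is standard in contrastive learning.
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def info_nce(q, k, queue, tau=0.2):
    """Assumed form of Eq. 1: InfoNCE between query q and positive key k,
    with a queue of negative keys. q, k: (B, D); queue: (K, D); all L2-normalized."""
    pos = np.sum(q * k, axis=1, keepdims=True)        # (B, 1) positive similarities
    neg = q @ queue.T                                 # (B, K) negative similarities
    logits = np.concatenate([pos, neg], axis=1) / tau # (B, 1+K) temperature-scaled
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
B, D, K = 4, 8, 16
z     = l2_normalize(rng.normal(size=(B, D)))  # stands in for fq(X), the original image
z1    = l2_normalize(rng.normal(size=(B, D)))  # stands in for fk(X1), first view
z2    = l2_normalize(rng.normal(size=(B, D)))  # stands in for fk(X2), second view
queue = l2_normalize(rng.normal(size=(K, D)))  # stands in for the negative queue

# Eq. 2 of the paper: total loss is the sum of the two view losses.
loss_total = info_nce(z, z1, queue) + info_nce(z, z2, queue)
```

The key structural point matches the pseudocode: the original image embedding serves as the query against both crop embeddings, and the two per-view losses are simply summed.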
Open Source Code: No
LLM Response: The paper does not explicitly provide a link to source code, a statement about code release, or mention code in supplementary materials.
Open Datasets: Yes
LLM Response: Datasets: We conducted multiple experiments on three datasets: STL-10 "unlabeled" with 100K training images (Coates & Ng, 2011), CIFAR-10 with 50K training images (Krizhevsky, 2009), and ImageNet-1K with 1.28M training images (Russakovsky et al., 2015). For instance, our approach outperforms vanilla MoCo-v2 by 5.12% and 5.71% on STL-10 and CIFAR-10, respectively. Finally, we use the PASCAL VOC (Everingham et al., 2010) dataset for object detection. Non-object-centric datasets, such as COCO (Lin et al., 2014), portray real scenes where the objects of interest are not centered or prominently situated, as opposed to object-centric datasets like ImageNet-1K. We evaluate our self-supervised pretrained model using transfer learning by fine-tuning it on small datasets such as CIFAR (Krizhevsky, 2009), Stanford Cars (Krause et al., 2013), Oxford-IIIT Pets (Parkhi et al., 2012), and Birdsnap (Berg et al., 2014).
Dataset Splits: Yes
LLM Response: For linear evaluation, we followed the standard evaluation protocol (Chen et al., 2020a; He et al., 2020; Huynh et al., 2022; Dwibedi et al., 2021), where a linear classifier was trained for 100 epochs on top of a frozen backbone pre-trained with LeOCLR. The ImageNet-1K training set was used to train the linear classifier from scratch, with random cropping and left-to-right flipping augmentations. Results are reported on the ImageNet-1K validation set using a center crop (224 × 224). In the semi-supervised setting, we fine-tuned the network for 60 epochs using 1% labeled data and 30 epochs using 10% labeled data. We follow the same settings as MoCo-v2 (Chen et al., 2020b), fine-tuning on the VOC07+12 trainval dataset using Faster R-CNN with an R50-C4 backbone, and evaluating on the VOC07 test set. The model is fine-tuned for 24k iterations (~23 epochs). In the ablation study, we compare the fine-tuned representations of our approach with the reproduced vanilla MoCo-v2 (Chen et al., 2020b) across 1%, 2%, 5%, 10%, 20%, 50%, and 100% of the ImageNet-1K dataset, following the methodology in (Henaff, 2020; Grill et al., 2020).
Hardware Specification: Yes
LLM Response: To address concerns about the increased computational cost associated with training LeOCLR compared to MoCo-v2, we include the training time for both approaches in Table 7. We trained both models on three A100 GPUs (80 GB each) for 200 epochs.
Software Dependencies: No
LLM Response: The paper mentions the 'torchvision module' but does not specify versions for any libraries, frameworks, or programming languages.
Experiment Setup: Yes
LLM Response: Training Setup: We used ResNet-50 as the backbone, and the model was trained with the SGD optimizer, with a weight decay of 0.0001, momentum of 0.9, and an initial learning rate of 0.03. The mini-batch size was 256, and the model was trained for up to 800 epochs on ImageNet-1K. Evaluation: We used different downstream tasks to evaluate LeOCLR's representation learning against leading SOTA approaches on ImageNet-1K: linear evaluation, semi-supervised learning, transfer learning, and object detection. For linear evaluation, we followed the standard evaluation protocol (Chen et al., 2020a; He et al., 2020; Huynh et al., 2022; Dwibedi et al., 2021), where a linear classifier was trained for 100 epochs on top of a frozen backbone pre-trained with LeOCLR. The ImageNet-1K training set was used to train the linear classifier from scratch, with random cropping and left-to-right flipping augmentations. Results are reported on the ImageNet-1K validation set using a center crop (224 × 224). In the semi-supervised setting, we fine-tuned the network for 60 epochs using 1% labeled data and 30 epochs using 10% labeled data.
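The reported training hyperparameters map onto a standard PyTorch configuration fragment. This is a sketch under stated assumptions, not the authors' code: it assumes torchvision's stock ResNet-50 and standard `torch.optim.SGD`; the learning-rate schedule beyond the initial value of 0.03 is not given in the excerpt, so none is shown.

```python
# Config fragment reconstructing the stated pre-training setup (assumptions:
# torchvision ResNet-50, plain SGD; no LR schedule is specified in the paper excerpt).
import torch
import torchvision

backbone = torchvision.models.resnet50()   # ResNet-50 backbone

optimizer = torch.optim.SGD(
    backbone.parameters(),
    lr=0.03,             # initial learning rate
    momentum=0.9,        # SGD momentum
    weight_decay=1e-4,   # weight decay 0.0001
)

batch_size = 256         # mini-batch size
max_epochs = 800         # trained for up to 800 epochs on ImageNet-1K
```

Note that this covers only the pre-training optimizer; the linear-evaluation and semi-supervised fine-tuning stages described above use their own, separately trained classifier heads.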