LeOCLR: Leveraging Original Images for Contrastive Learning of Visual Representations
Authors: Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that LeOCLR consistently improves representation learning across various datasets, outperforming baseline models. For instance, LeOCLR surpasses MoCo-v2 by 5.1% on ImageNet-1K in linear evaluation and outperforms several other methods on transfer learning and object detection tasks. (Section 4: Experiments and Results) |
| Researcher Affiliation | Academia | Mohammad Alkhalefi, Department of Computing Science, University of Aberdeen; Georgios Leontidis, Department of Computing Science & Interdisciplinary Institute, University of Aberdeen; Mingjun Zhong, Department of Computing Science, University of Aberdeen |
| Pseudocode | Yes | Algorithm 1 Proposed Approach. 1: for X in dataloader do; 2: X1, X2 = rc(X) — random crop first and second views; 3: X, X1, X2 = augment(X, X1, X2) — apply random augmentation to all the views; 4: X = fq(X) — encode the original image; 5: X1 = fk(X1) — encode the first view with the momentum encoder; 6: X2 = fk(X2) — encode the second view with the momentum encoder; 7: loss1 = ℓ(X, X1) — computed as shown in eq. 1; 8: loss2 = ℓ(X, X2) — computed as shown in eq. 1; 9: lt = loss1 + loss2 — total loss as shown in eq. 2; 10: end for; 11: def rc(x): x = T.RandomResizedCrop((224, 224))(x) — T is the transforms module from torchvision; return x |
| Open Source Code | No | The paper does not explicitly provide a link to source code, a statement about code release, or mention code in supplementary materials. |
| Open Datasets | Yes | Datasets: We conducted multiple experiments on three datasets: STL-10 "unlabeled" with 100K training images (Coates & Ng, 2011), CIFAR-10 with 50K training images (Krizhevsky, 2009), and ImageNet-1K with 1.28M training images (Russakovsky et al., 2015). For instance, our approach outperforms vanilla MoCo-v2, achieving accuracy gains of 5.12% and 5.71% on STL-10 and CIFAR-10, respectively. Finally, we use the PASCAL VOC (Everingham et al., 2010) dataset for object detection. Non-object-centric datasets, such as COCO (Lin et al., 2014), portray real scenes where the objects of interest are not centered or prominently situated, as opposed to object-centric datasets like ImageNet-1K. We evaluate our self-supervised pretrained model using transfer learning by fine-tuning it on small datasets such as CIFAR (Krizhevsky, 2009), Stanford Cars (Krause et al., 2013), Oxford-IIIT Pets (Parkhi et al., 2012), and Birdsnap (Berg et al., 2014). |
| Dataset Splits | Yes | For linear evaluation, we followed the standard evaluation protocol (Chen et al., 2020a; He et al., 2020; Huynh et al., 2022; Dwibedi et al., 2021), where a linear classifier was trained for 100 epochs on top of a frozen backbone pre-trained with LeOCLR. The ImageNet-1K training set was used to train the linear classifier from scratch, with random cropping and left-to-right flipping augmentations. Results are reported on the ImageNet-1K validation set using a center crop (224 × 224). In the semi-supervised setting, we fine-tuned the network for 60 epochs using 1% labeled data and 30 epochs using 10% labeled data. We follow the same settings as MoCo-v2 (Chen et al., 2020b), fine-tuning on the VOC07+12 trainval dataset using Faster R-CNN with an R50-C4 backbone, and evaluating on the VOC07 test set. The model is fine-tuned for 24k iterations (~23 epochs). In the ablation study, we compare the fine-tuned representations of our approach with the reproduced vanilla MoCo-v2 (Chen et al., 2020b) across 1%, 2%, 5%, 10%, 20%, 50%, and 100% of the ImageNet-1K dataset, following the methodology in (Henaff, 2020; Grill et al., 2020). |
| Hardware Specification | Yes | To address concerns about the increased computational cost associated with training LeOCLR compared to MoCo-v2, we include the training time for both approaches in Table 7. We trained both models on three 80GB A100 GPUs for 200 epochs. |
| Software Dependencies | No | The paper mentions 'torchvision module' but does not specify any software versions for libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | Training Setup: We used ResNet-50 as the backbone, and the model was trained with the SGD optimizer, with a weight decay of 0.0001, momentum of 0.9, and an initial learning rate of 0.03. The mini-batch size was 256, and the model was trained for up to 800 epochs on ImageNet-1K. Evaluation: We used different downstream tasks to evaluate LeOCLR's representation learning against leading SOTA approaches on ImageNet-1K: linear evaluation, semi-supervised learning, transfer learning, and object detection. For linear evaluation, we followed the standard evaluation protocol (Chen et al., 2020a; He et al., 2020; Huynh et al., 2022; Dwibedi et al., 2021), where a linear classifier was trained for 100 epochs on top of a frozen backbone pre-trained with LeOCLR. The ImageNet-1K training set was used to train the linear classifier from scratch, with random cropping and left-to-right flipping augmentations. Results are reported on the ImageNet-1K validation set using a center crop (224 × 224). In the semi-supervised setting, we fine-tuned the network for 60 epochs using 1% labeled data and 30 epochs using 10% labeled data. |
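The pseudocode row references a per-view loss ℓ (eq. 1) and a total loss lt = loss1 + loss2 (eq. 2) without reproducing them. As a minimal sketch of one training step's loss computation, the snippet below assumes an InfoNCE-style contrastive loss over cosine similarities with a temperature τ and a MoCo-style negative queue; the function and variable names (`info_nce`, `queue`, `tau`) are illustrative, not the paper's exact notation, and the embeddings here are random placeholders standing in for encoder outputs:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.2):
    """InfoNCE-style loss: pull anchor toward positive, push away from negatives."""
    def l2norm(v):
        # L2-normalize so dot products are cosine similarities
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = l2norm(anchor), l2norm(positive), l2norm(negatives)
    pos = np.exp(a @ p / tau)            # similarity to the positive view
    neg = np.exp(n @ a / tau).sum()      # similarities to queued negatives
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
x  = rng.standard_normal(128)            # stands in for f_q(X), the encoded original
x1 = rng.standard_normal(128)            # stands in for f_k(X1), first cropped view
x2 = rng.standard_normal(128)            # stands in for f_k(X2), second cropped view
queue = rng.standard_normal((4096, 128)) # stands in for the momentum-encoder queue

loss1 = info_nce(x, x1, queue)           # eq. 1 applied to (X, X1)
loss2 = info_nce(x, x2, queue)           # eq. 1 applied to (X, X2)
lt = loss1 + loss2                       # total loss, as in algorithm line 9 (eq. 2)
```

Note the asymmetry the algorithm prescribes: the original image goes through the query encoder fq, while both crops go through the momentum encoder fk, so the anchor is always the full original image rather than another crop.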