$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs

Authors: Vlad Sobal, Mark Ibrahim, Randall Balestriero, Vivien Cabannes, Diane Bouchacourt, Pietro Astolfi, Kyunghyun Cho, Yann LeCun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our study spans three scales: ImageNet-1k with 1 million, CC3M with 3 million, and CC12M with 12 million samples. The representations learned via our objective outperform both contrastive self-supervised and vision-language models trained on the same data across a range of tasks.
Researcher Affiliation | Collaboration | ¹Meta FAIR, ²New York University, ³Brown University, ⁴Genentech, ⁵CIFAR
Pseudocode | Yes | Figure 1: a) The diagram of X-CLR. The X-CLR objective learns representations of images with the help of a soft relationship graph. The graph can be built from accompanying data, e.g. a taxonomy for biological data. In our experiments, we use captioned images and build similarities based on caption similarities. b) Python-style pseudo-code of X-CLR with similarity based on text captions.
Open Source Code | No | The paper does not provide concrete access to source code for the described methodology.
Open Datasets | Yes | We test X-CLR on three datasets of varying scale: ImageNet (Deng et al., 2009) (1M), and Conceptual Captions 3M and 12M (Sharma et al., 2018). ... We test on ImageNet classification with standard as well as with ImageNet Real labels, on ImageNet-9 to test robustness to background change (we refer to this as Background Decomposition in our results), on ObjectNet to test robustness to context and view change, and on MIT-States objects and attributes classification to test how well the model captures object states.
Dataset Splits | Yes | MIT-States: In order to evaluate on this dataset using linear probing, we split the dataset randomly into two even parts, one used for training the linear layer, the other for evaluation. ... ImageNet-9 (Xiao et al., 2020) proposes multiple benchmarks to test model robustness to background perturbation. The benchmark is created by taking samples from ImageNet, segmenting the object in the scene, and swapping out the background. Since the benchmark uses the same classes as ImageNet, we do not retrain the ImageNet classifier.
Hardware Specification | Yes | To train on ImageNet, we used 8 Nvidia V100s, and each run took about 30 hours.
Software Dependencies | No | We use the Sentence Transformer (Reimers and Gurevych, 2019) as the text encoder to construct similarities unless stated otherwise. ... We used the NLTK library (Bird et al., 2009) and the Wu-Palmer similarity (Wu and Palmer, 1994) between the class synsets. ... The paper mentions software components but does not provide specific version numbers for them.
Experiment Setup | Yes | All experiments on the ImageNet dataset were run for 100 epochs with a batch size of 1024. The learning rate was set to 0.075 for ImageNet models. For experiments on CC3M and CC12M, we used the standard SimCLR augmentations and a learning rate of 0.1. ... We train SimCLR, SupCon and X-CLR using the LARS optimizer (You et al., 2017). ... The output dimension of the projector is 128.
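The core idea described above (a contrastive loss whose targets come from a soft similarity graph over captions rather than one-hot positives) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `xclr_loss`, the two temperature parameters, and the use of random embeddings in place of a real image/text encoder are all assumptions for the sake of a runnable example.

```python
import torch
import torch.nn.functional as F


def xclr_loss(image_embeds, caption_embeds, temperature=0.1, target_temperature=0.1):
    """Sketch of an X-CLR-style soft contrastive loss.

    Unlike SimCLR's one-hot positives, the target distribution over the batch
    is derived from pairwise caption similarities (the "similarity graph").
    The paper builds caption embeddings with a Sentence Transformer; here any
    text embedding would do.
    """
    # Normalize so that dot products are cosine similarities.
    z = F.normalize(image_embeds, dim=-1)
    c = F.normalize(caption_embeds, dim=-1)

    # Soft targets: softmax over caption-caption similarities.
    targets = ((c @ c.T) / target_temperature).softmax(dim=-1)

    # Predicted distribution: softmax over image-image similarities.
    logits = (z @ z.T) / temperature

    # Cross-entropy between the soft target graph and the predictions.
    return -(targets * logits.log_softmax(dim=-1)).sum(dim=-1).mean()
```

Setting the target distribution to the identity matrix recovers a SimCLR-like one-hot objective, which is why this formulation is a strict generalization of standard contrastive learning.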
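The hyperparameters reported in the experiment-setup row can be collected into a single config fragment for reference. The values below come from the quotes above; the dictionary field names are illustrative, not taken from the authors' code.

```python
# Reported ImageNet training setup (values from the paper's text;
# key names are illustrative placeholders).
IMAGENET_CONFIG = {
    "epochs": 100,
    "batch_size": 1024,
    "learning_rate": 0.075,
    "optimizer": "LARS",          # You et al., 2017
    "projector_output_dim": 128,
}

# Reported CC3M / CC12M setup; augmentations follow standard SimCLR.
CC_CONFIG = {
    "learning_rate": 0.1,
    "augmentations": "standard SimCLR",
    "optimizer": "LARS",
    "projector_output_dim": 128,
}
```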