Discovering Global False Negatives On the Fly for Self-supervised Contrastive Learning

Authors: Vicente Balmaseda, Bokun Wang, Ching-Long Lin, Tianbao Yang

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on image and image-text data demonstrate the effectiveness of the proposed method. Our implementation is available at https://github.com/vibalcam/GloFND. In this section, we evaluate GLOFND in unimodal, semi-supervised unimodal, and bimodal scenarios. It is not our focus to leverage multiple techniques for achieving state-of-the-art performance, but to showcase GLOFND's improvements in identifying false negatives across different settings while being scalable to large-scale datasets (with negligible overhead) and compatible with small batch sizes. Additionally, we perform an ablation study to analyze the effect of the different components of GLOFND.
Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, Texas A&M University, College Station, USA. Correspondence to: Vicente Balmaseda <EMAIL>.
Pseudocode | Yes | Algorithm 1 SogCLR + GLOFND
1: Initialize: w ∈ R^d, u ∈ R^n, and λ ∈ R^n
2: for t = 1, ..., T do
3:   Draw a batch of B samples B ⊂ D and data augmentations A, A', and construct B_i = {A(x), A'(x) | x ∈ B \ {x_i}} for each x_i ∈ B
4:   for x_i ∈ B do
5:     Update λ_i according to (4)
6:     Construct B̃_i by excluding the false negatives identified via λ_i, and compute ĝ(w; x_i, A, B̃_i)
7:     Update u_{i,t} according to (5)
8:   end for
9:   Compute the gradient estimator ∇̂_w
10:  Update w by the momentum or Adam method
11: end for
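As a rough illustration of steps 5-6 above, the following Python sketch maintains a per-anchor threshold λ_i so that roughly an α fraction of that anchor's negative similarities exceed it, then drops negatives above the threshold as likely false negatives. The quantile-style update rule, function names, and hyperparameters here are illustrative assumptions, not the paper's exact update (4).

```python
import numpy as np

def update_lambda(lam_i, sims, alpha=0.01, lr=0.01):
    # Quantile-style step (an assumption, not the paper's exact rule (4)):
    # raise lambda_i when more than an alpha fraction of negative
    # similarities exceed it, lower it otherwise.
    frac_above = float(np.mean(sims > lam_i))
    return lam_i + lr * (frac_above - alpha)

def exclude_false_negatives(sims, lam_i):
    # Negatives with similarity above lambda_i are treated as likely
    # false negatives and removed from the contrastive loss.
    return sims[sims <= lam_i]

# Toy usage: lambda_i drifts toward the (1 - alpha) quantile of the
# anchor's negative similarities.
rng = np.random.default_rng(0)
sims = rng.uniform(-1.0, 1.0, size=1000)  # stand-in cosine similarities
lam = 1.0                                 # initialized as in the paper
for _ in range(3000):
    lam = update_lambda(lam, sims, alpha=0.01, lr=0.01)
kept = exclude_false_negatives(sims, lam)
```

With this toy update, λ settles near the 99th percentile of the similarities, so about 1% of negatives end up excluded per anchor.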
Open Source Code | Yes | Our implementation is available at https://github.com/vibalcam/GloFND.
Open Datasets | Yes | We run our experiments on ImageNet100 (Wu et al., 2019), a subset of ImageNet with 100 randomly selected classes (about 128k images), and report scores on its official validation split. Additionally, we examine the transfer learning performance on Food-101 (Bossard et al., 2014), CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), Stanford Cars (Krause et al., 2013), the Describable Textures Dataset (DTD) (Cimpoi et al., 2014), Oxford-IIIT Pets (Parkhi et al., 2012), Caltech-101 (Li et al., 2022), and Oxford 102 Flowers (Nilsback & Zisserman, 2008). For bimodal learning, we use the Conceptual Captions 3M (CC3M) (Sharma et al., 2018) dataset. We evaluate the performance by leveraging the DataComp benchmark (Gadre et al., 2023), which includes 38 zero-shot downstream tasks. We report the average performance, named DataComp. For each scenario, we select the model with the best DataComp average and also report its average performance on two subsets of the tasks: zero-shot image classification on ImageNet-1k (Russakovsky et al., 2015) and 6 ImageNet distribution shift datasets (Wang et al., 2019; Recht et al., 2019; Hendrycks et al., 2021b;a; Barbu et al., 2019) (IN & Variants), and zero-shot cross-modal image-text retrieval on Flickr30K (Plummer et al., 2017), MSCOCO (Lin et al., 2015), and WinoGAViL (Bitton et al., 2022).
Dataset Splits | Yes | That is, we freeze the weights of the encoder at the last iteration of pretraining, remove its projection head, and train a linear classifier on top of the encoder's output. We follow a semi-supervised learning setup, where we use different fractions of labeled training data during linear evaluation, i.e., we train on random subsets of 100% (full dataset), 10%, 1%, and 0.1% of the training data. We report the top-1 accuracy on the validation set and average the performance across percentages, obtaining the overall semi-supervised score. We train for 90, 285, 900, and 900 epochs corresponding to 100%, 10%, 1%, and 0.1% labeled data, respectively, with a batch size of 1024 and early stopping if the validation accuracy does not improve for 100 epochs. Caltech-101 defines no train/test split, so we randomly select 20% of images per class to create the test set.
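Since Caltech-101 ships without an official train/test split, the per-class 20% hold-out described above can be sketched as follows; the function name and the max(1, ...) rounding choice are illustrative assumptions:

```python
import numpy as np

def per_class_split(labels, test_frac=0.2, seed=0):
    """Hold out a random fraction of images per class as a test set
    (used when a dataset, like Caltech-101, defines no official split)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    test_mask = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        # Shuffle this class's indices and mark the first test_frac as test.
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_test = max(1, int(round(test_frac * len(idx))))
        test_mask[idx[:n_test]] = True
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy usage: 5 classes with 10 images each -> 2 test images per class.
labels = np.repeat(np.arange(5), 10)
train_idx, test_idx = per_class_split(labels, test_frac=0.2)
```

Splitting per class rather than globally keeps the class distribution of the test set matched to the training set, which matters for the long-tailed classes in Caltech-101.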
Hardware Specification | Yes | The unimodal experiments are run on a single NVIDIA A30 with 24GB memory, while the bimodal experiments make use of a multi-node setup with 2 nodes, each with 2 NVIDIA A100 GPUs with 40GB each. All the experiments are implemented using the PyTorch (Paszke et al., 2019) library.
Software Dependencies | No | All the experiments are implemented using the PyTorch (Paszke et al., 2019) library. We adopt the same augmentation pipeline as in SogCLR (Yuan et al., 2022), utilizing the torchvision implementation.
Experiment Setup | Yes | Following previous work (Yuan et al., 2022), we pretrain ResNet-50 (He et al., 2015) with a 2-layer 128×128 projection head on top of the backbone encoder. We pretrain for 200 epochs with a batch size of 128 and the same set of augmentations as in SogCLR. We use the LARS optimizer (You et al., 2017) with square-root learning rate scaling (0.075 × sqrt(BatchSize)) and a cosine decay schedule without restarts. For SogCLR, we set the temperature τ to 0.1 and γ = 0.9. We start using GLOFND when we reach 70 epochs. We use α = 0.01, initialize λ_i = 1, and learn it with Adam with a learning rate of 0.05 (β1 = 0.9, β2 = 0.98) during the remaining epochs. For FNC, we set α = 0.01 and tune the starting epoch in {10, 30, 50, 70, 90, 110, 130}, choosing the value that achieves the best semi-supervised average performance. We train for 90, 285, 900, and 900 epochs corresponding to 100%, 10%, 1%, and 0.1% labeled data, respectively, with a batch size of 1024 and early stopping if the validation accuracy does not improve for 100 epochs. We use AdamW (Loshchilov & Hutter, 2019) with a weight decay of 0, momentum of 0.9, and a learning rate of 0.1. The same augmentation pipeline used in SogCLR is applied for linear evaluation.
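The learning-rate recipe above (square-root batch-size scaling with cosine decay and no restarts) can be sketched as follows; the function signature is an assumption and warmup is omitted for simplicity:

```python
import math

def learning_rate(step, total_steps, batch_size, base_lr=0.075):
    """Square-root batch-size LR scaling with cosine decay, no restarts."""
    peak = base_lr * math.sqrt(batch_size)  # 0.075 * sqrt(BatchSize)
    progress = step / max(1, total_steps)   # fraction of training done
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Peak LR for batch size 128 is 0.075 * sqrt(128), reached at step 0;
# the schedule then decays smoothly to 0 at the final step.
lr_start = learning_rate(0, 1000, batch_size=128)
lr_end = learning_rate(1000, 1000, batch_size=128)
```

In practice this would be evaluated once per step (or per epoch) and fed to the LARS optimizer as its base learning rate.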