Discovering Clone Negatives via Adaptive Contrastive Learning for Image-Text Matching
Authors: Renjie Pan, Jihao Dong, Hua Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across several tasks demonstrate the effectiveness of AdaCL in image-text matching. Furthermore, we extend AdaCL to weakly-supervised image-text matching by replacing human-annotated descriptions with automatically generated captions, thereby increasing the number of potential clone negatives. AdaCL maintains robustness in this setting, alleviating the reliance on crowd-sourced annotations and laying a foundation for scalable vision-language contrastive learning. |
| Researcher Affiliation | Academia | Renjie Pan, Jihao Dong, Hua Yang; Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University |
| Pseudocode | Yes | Algorithm 1: Adaptive Contrastive Learning. Input: a mini-batch of N image-text pairs, with N positives and N·M negatives. Output: L_ada. (1) for each mini-batch do: (2) select in-batch grounding salient negatives and clone negatives, S_sln = {s_{i+, j} : j ≤ M} with i+ = argmax_i SalientScore_i, and S_cln = {s_{i−, j} : j ≤ M} with i− = argmin_i SalientScore_i; (3) sort out the in-batch potential clone negatives S̃ and select the anchor based on S̃: p(C∣s) = 1 / (1 + (π_c̄·σ_c)/(π_c·σ_c̄) · exp[(s − µ_c)²/(2σ_c²) − (s − µ_c̄)²/(2σ_c̄²)]), S̃ := {s : p(C∣s) > p(C̄∣s)}, anchor := s_pos∣δ = median(S̃); (4) obtain the probability of the anchor for tuning, p̂_u = exp[m1(anchor − m2)] / (exp[m1(anchor − m2)] + P); (5) compute m1 and m2 according to Eq. 4 and Eq. 6, m1 = log(ε·p̂_u / ((1 − ε)(1 − p̂_u))) / (anchor − 1), m2 = anchor + log((1 − p̂_u) / (p̂_u·P)) / m1; (6) update p̂_i(I) and L_ada, p̂_i(I) = exp[m1(s(I, T_i) − m2)] / (exp[m1(s(I, T_i) − m2)] + Σ_{j=1, j≠i}^{M+1} exp[s(I, T_j)]), L_ada = E_{I∼D}[H(y(I), p̂(I))]. |
| Open Source Code | No | The paper does not provide a direct link to a code repository or an explicit statement about releasing the source code for their proposed method. It mentions releasing 'datasets based on pseudo captions' but not the code. |
| Open Datasets | Yes | We evaluate AdaCL on two image-text matching datasets. (1) Flickr30K (Young et al., 2014) consists of 31,783 images, with a training/test/validation split of 29,783/1,000/1,000. (2) MS-COCO (Lin et al., 2014) consists of 123,287 images, with a training/test/validation split of 113,287/5,000/5,000. |
| Dataset Splits | Yes | Datasets. We evaluate AdaCL on two image-text matching datasets. (1) Flickr30K (Young et al., 2014) consists of 31,783 images, with a training/test/validation split of 29,783/1,000/1,000. (2) MS-COCO (Lin et al., 2014) consists of 123,287 images, with a training/test/validation split of 113,287/5,000/5,000. The test sets are divided into MS-COCO 5-fold 1K (average results over 5 test sets) and MS-COCO 5K (results on 5,000 test images). |
| Hardware Specification | Yes | All experiments are performed on four NVIDIA Tesla V100s. |
| Software Dependencies | No | The paper mentions several frameworks and tools used (ResNet, BiGRU, Faster R-CNN, BERT, CLIP, Adam optimizer, BLIP, GIT, BLIP-2, CoCa) but does not provide specific version numbers for the software dependencies, such as Python, PyTorch, or TensorFlow, that would be required for full reproducibility. |
| Experiment Setup | Yes | Training Details. All experiments are performed on four NVIDIA Tesla V100s. For image-text matching, we use a mini-batch size of 64 and the Adam optimizer. The learning rate is 0.0002 and decays by 15% every 10 epochs after epoch 20. The maximum sentence length is a = 32. For Faster R-CNN, the region number is n = 36. The dimension of the joint embedding space D is set to 256. We follow (He et al., 2020) in using a momentum memory bank, where the momentum coefficient z is set to 0.99 and the size M is 4096. For AdaCL, p̂_u is set to 0.03 and ε = e⁻⁷; m1 and m2 are initialized to 20 and 0.1, respectively, for adaptive tuning. |
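Steps 4–6 of the extracted Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names and the toy similarity matrix are assumptions, the margin-tuning formula for m1 follows the reconstructed Eq. 4 above, and the Gaussian-mixture anchor selection of steps 2–3 is omitted. The hyper-parameter defaults (p̂_u = 0.03, ε = e⁻⁷, initial m1 = 20, m2 = 0.1) follow the reported experiment setup.

```python
import numpy as np

def tune_margins(anchor, neg_sum, p_u=0.03, eps=np.exp(-7)):
    """Sketch of steps 4-5: re-derive m1 and m2 from the anchor similarity.

    `anchor` is the similarity selected from the clone-negative distribution;
    `neg_sum` is P, the summed exp-similarities of the anchor's negatives.
    The exact form of Eq. 4 (m1) is an assumption reconstructed from the
    garbled extraction; m2 exactly inverts the definition of p_hat_u.
    """
    m1 = np.log(eps * p_u / ((1 - eps) * (1 - p_u))) / (anchor - 1.0)
    # Invert p_u = exp[m1(anchor - m2)] / (exp[m1(anchor - m2)] + P) for m2.
    m2 = anchor + np.log((1 - p_u) / (p_u * neg_sum)) / m1
    return m1, m2

def adaptive_contrastive_loss(sim, m1, m2):
    """Sketch of step 6: cross-entropy with a sharpened positive logit.

    sim: (N, N) image-text similarity matrix; sim[i, i] is the positive pair.
    """
    n = sim.shape[0]
    pos = np.exp(m1 * (np.diag(sim) - m2))            # exp[m1(s(I, T_i) - m2)]
    off = ~np.eye(n, dtype=bool)
    negs = np.exp(sim)[off].reshape(n, n - 1).sum(1)  # sum_j exp[s(I, T_j)], j != i
    p_hat = pos / (pos + negs)                        # p_hat_i(I)
    return -np.log(p_hat).mean()                      # E_I[H(y(I), p_hat(I))]
```

With the reported initial values m1 = 20 and m2 = 0.1, well-separated positives drive p̂_i toward 1 and the loss toward 0, while the per-batch re-tuning keeps the anchor's probability pinned near p̂_u.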