Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples
Authors: Yeyuan Wang, Dehong Gao, Lei Yi, Linbo Jin, Jinxia Zhang, Libin Yang, Xiaoyan Cai
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments validate the efficacy of NAS components and underscore its potential to enhance fine-grained vision-language comprehension. Experimental results on downstream fine-grained multi-modal tasks demonstrate NAS's superior performance, significantly outperforming existing VLP models. The paper also includes detailed tables of results on benchmarks like ARO, Winoground, and VALSE. |
| Researcher Affiliation | Collaboration | (1) School of Automation, Northwestern Polytechnical University, Xi'an, Shaanxi, China; (2) School of Cybersecurity, Northwestern Polytechnical University, Xi'an, Shaanxi, China; (3) Alibaba Group, Hangzhou, Zhejiang, China; (4) School of Automation, Southeast University, Nanjing, Jiangsu, China |
| Pseudocode | No | The paper describes the model architecture and the Negative Visual Augmentation Module with mathematical equations and descriptive text, accompanied by figures illustrating the framework. However, it does not include a dedicated section or block explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing the source code for the proposed NAS method, nor does it provide a direct link to a code repository. It mentions leveraging existing frameworks like ALBEF and BLIP but not providing its own implementation code. |
| Open Datasets | Yes | We use COCO (Lin et al. 2014), Visual Genome (VG) (Krishna et al. 2017), Conceptual Captions (CC) (Sharma et al. 2018), and SBU Captions (Ordonez, Kulkarni, and Berg 2011) as our pretraining datasets, which have a total of 4 million unique images and 5.1 million image-text pairs. |
| Dataset Splits | No | The paper uses standard public datasets (COCO, Visual Genome, Conceptual Captions, SBU Captions) for pretraining and specific benchmarks (ARO, Winoground, VALSE) for evaluation. While these datasets typically have predefined splits, the paper does not explicitly detail the training, validation, and test splits used in its own experimental setup; it reports only the total number of images and texts available for pretraining and for fine-tuning on a text-augmented COCO dataset. |
| Hardware Specification | Yes | All experiments are performed on 8 NVIDIA A800 GPUs. |
| Software Dependencies | No | The paper mentions using BERTbase to initialize the text encoder, DEiT-224/16 to initialize the image encoder, and the AdamW optimizer. However, it does not specify version numbers for these or any other software libraries (e.g., Python, PyTorch/TensorFlow, CUDA) used in the implementation. |
| Experiment Setup | Yes | Pretraining unfolds over 29 epochs in the first stage and a single epoch in the second stage, utilizing a batch size of 512. We adopt the AdamW optimizer with a weight decay of 0.02. In the first 1000 iterations, the learning rate is warmed up to 1e-4, and decayed to 1e-5 following a cosine schedule. Each image is randomly cropped to 256×256 resolution, and RandAugment (Cubuk et al. 2020) is adopted. During the fine-tuning stage, the resolution of an image is up-scaled to 384×384, and the positional encoding of the image patches is interpolated. The momentum parameter for updating the momentum model is 0.995, and the queue length of cached features for the ITC task is set to 65,536. We linearly ramp up the distillation weight α from 0 to 0.4 within the 1st epoch. |
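The learning-rate schedule and momentum-model update described in the experiment-setup row can be sketched as follows. This is a minimal illustration of the stated hyperparameters (1000-step linear warmup to 1e-4, cosine decay to 1e-5, EMA momentum 0.995), not the authors' released code; the function names and the `total_steps` parameter are assumptions for the sketch.

```python
import math

def lr_at_step(step, total_steps, warmup_steps=1000,
               peak_lr=1e-4, final_lr=1e-5):
    """Linear warmup to peak_lr over warmup_steps, then cosine
    decay from peak_lr down to final_lr by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

def ema_update(momentum_params, model_params, m=0.995):
    """Momentum-model (EMA) update: p_m <- m * p_m + (1 - m) * p,
    with m = 0.995 as reported in the paper."""
    return [m * pm + (1 - m) * p
            for pm, p in zip(momentum_params, model_params)]
```

For example, `lr_at_step(1000, 100_000)` returns the peak rate 1e-4, and the schedule reaches 1e-5 at the final step, matching the reported warmup/decay endpoints.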