Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples
Authors: Yeyuan Wang, Dehong Gao, Lei Yi, Linbo Jin, Jinxia Zhang, Libin Yang, Xiaoyan Cai
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments validate the efficacy of NAS components and underscore its potential to enhance fine-grained vision-language comprehension. Experimental results on downstream fine-grained multi-modal tasks demonstrate NAS's superior performance, significantly outperforming existing VLP models. The paper also includes detailed tables of results on benchmarks like ARO, Winoground, and VALSE. |
| Researcher Affiliation | Collaboration | (1) School of Automation, Northwestern Polytechnical University, Xi'an, Shaanxi, China; (2) School of Cybersecurity, Northwestern Polytechnical University, Xi'an, Shaanxi, China; (3) Alibaba Group, Hangzhou, Zhejiang, China; (4) School of Automation, Southeast University, Nanjing, Jiangsu, China |
| Pseudocode | No | The paper describes the model architecture and the Negative Visual Augmentation Module with mathematical equations and descriptive text, accompanied by figures illustrating the framework. However, it does not include a dedicated section or block explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing the source code for the proposed NAS method, nor does it provide a direct link to a code repository. It mentions leveraging existing frameworks like ALBEF and BLIP but not providing its own implementation code. |
| Open Datasets | Yes | We use COCO (Lin et al. 2014), Visual Genome (VG) (Krishna et al. 2017), Conceptual Captions (CC) (Sharma et al. 2018), and SBU Captions (Ordonez, Kulkarni, and Berg 2011) as our pretraining datasets, which have a total of 4 million unique images and 5.1 million image-text pairs. |
| Dataset Splits | No | The paper uses standard public datasets (COCO, Visual Genome, Conceptual Captions, SBU Captions) for pretraining and specific benchmarks (ARO, Winoground, VALSE) for evaluation. While these datasets typically have predefined splits, the paper does not explicitly detail the training, validation, and test splits used in its own experimental setup; it reports only the total number of images and texts available for pretraining and for fine-tuning on a text-augmented COCO dataset. |
| Hardware Specification | Yes | All experiments are performed on 8 NVIDIA A800 GPUs. |
| Software Dependencies | No | The paper mentions using BERTbase to initialize the text encoder, DEiT-224/16 to initialize the image encoder, and the AdamW optimizer. However, it does not specify version numbers for these or any other software libraries (e.g., Python, PyTorch/TensorFlow, CUDA) used in the implementation. |
| Experiment Setup | Yes | Pretraining unfolds over 29 epochs in the first stage and a single epoch in the second stage, utilizing a batch size of 512. We adopt the AdamW optimizer with a weight decay of 0.02. In the first 1000 iterations, the learning rate is warmed up to 1e-4, and decayed to 1e-5 following a cosine schedule. Each image is randomly cropped to 256×256 resolution, and RandAugment (Cubuk et al. 2020) is adopted. During the fine-tuning stage, the resolution of an image is up-scaled to 384×384, and the positional encoding of the image patches is interpolated. The momentum parameter for updating the momentum model is 0.995, and the queue length of cached features for the ITC task is set to 65,536. We linearly ramp up the distillation weight α from 0 to 0.4 within the 1st epoch. |
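The learning-rate schedule and momentum-model update described in the experiment-setup row can be sketched as follows. This is a minimal illustration of the stated hyperparameters (1000-step linear warmup to 1e-4, cosine decay to 1e-5, EMA momentum 0.995), not the authors' released code; the function names and the `total_steps` parameter are assumptions for the sketch.

```python
import math

def lr_at_step(step, total_steps, warmup_steps=1000,
               peak_lr=1e-4, final_lr=1e-5):
    """Linear warmup to peak_lr over warmup_steps, then cosine
    decay from peak_lr down to final_lr by total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

def ema_update(momentum_params, model_params, m=0.995):
    """Momentum-model (EMA) update: p_m <- m * p_m + (1 - m) * p,
    with m = 0.995 as reported in the paper."""
    return [m * pm + (1 - m) * p
            for pm, p in zip(momentum_params, model_params)]
```

For example, `lr_at_step(1000, 100_000)` returns the peak rate 1e-4, and the schedule reaches 1e-5 at the final step, matching the reported warmup/decay endpoints.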