Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment
Authors: Yang Liu, Mengyuan Liu, Shudong Huang, Jiancheng Lv
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods. We evaluate AVSE against existing fine-grained methods on diverse model backbones. AVSE outperforms the latest state-of-the-art methods on image-text retrieval on Flickr30K and MS-COCO, and is also significantly faster than local-level matching methods. To verify the effectiveness of each component of our AVSE, we conduct extensive ablation studies on the Flickr30K dataset. |
| Researcher Affiliation | Academia | Yang Liu (1,2), Mengyuan Liu (3)*, Shudong Huang (1,2), Jiancheng Lv (1,2). (1) College of Computer Science, Sichuan University, Chengdu, 610065, China; (2) Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Chengdu, China; (3) State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen, China. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology in detail using textual explanations and mathematical formulas, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/liuyyy111/AVSE |
| Open Datasets | Yes | Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods. Following previous works (Faghri et al. 2018), we use two widely used benchmark datasets MS-COCO (Lin et al. 2014) and Flickr30K (Plummer et al. 2015) for our experiment. |
| Dataset Splits | Yes | MS-COCO is a dataset that contains 123287 images, with five text captions annotated for each image. Following (Faghri et al. 2018), all data are split into training, validation, and test sets containing 113287, 5000, and 5000 images, respectively. Flickr30K is composed of 31783 images, and each image has 5 corresponding descriptions. We follow the split in (Faghri et al. 2018), using 29000 images for training, 1000 images for validation, and 1000 images for testing. |
| Hardware Specification | No | The paper mentions running experiments "on a single GPU" but does not provide specific details about the GPU model or any other hardware specifications. |
| Software Dependencies | No | The paper mentions using the "AdamW (Loshchilov and Hutter 2019) optimizer" but does not specify version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | For feature extraction, we use conventional region features and also the recently popular Vision Transformer backbones, e.g., ViT (Dosovitskiy et al. 2020) and Swin Transformer (Liu et al. 2021b). We set the dimension of the shared embedding space d1 to 512. For the asymmetric embedding optimal matching module, we set the dimension of blocks d2 to 256. For the objective function, we set λ1 to 1/d1 to balance the two terms of Lreg, and the margin parameter α is set to 0.2. We train the proposed model for 25 epochs with a mini-batch size of 128 using the AdamW (Loshchilov and Hutter 2019) optimizer. The learning rate is set to 0.0005 for the first 15 epochs and then decreased to 0.00005 for the remaining 10 epochs. |
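The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. This is an illustrative reconstruction, not the authors' code: the names `AVSEConfig` and `lr_at_epoch` are invented here, and the learning-rate decay point of epoch 15 is an assumption inferred from "25 epochs" minus "the rest 10 epochs".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AVSEConfig:
    # Hyperparameters as reported in the paper's experiment setup.
    embed_dim: int = 512        # d1: shared embedding space dimension
    block_dim: int = 256        # d2: block dimension in the asymmetric
                                #     embedding optimal matching module
    margin: float = 0.2         # alpha: margin parameter of the objective
    lambda1: float = 1 / 512    # 1/d1: balances the two terms of L_reg
    batch_size: int = 128
    epochs: int = 25
    lr_initial: float = 0.0005
    lr_final: float = 0.00005
    lr_decay_epoch: int = 15    # ASSUMED: 25 total epochs minus the final
                                # 10 epochs at the decayed learning rate

def lr_at_epoch(cfg: AVSEConfig, epoch: int) -> float:
    """Step schedule: initial rate, then a 10x drop for the last 10 epochs."""
    return cfg.lr_initial if epoch < cfg.lr_decay_epoch else cfg.lr_final
```

A training loop would query `lr_at_epoch(cfg, e)` once per epoch; the 10x step drop matches the quoted 0.0005 → 0.00005 schedule.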