Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment
Authors: Yang Liu, Mengyuan Liu, Shudong Huang, Jiancheng Lv
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods. We evaluate AVSE against existing fine-grained methods on diverse model backbones. AVSE outperforms the latest state-of-the-art methods on image-text retrieval on Flickr30K and MS-COCO, and is also significantly faster than local-level matching methods. To verify the effectiveness of each component of our AVSE, we conduct extensive ablation studies on the Flickr30K dataset. |
| Researcher Affiliation | Academia | Yang Liu (1,2), Mengyuan Liu (3)*, Shudong Huang (1,2), Jiancheng Lv (1,2). (1) College of Computer Science, Sichuan University, Chengdu, 610065, China; (2) Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Chengdu, China; (3) State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, Shenzhen, China. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology in detail using textual explanations and mathematical formulas, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/liuyyy111/AVSE |
| Open Datasets | Yes | Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods. Following previous works (Faghri et al. 2018), we use two widely used benchmark datasets MS-COCO (Lin et al. 2014) and Flickr30K (Plummer et al. 2015) for our experiment. |
| Dataset Splits | Yes | MS-COCO is a dataset that contains 123287 images, with five text captions annotated for each image. Following (Faghri et al. 2018), all data are split into training, validation, and test sets containing 113287, 5000, and 5000 images, respectively. Flickr30K is composed of 31783 images, and each image has 5 corresponding descriptions. We follow the split in (Faghri et al. 2018), using 29000 images for training, 1000 images for validation, and 1000 images for testing. |
| Hardware Specification | No | The paper mentions running experiments "on a single GPU" but does not provide specific details about the GPU model or any other hardware specifications. |
| Software Dependencies | No | The paper mentions using the "AdamW (Loshchilov and Hutter 2019) optimizer" but does not specify version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | For feature extraction, we use conventional region features and also the recently popular Vision Transformer backbones, e.g., ViT (Dosovitskiy et al. 2020) and Swin Transformer (Liu et al. 2021b). We set the dimension of the shared embedding space d1 to 512. For the asymmetric embedding optimal matching module, we set the dimension of blocks d2 to 256. For the objective function, we set λ1 to 1/d1 to balance the two terms of Lreg, and the margin parameter α is set to 0.2. We train the proposed model for 25 epochs with a mini-batch size of 128 using the AdamW (Loshchilov and Hutter 2019) optimizer. The learning rate is set to 0.0005 for the first 15 epochs and then decreased to 0.00005 for the remaining 10 epochs. |
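The hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. This is an illustrative reconstruction, not the authors' code: the names `AVSEConfig` and `lr_at_epoch` are invented here, and the learning-rate decay point of epoch 15 is an assumption inferred from "25 epochs" minus "the rest 10 epochs".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AVSEConfig:
    # Hyperparameters as reported in the paper's experiment setup.
    embed_dim: int = 512        # d1: shared embedding space dimension
    block_dim: int = 256        # d2: block dimension in the asymmetric
                                #     embedding optimal matching module
    margin: float = 0.2         # alpha: margin parameter of the objective
    lambda1: float = 1 / 512    # 1/d1: balances the two terms of L_reg
    batch_size: int = 128
    epochs: int = 25
    lr_initial: float = 0.0005
    lr_final: float = 0.00005
    lr_decay_epoch: int = 15    # ASSUMED: 25 total epochs minus the final
                                # 10 epochs at the decayed learning rate

def lr_at_epoch(cfg: AVSEConfig, epoch: int) -> float:
    """Step schedule: initial rate, then a 10x drop for the last 10 epochs."""
    return cfg.lr_initial if epoch < cfg.lr_decay_epoch else cfg.lr_final
```

A training loop would query `lr_at_epoch(cfg, e)` once per epoch; the 10x step drop matches the quoted 0.0005 → 0.00005 schedule.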