Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval

Authors: Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang

AAAI 2025

Reproducibility Variable — Result — LLM Response
Research Type: Experimental. Evidence: "Extensive experiments on two image-text datasets and one video-text dataset demonstrate the effectiveness of our model for cross-lingual cross-modal retrieval, as well as its good compatibility with various VLP models." From the Experimental Settings (Datasets): "Evaluations are performed on two image-text retrieval datasets (Multi30K (Elliott et al. 2016) and MSCOCO (Chen et al. 2015)) and a video-text retrieval dataset (MSRVTT (Xu et al. 2016)), referred to as downstream task datasets (DTD) in this paper."
Researcher Affiliation: Academia. Evidence: "Rui Cai 1,2, Zhiyu Dong 1,2, Jianfeng Dong 1,2*, Xun Wang 1,2 — 1 the College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; 2 Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, China. EMAIL, EMAIL, EMAIL"
Pseudocode: No. Evidence: The paper describes the proposed method using textual explanations and mathematical equations, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. Evidence: Code: https://github.com/HuiGuanLab/DASD
Open Datasets: Yes. Evidence: "Evaluations are performed on two image-text retrieval datasets (Multi30K (Elliott et al. 2016) and MSCOCO (Chen et al. 2015)) and a video-text retrieval dataset (MSRVTT (Xu et al. 2016)), referred to as downstream task datasets (DTD) in this paper. Besides, the web-scraped image-caption dataset CC3M (Sharma et al. 2018) with machine-translated captions is also used for training, from which 300k image-caption pairs are randomly selected and known as CC300K (Zhang, Hu, and Jin 2022)."
Dataset Splits: Yes. Evidence: "Under the Cross-lingual Finetune setting, image-caption pairs for target languages are obtained in two separate ways: (1) we directly leverage the target-language data in CC300K (following MLA (Zhang, Hu, and Jin 2022)); (2) English captions in Multi30K and MSCOCO are converted into target languages using Google Translate (following CL2CM (Wang et al. 2024a)). Finally, models are tested on DTD target-language datasets." ... "For cross-lingual video-text retrieval, experiments are conducted on MSRVTT (Xu et al. 2016) under the same settings as cross-lingual image-text retrieval, where the model searches for the most semantically relevant videos given a text query in a low-resource language."
Hardware Specification: No. Evidence: The paper mentions challenges related to
Software Dependencies: No. Evidence: The paper does not explicitly mention specific software dependencies or their version numbers used for the experiments.
Experiment Setup: No. Evidence: "Our model is trained by minimizing the combination of the above losses. Finally, the total loss function is defined as: L = L_CL + L_CM + λ1 L_adv + λ2 L_sc (16), where λ1 and λ2 are hyper-parameters to balance the importance of the disentangling losses." ... "where B is the batch size, sim(·) denotes the similarity function (i.e., cosine similarity) and τ is the temperature coefficient." ... "For both settings, we simply adopt the same hyperparameter values and training strategy used for cross-lingual image-text retrieval."
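The quoted total loss (Eq. 16) sums the two retrieval losses with two weighted disentangling terms, and the contrastive term is described as a batch-wise loss over temperature-scaled cosine similarities. A minimal NumPy sketch of that structure, assuming matched text/image pairs sit on the diagonal of the batch similarity matrix; the function names (`info_nce`, `total_loss`) and the λ values are illustrative placeholders, not taken from the paper:

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def info_nce(text_emb, img_emb, tau=0.07):
    # Symmetric contrastive (InfoNCE-style) loss over a batch of B pairs:
    # sim(.)/tau gives the logits; matched pairs lie on the diagonal.
    logits = cosine_sim(text_emb, img_emb) / tau
    labels = np.arange(len(logits))

    def xent(l):
        # Numerically stable log-softmax, then pick the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # Average the text->image and image->text retrieval directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def total_loss(l_cl, l_cm, l_adv, l_sc, lam1=0.1, lam2=0.1):
    # Eq. 16: L = L_CL + L_CM + lam1 * L_adv + lam2 * L_sc
    return l_cl + l_cm + lam1 * l_adv + lam2 * l_sc
```

The symmetric form averages the two retrieval directions, matching the bidirectional retrieval evaluation; the excerpt does not report the actual λ1, λ2, or τ values, so the defaults above are arbitrary.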