Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval
Authors: Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on two image-text datasets and one video-text dataset demonstrate the effectiveness of our model for cross-lingual cross-modal retrieval, as well as its good compatibility with various VLP models. ... Evaluations are performed on two image-text retrieval datasets (Multi30K (Elliott et al. 2016) and MSCOCO (Chen et al. 2015)) and a video-text retrieval dataset (MSRVTT (Xu et al. 2016)), referred to as downstream task datasets (DTD) in this paper. |
| Researcher Affiliation | Academia | Rui Cai1,2, Zhiyu Dong1,2, Jianfeng Dong1,2*, Xun Wang1,2 1 the College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China 2 Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, China EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed method using textual explanations and mathematical equations, but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/HuiGuanLab/DASD |
| Open Datasets | Yes | Evaluations are performed on two image-text retrieval datasets (Multi30K (Elliott et al. 2016) and MSCOCO (Chen et al. 2015)) and a video-text retrieval dataset (MSRVTT (Xu et al. 2016)), referred to as downstream task datasets (DTD) in this paper. Besides, the web-scraped image-caption dataset CC3M (Sharma et al. 2018) with machine-translated captions is also used for training, from which 300k image-caption pairs are randomly selected and known as CC300K (Zhang, Hu, and Jin 2022). |
| Dataset Splits | Yes | Under the Cross-lingual Finetune setting, image-caption pairs for target languages are obtained in two separate ways: (1) we directly leverage the target-language data in CC300K (following MLA (Zhang, Hu, and Jin 2022)). (2) English captions in Multi30K and MSCOCO are converted into target languages utilizing Google Translate (following CL2CM (Wang et al. 2024a)). Finally, models are tested on DTD target-language datasets. ... For cross-lingual video-text retrieval, experiments are conducted on MSRVTT (Xu et al. 2016) under the same settings with cross-lingual image-text retrieval, where the model searches for the most semantically relevant videos given a text query in a low-resource language. |
| Hardware Specification | No | The paper mentions challenges related to |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies or their version numbers used for the experiments. |
| Experiment Setup | No | Our model is trained by minimizing the combination of the above losses. Finally, the total loss function is defined as: $\mathcal{L} = \mathcal{L}_{CL} + \mathcal{L}_{CM} + \lambda_1 \mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{sc}$ (Eq. 16), where $\lambda_1$ and $\lambda_2$ are hyper-parameters to balance the importance of disentangling losses. ... where $B$ is the batch size, $\mathrm{sim}(\cdot)$ denotes the similarity function (i.e., cosine similarity) and $\tau$ is the temperature coefficient. ... for both settings, we simply adopt the same hyperparameter values and training strategy used for the cross-lingual image-text retrieval. |
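The quoted setup combines four losses via Eq. (16) and uses cosine similarity with a temperature coefficient in the contrastive term. A minimal sketch of that combination is below; the function names and the default values for `lam1` and `lam2` are hypothetical (the paper's actual hyper-parameter values are not given in the quoted excerpt):

```python
import math

def cosine_sim(u, v):
    # sim(.) from the quoted setup: cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def total_loss(l_cl, l_cm, l_adv, l_sc, lam1=0.1, lam2=0.1):
    # Eq. (16): L = L_CL + L_CM + lambda1 * L_adv + lambda2 * L_sc
    # lam1/lam2 defaults here are placeholders, not the paper's values.
    return l_cl + l_cm + lam1 * l_adv + lam2 * l_sc

# Example: with all component losses equal to 1.0 and both weights 0.5,
# the total is 1.0 + 1.0 + 0.5 + 0.5 = 3.0.
print(total_loss(1.0, 1.0, 1.0, 1.0, lam1=0.5, lam2=0.5))
```

This mirrors the structure of Eq. (16) only; the individual loss terms ($\mathcal{L}_{CL}$, $\mathcal{L}_{CM}$, $\mathcal{L}_{adv}$, $\mathcal{L}_{sc}$) are model-specific and not reconstructible from the excerpt.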