Dynamic Adapter with Semantics Disentangling for Cross-lingual Cross-modal Retrieval

Authors: Rui Cai, Zhiyu Dong, Jianfeng Dong, Xun Wang

AAAI 2025

Reproducibility Variable — Result — LLM Response
Research Type: Experimental. Evidence: "Extensive experiments on two image-text datasets and one video-text dataset demonstrate the effectiveness of our model for cross-lingual cross-modal retrieval, as well as its good compatibility with various VLP models." From the Experimental Settings (Datasets): "Evaluations are performed on two image-text retrieval datasets (Multi30K (Elliott et al. 2016) and MSCOCO (Chen et al. 2015)) and a video-text retrieval dataset (MSRVTT (Xu et al. 2016)), referred to as downstream task datasets (DTD) in this paper."
Researcher Affiliation: Academia. Evidence: "Rui Cai 1,2, Zhiyu Dong 1,2, Jianfeng Dong 1,2*, Xun Wang 1,2 — 1 the College of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; 2 Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, China. EMAIL, EMAIL, EMAIL"
Pseudocode: No. Evidence: The paper describes the proposed method using textual explanations and mathematical equations, but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. Evidence: Code: https://github.com/HuiGuanLab/DASD
Open Datasets: Yes. Evidence: "Evaluations are performed on two image-text retrieval datasets (Multi30K (Elliott et al. 2016) and MSCOCO (Chen et al. 2015)) and a video-text retrieval dataset (MSRVTT (Xu et al. 2016)), referred to as downstream task datasets (DTD) in this paper. Besides, the web-scraped image-caption dataset CC3M (Sharma et al. 2018) with machine-translated captions is also used for training, from which 300k image-caption pairs are randomly selected and known as CC300K (Zhang, Hu, and Jin 2022)."
Dataset Splits: Yes. Evidence: "Under the Cross-lingual Finetune setting, image-caption pairs for target languages are obtained in two separate ways: (1) we directly leverage the target-language data in CC300K (following MLA (Zhang, Hu, and Jin 2022)); (2) English captions in Multi30K and MSCOCO are converted into target languages using Google Translate (following CL2CM (Wang et al. 2024a)). Finally, models are tested on DTD target-language datasets." ... "For cross-lingual video-text retrieval, experiments are conducted on MSRVTT (Xu et al. 2016) under the same settings as cross-lingual image-text retrieval, where the model searches for the most semantically relevant videos given a text query in a low-resource language."
Hardware Specification: No. Evidence: The paper mentions challenges related to
Software Dependencies: No. Evidence: The paper does not explicitly mention specific software dependencies or their version numbers used for the experiments.
Experiment Setup: No. Evidence: "Our model is trained by minimizing the combination of the above losses. Finally, the total loss function is defined as: L = L_CL + L_CM + λ1 L_adv + λ2 L_sc (16), where λ1 and λ2 are hyper-parameters to balance the importance of the disentangling losses." ... "where B is the batch size, sim(·) denotes the similarity function (i.e., cosine similarity) and τ is the temperature coefficient." ... "For both settings, we simply adopt the same hyperparameter values and training strategy used for cross-lingual image-text retrieval."
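The quoted total loss (Eq. 16) sums the two retrieval losses with two weighted disentangling terms, and the contrastive term is described as a batch-wise loss over temperature-scaled cosine similarities. A minimal NumPy sketch of that structure, assuming matched text/image pairs sit on the diagonal of the batch similarity matrix; the function names (`info_nce`, `total_loss`) and the λ values are illustrative placeholders, not taken from the paper:

```python
import numpy as np

def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def info_nce(text_emb, img_emb, tau=0.07):
    # Symmetric contrastive (InfoNCE-style) loss over a batch of B pairs:
    # sim(.)/tau gives the logits; matched pairs lie on the diagonal.
    logits = cosine_sim(text_emb, img_emb) / tau
    labels = np.arange(len(logits))

    def xent(l):
        # Numerically stable log-softmax, then pick the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # Average the text->image and image->text retrieval directions.
    return 0.5 * (xent(logits) + xent(logits.T))

def total_loss(l_cl, l_cm, l_adv, l_sc, lam1=0.1, lam2=0.1):
    # Eq. 16: L = L_CL + L_CM + lam1 * L_adv + lam2 * L_sc
    return l_cl + l_cm + lam1 * l_adv + lam2 * l_sc
```

The symmetric form averages the two retrieval directions, matching the bidirectional retrieval evaluation; the excerpt does not report the actual λ1, λ2, or τ values, so the defaults above are arbitrary.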