DiCA: Disambiguated Contrastive Alignment for Cross-Modal Retrieval with Partial Labels

Authors: Chao Su, Huiming Zheng, Dezhong Peng, Xu Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on four benchmarks validate the effectiveness of our proposed method, which demonstrates enhanced performance over existing state-of-the-art methods.
Researcher Affiliation | Collaboration | (1) The College of Computer Science, Sichuan University, Chengdu, China; (2) Sichuan National Innovation New Vision UHD Video Technology Co., Ltd., Chengdu, China. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not include a clearly labeled pseudocode block or algorithm section.
Open Source Code | Yes | Code: https://github.com/Rose-bud/DiCA
Open Datasets | Yes | To evaluate the effectiveness of our method, we conduct extensive comparison experiments on four cross-modal retrieval benchmark datasets. These datasets are introduced as follows: 1) Wikipedia contains 2,866 image-text pairs... 2) INRIA-Websearch consists of 71,478 images and 71,478 text descriptions... 3) NUS-WIDE consists of about 270,000 images... 4) XMediaNet is a large-scale multimodal dataset...
Dataset Splits | Yes | 1) Wikipedia contains 2,866 image-text pairs that belong to 10 classes. Following the previous work (Feng, Wang, and Li 2014), we divide the dataset into 3 subsets: 2,173, 231, and 462 pairs for training, validation, and testing sets, respectively. 2) INRIA-Websearch... divide the dataset into three subsets: 9,000, 1,332, and 4,366 image-text pairs for training, validation, and testing sets, respectively. 3) NUS-WIDE... split the dataset into three subsets, i.e., 42,941; 5,000; and 23,661 image-text pairs for training, validation, and testing sets, respectively. 4) XMediaNet... divide them into 32,000, 4,000, and 4,000 pairs for training, validation, and testing sets, respectively.
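For reference, the reported split sizes can be collected into one table for scripting. A minimal sketch, using only the counts quoted above; the `SPLITS` and `total_pairs` names are illustrative, not from the paper or its code:

```python
# Train/validation/test split sizes (image-text pairs) as reported in the row above.
SPLITS = {
    "Wikipedia":       {"train": 2_173,  "val": 231,   "test": 462},
    "INRIA-Websearch": {"train": 9_000,  "val": 1_332, "test": 4_366},
    "NUS-WIDE":        {"train": 42_941, "val": 5_000, "test": 23_661},
    "XMediaNet":       {"train": 32_000, "val": 4_000, "test": 4_000},
}

def total_pairs(dataset: str) -> int:
    """Sum of the three subsets for one dataset."""
    return sum(SPLITS[dataset].values())

# Sanity check: Wikipedia's three subsets cover all 2,866 pairs.
# (The other datasets use a subset of the full collection, so no such check applies.)
assert total_pairs("Wikipedia") == 2_866
```

Note that only Wikipedia is split exhaustively; for the remaining benchmarks the subsets are sampled from a larger pool, so their totals are smaller than the full dataset sizes quoted in the Open Datasets row.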
Hardware Specification | Yes | Our DiCA is implemented on the PyTorch framework, and all experiments are conducted on four Nvidia GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper mentions the 'PyTorch' framework, the 'Adam' optimizer, 'VGG-19', 'Doc2Vec', 'AlexNet', and 'LDA' as models or tools used, but specific version numbers for these software components are not provided.
Experiment Setup | Yes | In this work, we adopt the Adam (Kingma and Ba 2014) optimizer with a learning rate of 0.0001 to update the parameters. For all datasets, we set the maximum number of training epochs to 100. The training batch size is set to 32 for the Wikipedia dataset, and to 512 for the other datasets. Furthermore, to maintain consistency, the batch size during validation and testing is uniformly set to 256.
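The hyperparameters reported in the row above can be summarized in a small configuration sketch. This is only an illustration of the reported values, not the authors' code; the constant and function names are assumptions:

```python
# Reported training hyperparameters (see the Experiment Setup row above).
LEARNING_RATE = 1e-4    # Adam optimizer (Kingma and Ba 2014)
MAX_EPOCHS = 100        # maximum training epochs, all datasets
EVAL_BATCH_SIZE = 256   # uniform batch size for validation and testing

def train_batch_size(dataset: str) -> int:
    """Training batch size: 32 for Wikipedia, 512 for the other benchmarks."""
    return 32 if dataset == "Wikipedia" else 512
```

In a PyTorch reproduction, these values would feed `torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)` and the `batch_size` argument of the train and eval `DataLoader`s.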