DiCA: Disambiguated Contrastive Alignment for Cross-Modal Retrieval with Partial Labels
Authors: Chao Su, Huiming Zheng, Dezhong Peng, Xu Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on four benchmarks validate the effectiveness of our proposed method, which demonstrates enhanced performance over existing state-of-the-art methods. |
| Researcher Affiliation | Collaboration | 1 The College of Computer Science, Sichuan University, Chengdu, China; 2 Sichuan National Innovation New Vision UHD Video Technology Co., Ltd., Chengdu, China. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not include a clearly labeled pseudocode block or algorithm section. |
| Open Source Code | Yes | Code: https://github.com/Rose-bud/DiCA |
| Open Datasets | Yes | To evaluate the effectiveness of our method, we conduct extensive comparison experiments on four cross-modal retrieval benchmark datasets. These datasets are introduced as follows: 1) Wikipedia contains 2,866 image-text pairs... 2) INRIA-Websearch consists of 71,478 images and 71,478 text descriptions... 3) NUS-WIDE consists of about 270,000 images... 4) XMediaNet is a large-scale multimodal dataset... |
| Dataset Splits | Yes | 1) Wikipedia contains 2,866 image-text pairs that belong to 10 classes. Following the previous work (Feng, Wang, and Li 2014), we divide the dataset into 3 subsets: 2,173, 231, and 462 pairs for the training, validation, and testing sets, respectively. 2) INRIA-Websearch... divide the dataset into three subsets: 9,000, 1,332, and 4,366 image-text pairs for the training, validation, and testing sets, respectively. 3) NUS-WIDE... split the dataset into three subsets, i.e., 42,941; 5,000; and 23,661 image-text pairs for the training, validation, and testing sets, respectively. 4) XMediaNet... divide them into 32,000, 4,000, and 4,000 pairs for the training, validation, and testing sets, respectively. |
| Hardware Specification | Yes | Our DiCA is implemented on the PyTorch framework, and all experiments are conducted on four NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch', the 'Adam' optimizer, 'VGG-19', 'Doc2Vec', 'AlexNet', and 'LDA' as models or tools used, but specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | In this work, we adopt the Adam (Kingma and Ba 2014) optimizer with a learning rate of 0.0001 to update the parameters. For all datasets, we set the maximum number of training epochs to 100. The training batch size is set to 32 for the Wikipedia dataset, and to 512 for the other datasets. Furthermore, to maintain consistency, the batch size during validation and testing is uniformly set to 256. |
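The splits and hyperparameters quoted in the table above can be collected into a minimal sketch for reproduction purposes. The constant and function names (`SPLITS`, `train_batch_size`, `total_pairs`) are illustrative, not taken from the authors' code; only the numeric values come from the paper.

```python
# Sketch of the reported experimental setup, assuming the values quoted
# in the Dataset Splits and Experiment Setup rows above.

# Adam optimizer (Kingma and Ba 2014), up to 100 epochs on every dataset.
LEARNING_RATE = 1e-4
MAX_EPOCHS = 100
EVAL_BATCH_SIZE = 256  # uniform batch size for validation and testing

# (train, validation, test) image-text pair counts per benchmark.
SPLITS = {
    "Wikipedia":       (2_173, 231, 462),
    "INRIA-Websearch": (9_000, 1_332, 4_366),
    "NUS-WIDE":        (42_941, 5_000, 23_661),
    "XMediaNet":       (32_000, 4_000, 4_000),
}

def train_batch_size(dataset: str) -> int:
    """Training batch size: 32 for Wikipedia, 512 for the other datasets."""
    return 32 if dataset == "Wikipedia" else 512

def total_pairs(dataset: str) -> int:
    """Total image-text pairs across the three subsets of a benchmark."""
    return sum(SPLITS[dataset])
```

As a sanity check, the Wikipedia subsets sum to the full 2,866 pairs reported for that dataset, while for INRIA-Websearch and NUS-WIDE the splits cover only a subset of the raw collections described in the Open Datasets row.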