Heterogeneous Prompt-Guided Entity Inferring and Distilling for Scene-Text Aware Cross-Modal Retrieval

Authors: Zhiqian Zhao, Liang Li, Jiehua Zhang, Yaoqi Sun, Xichun Sheng, Haibing Yin, Shaowei Jiang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the proposed method significantly outperforms existing approaches on two public cross-modal retrieval benchmarks. We conduct extensive experiments on two public benchmarks and achieve a new state-of-the-art performance.
Researcher Affiliation | Academia | 1Hangzhou Dianzi University, Hangzhou, China; 2Institute of Computing Technology, Chinese Academy of Sciences; 3School of Software Engineering, Xi'an Jiaotong University; 4Macao Polytechnic University, Macao, China; 5Lishui Institute of Hangzhou Dianzi University
Pseudocode | No | The paper describes its methods in text and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Demo https://my-hopid.github.io. This link provides a demonstration page, not a repository for the source code of the method described in the paper.
Open Datasets | Yes | We conduct the experiments on two cross-modal retrieval datasets: the COCO-Text Captioned (CTC) dataset (Mafla et al. 2021) and the TextCaps dataset (Sidorov et al. 2020).
Dataset Splits | Yes | CTC contains two test sets, CTC-1K and CTC-5K. For fair comparisons, we strictly follow its previous split. On TextCaps, following the previous SOTA method (Miyawaki et al. 2022), we use 21,953 images for training and 3,166 images for testing.
Hardware Specification | Yes | Furthermore, the batch size is set to 300 and the model is trained and evaluated on one RTX 4090 GPU for 30 epochs.
Software Dependencies | No | The paper mentions using PaddleOCR, a pre-trained BERT, and the frozen visual encoder of CLIP, but does not specify their versions or other software library versions. It also mentions the Adam optimizer without a specific version.
Experiment Setup | Yes | We set the number of iterations of PED to t = 2 and the dimension of each slot and OCR feature to D = 2048. Following Mafla et al. (2021), we set the maximum number of OCR tokens to N = 20 and the maximum number of objects to M = 36. Furthermore, the batch size is set to 300 and the model is trained and evaluated on one RTX 4090 GPU for 30 epochs. The Adam optimizer is used with β1 = 0.9, β2 = 0.999, ϵ = 1e-9, and a learning rate of 2e-4.
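For anyone attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. The sketch below is illustrative only: the field names (`ped_iterations`, `slot_dim`, etc.) are our own, not identifiers from the authors' (unreleased) code; the values are the ones reported in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the paper's experiment setup.

    Field names are hypothetical; only the values come from the paper.
    """
    ped_iterations: int = 2          # PED iterations, t = 2
    slot_dim: int = 2048             # slot / OCR feature dimension, D = 2048
    max_ocr_tokens: int = 20         # N = 20, following Mafla et al. (2021)
    max_objects: int = 36            # M = 36
    batch_size: int = 300
    epochs: int = 30
    lr: float = 2e-4                 # Adam learning rate
    adam_betas: tuple = (0.9, 0.999)
    adam_eps: float = 1e-9

cfg = TrainConfig()
# In a PyTorch-style reproduction, the optimizer would then be built as:
#   torch.optim.Adam(model.parameters(), lr=cfg.lr,
#                    betas=cfg.adam_betas, eps=cfg.adam_eps)
```

A frozen dataclass keeps the reported values immutable, which makes it harder for a reproduction run to silently drift from the published setup.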