Heterogeneous Prompt-Guided Entity Inferring and Distilling for Scene-Text Aware Cross-Modal Retrieval

Authors: Zhiqian Zhao, Liang Li, Jiehua Zhang, Yaoqi Sun, Xichun Sheng, Haibing Yin, Shaowei Jiang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the proposed method significantly outperforms existing approaches on two public cross-modal retrieval benchmarks. We conduct extensive experiments on two public benchmarks and achieve a new state-of-the-art performance.
Researcher Affiliation | Academia | 1Hangzhou Dianzi University, Hangzhou, China; 2Institute of Computing Technology, Chinese Academy of Sciences; 3School of Software Engineering, Xi'an Jiaotong University; 4Macao Polytechnic University, Macao, China; 5Lishui Institute of Hangzhou Dianzi University
Pseudocode | No | The paper describes its methods in text and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Demo https://my-hopid.github.io. This link provides a demonstration page, not a repository for the source code of the method described in the paper.
Open Datasets | Yes | We conduct the experiments on two cross-modal retrieval datasets: the COCO-Text Captioned (CTC) dataset (Mafla et al. 2021) and the TextCaps dataset (Sidorov et al. 2020).
Dataset Splits | Yes | CTC contains two test sets, CTC-1K and CTC-5K. For fair comparisons, we strictly follow its previous split. On TextCaps, following the previous SOTA method (Miyawaki et al. 2022), we use 21,953 images for training and 3,166 images for testing.
Hardware Specification | Yes | Furthermore, the batch size is set to 300 and the model is trained and evaluated on one RTX 4090 GPU for 30 epochs.
Software Dependencies | No | The paper mentions using PaddleOCR, a pre-trained BERT, and the frozen visual encoder of CLIP, but does not specify their versions or other software library versions. It also mentions the Adam optimizer without a specific version.
Experiment Setup | Yes | We set the number of iterations of PED to t = 2 and the dimension of each slot and OCR feature to D = 2048. Following Mafla et al. (2021), we set the maximum number of OCR tokens to N = 20 and the maximum number of objects to M = 36. Furthermore, the batch size is set to 300 and the model is trained and evaluated on one RTX 4090 GPU for 30 epochs. The Adam optimizer is used with β1 = 0.9, β2 = 0.999, ϵ = 1e-9, and a learning rate of 2e-4.
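For anyone attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. The sketch below is illustrative only: the field names (`ped_iterations`, `slot_dim`, etc.) are our own, not identifiers from the authors' (unreleased) code; the values are the ones reported in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the paper's experiment setup.

    Field names are hypothetical; only the values come from the paper.
    """
    ped_iterations: int = 2          # PED iterations, t = 2
    slot_dim: int = 2048             # slot / OCR feature dimension, D = 2048
    max_ocr_tokens: int = 20         # N = 20, following Mafla et al. (2021)
    max_objects: int = 36            # M = 36
    batch_size: int = 300
    epochs: int = 30
    lr: float = 2e-4                 # Adam learning rate
    adam_betas: tuple = (0.9, 0.999)
    adam_eps: float = 1e-9

cfg = TrainConfig()
# In a PyTorch-style reproduction, the optimizer would then be built as:
#   torch.optim.Adam(model.parameters(), lr=cfg.lr,
#                    betas=cfg.adam_betas, eps=cfg.adam_eps)
```

A frozen dataclass keeps the reported values immutable, which makes it harder for a reproduction run to silently drift from the published setup.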