Vision-guided Text Mining for Unsupervised Cross-modal Hashing with Community Similarity Quantization

Authors: Haozhi Fan, Yuan Cao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The experimental results on two common datasets verify the accuracy improvements in comparison with state-of-the-art baselines."
Researcher Affiliation | Academia | Haozhi Fan (1), Yuan Cao* (2). 1: School of Engineering and Applied Science, University of Pennsylvania, USA. 2: School of Computer Science and Technology, Ocean University of China, China. EMAIL, EMAIL
Pseudocode | No | The paper describes its methodology in paragraph form, such as in the "Vision-guided Text Mining" and "Image Feature Extraction" sections, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/louisfanhz/VTMUCH
Open Datasets | Yes | MIRFLICKR-25K (Huiskes and Lew 2008) and NUS-WIDE (Chua et al. 2009)
Dataset Splits | Yes | "For MIRFLICKR-25K, we randomly sample 2,000 pairs of images and captions as the query set, and sample 5,000 pairs from the remaining data as the training set. All data not in the query set is used as retrieval set. ... For NUS-WIDE, out of which 2,000 pairs are randomly sampled as the query set, 5,000 pairs are sampled from the remaining data as the training set. All data not in the query set is used as retrieval set."
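The split protocol quoted above (query pairs sampled first, training pairs drawn from the remainder, and everything outside the query set kept for retrieval) can be sketched in stdlib-only Python. `pairs` is a placeholder for the full list of image–caption pairs; the function name and seed are illustrative, not from the paper:

```python
import random

def make_splits(pairs, n_query=2000, n_train=5000, seed=0):
    """Randomly split image-caption pairs into query, training,
    and retrieval sets, following the protocol quoted above."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    query = shuffled[:n_query]         # 2,000 randomly sampled query pairs
    rest = shuffled[n_query:]          # everything not in the query set
    train = rng.sample(rest, n_train)  # 5,000 training pairs from the remainder
    retrieval = rest                   # all data not in the query set
    return query, train, retrieval

# Example with integer indices standing in for the 25,000 MIRFLICKR-25K pairs:
query, train, retrieval = make_splits(range(25000))
```

Note that under this protocol the training set is a subset of the retrieval set, since only the query pairs are held out.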
Hardware Specification | No | The paper does not provide specific hardware details such as the GPU or CPU models used for its experiments; it only describes the general experimental setup and parameters.
Software Dependencies | No | The paper mentions using a "pre-trained CLIP model", "Faster R-CNN", and the "Leiden algorithm", but does not specify software versions for any libraries, frameworks, or programming languages used.
Experiment Setup | Yes | "For the training task, the input image is resized to 224 × 224, and the input text is split into individual words, each as a separate input. We use the pre-trained CLIP model as the backbone during initialization. While fine-tuning the CLIP backbone, we only fine-tune the text encoder, but freeze the parameters of the image encoder. The output of the last layer is a 512-dimensional vector for both the image encoder and text encoder. Fully connected layers followed by a tanh activation are used to learn hashing functions. We provide the parameter configuration used for the two datasets here. For MIRFLICKR-25K, Nc = 14; for NUS-WIDE, Nc = 22. All the other parameters are the same across the two datasets: τ = 0.3, ρ = 0.1, Nw = 64, α = 1."
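A minimal stdlib-only sketch of the hashing head described in the setup: a fully connected layer followed by tanh, mapping the 512-dimensional encoder output to continuous codes in (-1, 1), which are binarized by sign. The random weight initialization and the 16-bit code length are illustrative assumptions, not values from the paper:

```python
import math
import random

def hash_head(feature, weights, bias):
    """Fully connected layer followed by tanh: maps an encoder feature
    to continuous hash codes in (-1, 1); sign() yields binary bits."""
    codes = []
    for w_row, b in zip(weights, bias):
        pre_activation = sum(w * x for w, x in zip(w_row, feature)) + b
        codes.append(math.tanh(pre_activation))
    return codes

# Illustrative dimensions: 512-d CLIP feature -> 16-bit hash code.
rng = random.Random(0)
dim, hash_bits = 512, 16
feature = [rng.gauss(0, 1) for _ in range(dim)]          # stand-in encoder output
weights = [[rng.gauss(0, 0.02) for _ in range(dim)]      # illustrative init
           for _ in range(hash_bits)]
bias = [0.0] * hash_bits
codes = hash_head(feature, weights, bias)                # continuous codes
bits = [1 if c >= 0 else -1 for c in codes]              # binarized hash code
```

In practice this head would be a trainable `Linear` layer plus `tanh` in a deep-learning framework; the pure-Python version above only illustrates the forward computation.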