Vision-guided Text Mining for Unsupervised Cross-modal Hashing with Community Similarity Quantization

Authors: Haozhi Fan, Yuan Cao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The experimental results on two common datasets verify the accuracy improvements in comparison with state-of-the-art baselines."
Researcher Affiliation | Academia | Haozhi Fan (1), Yuan Cao* (2). 1: School of Engineering and Applied Science, University of Pennsylvania, USA. 2: School of Computer Science and Technology, Ocean University of China, China. EMAIL, EMAIL
Pseudocode | No | The paper describes its methodology in paragraph form, such as in the "Vision-guided Text Mining" and "Image Feature Extraction" sections, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/louisfanhz/VTMUCH
Open Datasets | Yes | MIRFLICKR-25K (Huiskes and Lew 2008) and NUS-WIDE (Chua et al. 2009)
Dataset Splits | Yes | "For MIRFLICKR-25K, we randomly sample 2,000 pairs of images and captions as the query set, and sample 5,000 pairs from the remaining data as the training set. All data not in the query set is used as retrieval set. ... For NUS-WIDE, out of which 2,000 pairs are randomly sampled as the query set, 5,000 pairs are sampled from the remaining data as the training set. All data not in the query set is used as retrieval set."
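The split protocol quoted above (query pairs sampled first, training pairs drawn from the remainder, and everything outside the query set kept for retrieval) can be sketched in stdlib-only Python. `pairs` is a placeholder for the full list of image–caption pairs; the function name and seed are illustrative, not from the paper:

```python
import random

def make_splits(pairs, n_query=2000, n_train=5000, seed=0):
    """Randomly split image-caption pairs into query, training,
    and retrieval sets, following the protocol quoted above."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    query = shuffled[:n_query]         # 2,000 randomly sampled query pairs
    rest = shuffled[n_query:]          # everything not in the query set
    train = rng.sample(rest, n_train)  # 5,000 training pairs from the remainder
    retrieval = rest                   # all data not in the query set
    return query, train, retrieval

# Example with integer indices standing in for the 25,000 MIRFLICKR-25K pairs:
query, train, retrieval = make_splits(range(25000))
```

Note that under this protocol the training set is a subset of the retrieval set, since only the query pairs are held out.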
Hardware Specification | No | The paper does not provide specific hardware details such as the GPU or CPU models used for its experiments; it only describes the general experimental setup and parameters.
Software Dependencies | No | The paper mentions using a "pre-trained CLIP model", "Faster R-CNN", and the "Leiden algorithm", but does not specify software versions for any libraries, frameworks, or programming languages used.
Experiment Setup | Yes | "For the training task, the input image is resized to 224 × 224, and the input text is split into individual words, each as a separate input. We use the pre-trained CLIP model as the backbone during initialization. While fine-tuning the CLIP backbone, we only fine-tune the text encoder, but freeze the parameters of the image encoder. The output of the last layer is a 512-dimensional vector for both the image encoder and text encoder. Fully connected layers followed by a tanh activation are used to learn hashing functions. We provide the parameter configuration used for the two datasets here. For MIRFLICKR-25K, Nc = 14; for NUS-WIDE, Nc = 22. All the other parameters are the same across the two datasets: τ = 0.3, ρ = 0.1, Nw = 64, α = 1."
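A minimal stdlib-only sketch of the hashing head described in the setup: a fully connected layer followed by tanh, mapping the 512-dimensional encoder output to continuous codes in (-1, 1), which are binarized by sign. The random weight initialization and the 16-bit code length are illustrative assumptions, not values from the paper:

```python
import math
import random

def hash_head(feature, weights, bias):
    """Fully connected layer followed by tanh: maps an encoder feature
    to continuous hash codes in (-1, 1); sign() yields binary bits."""
    codes = []
    for w_row, b in zip(weights, bias):
        pre_activation = sum(w * x for w, x in zip(w_row, feature)) + b
        codes.append(math.tanh(pre_activation))
    return codes

# Illustrative dimensions: 512-d CLIP feature -> 16-bit hash code.
rng = random.Random(0)
dim, hash_bits = 512, 16
feature = [rng.gauss(0, 1) for _ in range(dim)]          # stand-in encoder output
weights = [[rng.gauss(0, 0.02) for _ in range(dim)]      # illustrative init
           for _ in range(hash_bits)]
bias = [0.0] * hash_bits
codes = hash_head(feature, weights, bias)                # continuous codes
bits = [1 if c >= 0 else -1 for c in codes]              # binarized hash code
```

In practice this head would be a trainable `Linear` layer plus `tanh` in a deep-learning framework; the pure-Python version above only illustrates the forward computation.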