Vision-guided Text Mining for Unsupervised Cross-modal Hashing with Community Similarity Quantization
Authors: Haozhi Fan, Yuan Cao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on two common datasets verify the accuracy improvements in comparison with state-of-the-art baselines. |
| Researcher Affiliation | Academia | Haozhi Fan¹, Yuan Cao*². ¹School of Engineering and Applied Science, University of Pennsylvania, USA; ²School of Computer Science and Technology, Ocean University of China, China. EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology in paragraph form, such as in the 'Vision-guided Text Mining' and 'Image Feature Extraction' sections, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/louisfanhz/VTMUCH |
| Open Datasets | Yes | MIRFLICKR-25K (Huiskes and Lew 2008) and NUS-WIDE (Chua et al. 2009) |
| Dataset Splits | Yes | For MIRFLICKR-25K, we randomly sample 2,000 pairs of images and captions as the query set, and sample 5,000 pairs from the remaining data as the training set. All data not in the query set is used as retrieval set. ... For NUS-WIDE, out of which 2,000 pairs are randomly sampled as the query set, 5,000 pairs are sampled from the remaining data as the training set. All data not in the query set is used as retrieval set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments. It only describes the general experimental setup and parameters. |
| Software Dependencies | No | The paper mentions using 'pre-trained CLIP model' and 'Faster R-CNN' and 'Leiden algorithm' but does not specify any software versions for libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | For the training task, the input image is resized to 224 × 224, and the input text is split into individual words, each as a separate input. We use the pre-trained CLIP model as the backbone during initialization. While fine-tuning the CLIP backbone, we only fine-tune the text encoder, but freeze the parameters of the image encoder. The output of the last layer is a 512-dimensional vector for both the image encoder and text encoder. Fully connected layers followed by a tanh activation are used to learn hashing functions. We provide the parameter configuration used for the two datasets here. For MIRFLICKR-25K, Nc = 14; for NUS-WIDE, Nc = 22. All other parameters are the same across the two datasets: τ = 0.3, ρ = 0.1, Nw = 64, α = 1. |
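The hashing head described in the Experiment Setup row (fully connected layers followed by a tanh activation on the 512-dimensional CLIP features, with binary codes obtained at retrieval time) can be sketched as below. This is a minimal illustration, not the authors' implementation: the single-layer head, the code length `K = 64`, and the sign-based binarization rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K = 512, 64  # CLIP output dimension (from the paper); code length K is assumed

# Fully connected layer: weight and bias (randomly initialized for the sketch)
W = rng.normal(scale=0.01, size=(DIM, K))
b = np.zeros(K)

def hash_head(feat):
    """Map 512-d features to K continuous codes in (-1, 1) via tanh(xW + b)."""
    return np.tanh(feat @ W + b)

def binarize(codes):
    """Quantize continuous codes to binary codes in {-1, +1} (ties map to +1)."""
    return np.where(codes >= 0, 1, -1)

feats = rng.normal(size=(5, DIM))  # stand-in for CLIP image or text embeddings
codes = hash_head(feats)           # shape (5, K), values strictly inside (-1, 1)
bits = binarize(codes)             # shape (5, K), values in {-1, +1}
```

During training, the continuous tanh outputs would be used in the loss so gradients can flow; the hard sign quantization is applied only when producing retrieval codes.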