Multi-aspect Self-guided Deep Information Bottleneck for Multi-modal Clustering
Authors: Shizhe Hu, Jiahao Fan, Guoliang Zou, Yangdong Ye
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results demonstrate that our method outperforms state-of-the-art multi-modal clustering methods, showcasing its superior performance and broad application prospects. |
| Researcher Affiliation | Academia | School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: MSDIB Algorithm. Input: Multi-modal dataset {X_i}_{i=1}^m; number of clusters k. Parameters: Hyperparameters α, β, and learning rate γ. Output: The label predictor C. |
| Open Source Code | Yes | Code: https://github.com/ShizheHu |
| Open Datasets | Yes | Caltech-2V (Fei-Fei, Fergus, and Perona 2004) consists of images from 7 categories, totaling 1440 images. It has the features of Wavelet moments (Shen and Ip 1999) and CENsus TRansform hISTogram (CENTRIST) (Wu and Rehg 2010), where each kind of feature is regarded as a modality. ESP-Game (Von Ahn and Dabbish 2004) comprises 11,032 images, consisting of 7 categories. The image features and the corresponding text description are used as two modalities. IAPR (Grubinger et al. 2006) is an image collection with semantic descriptions, consisting of 20,000 images and their corresponding textual descriptions. For this study, a total of 7,855 images with labels no less than 4 were selected and categorized into 6 classes. It utilizes the same two modalities as ESP-Game. MIRFlickr (Huiskes and Lew 2008) comprises a total of 12,154 images across 6 categories after denoising. It utilizes the same two modalities as ESP-Game. NUS-Wide (Chua et al. 2009) contains 20,000 images over 8 classes. It comprises a total of two modalities, including both image and text. |
| Dataset Splits | No | The paper lists several well-known multi-modal datasets and describes their contents, but it does not specify any training/validation/test splits, percentages, or methodology used for partitioning these datasets in their experiments. |
| Hardware Specification | Yes | We implemented the framework in PyTorch 1.13.0 on Windows 10 with a 24 GB NVIDIA RTX-3090 GPU and i7-12700F CPU. |
| Software Dependencies | Yes | We implemented the framework in PyTorch 1.13.0 on Windows 10 with a 24 GB NVIDIA RTX-3090 GPU and i7-12700F CPU. |
| Experiment Setup | Yes | Training converged within 100 epochs. We ran the model 20 times, selecting the highest accuracy at the lowest loss to avoid reporting runs stuck in poor local optima. The batch size was 100, using Adam with a learning rate of 0.0001. Grid search optimized trade-off parameters α and β in (0, 1) with a step size of 0.1. |
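The reported model-selection protocol (grid search over α, β in (0, 1) with step 0.1, multiple runs per setting, keeping the result with the highest accuracy at the lowest loss) can be sketched as below. This is a minimal illustration, not the authors' code: `train_and_eval` is a hypothetical hook standing in for one MSDIB training run that returns a (loss, accuracy) pair.

```python
import itertools

# Grid from the paper: alpha and beta range over (0, 1) with step 0.1.
ALPHAS = [round(0.1 * i, 1) for i in range(1, 10)]
BETAS = [round(0.1 * i, 1) for i in range(1, 10)]


def select_best(results):
    """Pick the run with the lowest loss, breaking ties by highest accuracy.

    `results` is a list of (loss, accuracy) tuples, one per run. This mirrors
    the paper's "highest accuracy at the lowest loss" selection rule.
    """
    return min(results, key=lambda r: (r[0], -r[1]))


def grid_search(train_and_eval, n_runs=20):
    """Evaluate every (alpha, beta) pair and return the best configuration.

    `train_and_eval(alpha, beta)` is a hypothetical callable that trains the
    model once and returns (loss, accuracy); the paper repeats it 20 times
    per setting.
    """
    best = None
    for alpha, beta in itertools.product(ALPHAS, BETAS):
        runs = [train_and_eval(alpha, beta) for _ in range(n_runs)]
        loss, acc = select_best(runs)
        if best is None or acc > best[2]:
            best = (alpha, beta, acc, loss)
    return best  # (alpha, beta, accuracy, loss)
```

With 9 x 9 grid points and 20 runs each, this protocol amounts to 1,620 training runs per dataset, which is worth keeping in mind when budgeting a reproduction.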