Super Deep Contrastive Information Bottleneck for Multi-modal Clustering

Authors: Zhengzheng Lou, Ke Zhang, Yucong Wu, Shizhe Hu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments on 4 multi-modal datasets, and the accuracy of the method on the ESP dataset improved by 9.3%. The results demonstrate the superiority and effective design of the proposed SDCIB.
Researcher Affiliation | Academia | School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China. Correspondence to: Shizhe Hu <EMAIL, https://shizhehu.github.io/>.
Pseudocode | Yes | Algorithm 1: Algorithm for Optimizing the proposed SDCIB
Open Source Code | Yes | The source code is available at https://github.com/ShizheHu.
Open Datasets | Yes | Caltech-2V (Fei-Fei et al., 2004) contains 1,440 image samples, categorized into 7 classes based on the WM and CENTRIST modalities. Event (Li & Fei-Fei, 2007) encompasses 1,579 sports-event image samples, divided into 8 categories based on 3 modalities: Color Attention, SIFT, and TPLBP. IAPR (Grubinger et al., 2006) includes 7,855 image samples accompanied by natural-language descriptions, divided into 6 categories using SIFT representation and BoW model modalities. ESP (Von Ahn & Dabbish, 2005), sourced from a social image collection on an image-annotation game website, comprises 11,032 image samples, categorized into 7 classes.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It lists the number of samples for each dataset but not how they were partitioned for experiments.
Hardware Specification | No | No specific hardware details (GPU, CPU, memory, etc.) are provided in the paper.
Software Dependencies | No | No software dependencies with version numbers are mentioned in the paper. The paper mentions using the Adam optimizer but does not specify versions for it or for any other libraries or programming languages.
Experiment Setup | Yes | The entire training process is completed within 40 epochs, with a batch size of 32. The proposed SDCIB consists of M modality-specific encoders, 4M mutual information estimators, and M clustering layers. Each modality-specific encoder contains 4 fully connected layers with dimensions of 1024, 1024, 1024, and 128, respectively. Each fully connected layer is followed by a BatchNorm layer for representation normalization and a ReLU activation. The clustering layer consists of a fully connected layer and a softmax layer to obtain the final clustering results. The Adam optimizer is used for parameter optimization, with an initial learning rate of 0.0001.
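The described per-modality architecture (four FC layers of 1024/1024/1024/128 units, each followed by BatchNorm and ReLU, plus an FC + softmax clustering head) can be sketched as a forward pass in NumPy. This is an illustrative sketch only: the input dimensionality (2048), random initialization, and inference-style batch normalization are assumptions, not details from the paper, which trains these layers end-to-end with Adam (lr 1e-4, batch size 32, 40 epochs).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    # Inference-style normalization over the batch dimension
    # (no learned scale/shift, for simplicity of the sketch).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def init_encoder(in_dim, dims=(1024, 1024, 1024, 128)):
    # One modality-specific encoder: 4 FC layers as stated in the setup.
    params, d = [], in_dim
    for out in dims:
        params.append((rng.standard_normal((d, out)) * 0.01, np.zeros(out)))
        d = out
    return params

def encode(x, params):
    # Each FC layer is followed by BatchNorm and ReLU, per the setup.
    for W, b in params:
        x = relu(batch_norm(x @ W + b))
    return x

def cluster_head(z, n_clusters):
    # FC layer + softmax producing soft cluster assignments.
    W = rng.standard_normal((z.shape[1], n_clusters)) * 0.01
    return softmax(z @ W)

# One modality with hypothetical 2048-d features, batch size 32, 7 clusters:
x = rng.standard_normal((32, 2048))
z = encode(x, init_encoder(2048))   # (32, 128) representation
q = cluster_head(z, 7)              # (32, 7) soft assignments; rows sum to 1
```

In the full model, one such encoder and clustering head would be instantiated per modality, with the mutual information estimators operating on the 128-d representations.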