Towards Out-of-Modal Generalization without Instance-level Modal Correspondence

Authors: Zhuo Huang, Gang Niu, Bo Han, Masashi Sugiyama, Tongliang Liu

ICLR 2025

Reproducibility Assessment (Variable — Result — LLM Response):
Research Type: Experimental — "We carefully evaluate the proposed COX method under various OOM generalization scenarios, verifying its effectiveness and extensibility. The code is available at https://github.com/tmllab/2025_ICLR_COX." ... "4 EXPERIMENTS. In our experiments, we first elucidate the experimental details. Then, we provide performance comparisons to various baseline methods on different datasets. Finally, we conduct empirical analyses to provide an intuitive understanding of the proposed method."
Researcher Affiliation: Academia — Zhuo Huang (1), Gang Niu (2), Bo Han (3,2), Masashi Sugiyama (2,4), Tongliang Liu (1,2); (1) Sydney AI Centre, The University of Sydney; (2) RIKEN Center for Advanced Intelligence Project; (3) Hong Kong Baptist University; (4) The University of Tokyo. Corresponding to Tongliang Liu (EMAIL).
Pseudocode: No — The paper includes mathematical formulations, theorems, and descriptions of the method, but does not contain explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: Yes — "The code is available at https://github.com/tmllab/2025_ICLR_COX."
Open Datasets: Yes — "Datasets. We consider datasets with at least three modalities: 1) the TVL dataset (Fu et al., 2024) contains tactile sensing, RGB images, and class names that can be transformed into language; 2) the LLVIP dataset (Jia et al., 2021) has infrared thermal data, RGB images, and annotations for pedestrian detection; we follow Zhu et al. (2023) to crop the pedestrians and backgrounds, which stand for two classes, and further use the OpenAI template (Radford et al., 2021) to create language descriptions; 3) the NYU-D dataset (Silberman et al., 2012) contains RGB images, depth data, and class names that can be transformed into language descriptions as well; 4) the VGGS dataset (Chen et al., 2020a) includes video data, the corresponding sound, and language descriptions; 5) MSR-VTT (Xu et al., 2016) includes videos and text descriptions; we break the videos down into video frames and audio data; 6) the MOSEI dataset (Zadeh et al., 2018) contains videos from 7 classes of emotions; we extract audio data from the videos and use the emotion type to create language descriptions."
Dataset Splits: Yes — "Setup. We consider two scenarios of OOM generalization: for the semi-supervised case, we sample 10% of the training data as labeled data, with each class having a balanced number of labels; for the unsupervised case, we have no labels at all. For selecting the number of anchor points, we choose the same number of examples for the warm-up and training phases, which is 10% of the total training set."
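The balanced 10% labeled split described above can be sketched in plain Python. The function name `make_semi_split` and the per-class rounding rule are illustrative assumptions, not taken from the paper's released code.

```python
# Hedged sketch: sample labeled_frac of the training indices so that every
# class contributes an (approximately) equal number of labeled examples,
# as in the paper's semi-supervised setup.
import random
from collections import defaultdict

def make_semi_split(labels, labeled_frac=0.10, seed=0):
    """Return (labeled_idx, unlabeled_idx) with a class-balanced labeled set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # equal labeled budget per class (rounding is an assumption)
    per_class = max(1, int(labeled_frac * len(labels) / len(by_class)))
    labeled = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        labeled.extend(idxs[:per_class])
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled

# toy usage: 100 examples, 2 classes -> 10 labeled (5 per class), 90 unlabeled
labels = [0] * 50 + [1] * 50
lab, unlab = make_semi_split(labels)
```

For the unsupervised case described in the same setup, the labeled set would simply be empty; the anchor-point budget (10% of the training set) can reuse the same per-class sampling idea.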
Hardware Specification: Yes — "Discussion on computational efficiency. Note that since we conduct the feature connection mostly in the feature space, the computational cost of training the VIB framework is quite acceptable. The main cost is training the OOM learner, which is basic training with cross-entropy loss optimization and can be implemented on a single NVIDIA 3090/4090 GPU."
Software Dependencies: No — The paper mentions using the Adam optimizer but does not specify software dependencies with version numbers (e.g., the deep learning framework, such as PyTorch or TensorFlow, or the Python version).
Experiment Setup: Yes — "Setup. ... To train the OOM learner, we use the Adam optimizer with an initial learning rate of 1e-3 and a weight decay of 1e-5, and train the model for 50 epochs."