Can One Modality Model Synergize Training of Other Modality Models?

Authors: Jae-Jun Lee, Sung Whan Yoon

ICLR 2025

Reproducibility variables, with the assessed result and the supporting LLM response:
Research Type: Experimental. "As proofs of concept, we broadly confirm the considerable gains from the synergy across visual, language, and audio models. ... In this section, we provide an overview of the experimental results, along with detailed descriptions of the datasets, models and additional experimental settings. ... Our main results are two parts: Table 1 with ImageNet-1K and Table 2 with multimodal datasets, i.e., IEMOCAP and AVMNIST."
Researcher Affiliation: Academia. "Jae-Jun Lee (1), Sung Whan Yoon (1,2); (1) Graduate School of Artificial Intelligence and (2) Department of Electrical Engineering, Ulsan National Institute of Science and Technology (UNIST). EMAIL"
Pseudocode: Yes. "Algorithm 1: Training Procedures for [Mj Mi]"
Open Source Code: Yes. "The code is available at https://github.com/johnjaejunlee95/synergistic-multimodal."
Open Datasets: Yes. "Datasets: For the main experiments, we test on the ImageNet-1K dataset (Krizhevsky et al., 2012) for visual tasks as the case of [L V]. For further experiments in the multimodal setting, we employ the IEMOCAP (Busso et al., 2008) and AVMNIST (Liang et al., 2021; Li et al., 2023) datasets."
Dataset Splits: Yes. "Datasets: For the main experiments, we test on the ImageNet-1K dataset (Krizhevsky et al., 2012) for visual tasks as the case of [L V]. ... ImageNet-1K (Krizhevsky et al., 2012) is an image dataset that contains 1,000 classes with 1,281,167 training images and 50,000 validation images."
Hardware Specification: Yes. "Furthermore, we utilized Automatic Mixed Precision (Micikevicius et al., 2018) in conjunction with 4 A6000 GPUs."
Software Dependencies: No. The paper mentions software such as PyTorch and the AdamW and Adam optimizers, but cites the papers introducing them rather than the specific versions used in the experiments.
Experiment Setup: Yes. "In the [L V] case for the ImageNet-1K classification task, we adhered to the hyperparameter settings established by AugReg-ViT (Steiner et al., 2022) for all training models, specifically ResNet-50, ViT-B/32, and ViT-B/16. For the baseline model, we trained for 300 epochs with a batch size of 1024, utilizing a learning rate of 1x10^-3 and a weight decay of 5x10^-2. We employed the AdamW optimizer (Loshchilov & Hutter, 2019) with cosine learning rate scheduling (Loshchilov & Hutter, 2017) and implemented a linear warmup for 20 epochs."
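The quoted schedule (20-epoch linear warmup into cosine decay, 300 epochs total, base learning rate 1x10^-3) can be sketched as a small standalone function. This is a minimal sketch, not the authors' code: the function name `lr_at_epoch` is hypothetical, and the paper's actual scheduler may operate per step rather than per whole epoch.

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300):
    """Linear warmup followed by cosine decay (hypothetical helper,
    mirroring the quoted AdamW + cosine-schedule setup)."""
    if epoch < warmup_epochs:
        # Linear warmup: ramp from 0 up to base_lr over the first 20 epochs.
        return base_lr * epoch / warmup_epochs
    # Cosine decay: anneal from base_lr down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, the rate is half the base value midway through warmup (epoch 10) and reaches the base value exactly when warmup ends (epoch 20).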