Can One Modality Model Synergize Training of Other Modality Models?
Authors: Jae-Jun Lee, Sung Whan Yoon
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As proofs of concept, we broadly confirm the considerable gains from the synergy across visual, language, and audio models. ... In this section, we provide an overview of the experimental results, along with detailed descriptions of the datasets, models, and additional experimental settings. ... Our main results come in two parts: Table 1 with ImageNet-1K and Table 2 with multimodal datasets, i.e., IEMOCAP and AVMNIST. |
| Researcher Affiliation | Academia | Jae-Jun Lee¹, Sung Whan Yoon¹·²; ¹Graduate School of Artificial Intelligence and ²Department of Electrical Engineering, Ulsan National Institute of Science and Technology (UNIST), EMAIL |
| Pseudocode | Yes | Algorithm 1: Training Procedures for [Mj Mi] |
| Open Source Code | Yes | The code is available at https://github.com/johnjaejunlee95/synergistic-multimodal. |
| Open Datasets | Yes | Datasets: For the main experiments, we test on the ImageNet-1K dataset (Krizhevsky et al., 2012) for visual tasks as the case of [L V]. For further experiments in the multimodal setting, we employ the IEMOCAP (Busso et al., 2008) and AVMNIST (Liang et al., 2021; Li et al., 2023) datasets. |
| Dataset Splits | Yes | Datasets: For the main experiments, we test on the ImageNet-1K dataset (Krizhevsky et al., 2012) for visual tasks as the case of [L V]. ... ImageNet-1K (Krizhevsky et al., 2012) is an image dataset that contains 1,000 classes, with 1,281,167 training images and 50,000 validation images. |
| Hardware Specification | Yes | Furthermore, we utilized Automatic Mixed Precision (Micikevicius et al., 2018) in conjunction with 4 A6000 GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch, AdamW optimizer, and Adam optimizer but provides citations to the papers introducing them, not specific version numbers of the software used in the experiments. |
| Experiment Setup | Yes | In the [L V] case for the ImageNet-1K classification task, we adhered to the hyperparameter settings established by AugReg-ViT (Steiner et al., 2022) for all training models, specifically ResNet-50, ViT-B/32, and ViT-B/16. For the baseline model, we trained for 300 epochs with a batch size of 1024, utilizing a learning rate of 1×10⁻³ and a weight decay of 5×10⁻². We employed the AdamW optimizer (Loshchilov & Hutter, 2019) with cosine learning rate scheduling (Loshchilov & Hutter, 2017) and implemented a linear warmup for 20 epochs. |
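The quoted schedule (linear warmup for 20 epochs, then cosine decay over the 300-epoch run, peak learning rate 1×10⁻³) can be sketched as a plain-Python function. The epoch counts and peak rate come from the excerpt; the final learning-rate floor of zero is an assumption, as the paper excerpt does not state it.

```python
import math

def lr_at_epoch(epoch, peak_lr=1e-3, warmup_epochs=20,
                total_epochs=300, min_lr=0.0):
    """Linear warmup followed by cosine decay.

    peak_lr, warmup_epochs, and total_epochs follow the quoted setup;
    min_lr (the final floor) is an assumption.
    """
    if epoch < warmup_epochs:
        # Linear ramp from peak_lr/warmup_epochs up to peak_lr.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from peak_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For example, the rate reaches its peak at the end of warmup (`lr_at_epoch(19)` ≈ 1e-3) and is halfway down at the schedule midpoint (`lr_at_epoch(160)` ≈ 5e-4). In practice the same shape is typically obtained by composing a PyTorch warmup scheduler with `CosineAnnealingLR`.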