Exploring the Better Multimodal Synergy Strategy for Vision-Language Models
Authors: Xiaotian Yin, Xin Liu, Si Chen, Yuan Wang, Yuwen Pan, Tianzhu Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments demonstrate that DsRA improves the generalizability under few-shot classification, base-to-new generalization, and domain generalization settings. |
| Researcher Affiliation | Academia | Deep Space Exploration Laboratory/School of Information Science and Technology, University of Science and Technology of China EMAIL, EMAIL |
| Pseudocode | No | The paper describes the proposed method using textual descriptions and mathematical equations (e.g., equations 1-17) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | Our code will be released soon. |
| Open Datasets | Yes | In line with CLIP (Radford et al. 2021), we utilize 11 image benchmark datasets: ImageNet (Img) (Deng et al. 2009), Caltech101 (Cal) (Fei-Fei, Fergus, and Perona 2004), FGVCAircraft (FGV) (Maji et al. 2013), Flower102 (Flo) (Nilsback and Zisserman 2008), Food101 (Foo) (Bossard, Guillaumin, and Van Gool 2014), OxfordPets (Pet) (Parkhi et al. 2012), StanfordCars (Car) (Krause et al. 2013), EuroSAT (Eur) (Helber et al. 2019), DTD (Cimpoi et al. 2014), SUN397 (SUN) (Xiao et al. 2010), and UCF101 (UCF) (Soomro, Zamir, and Shah 2012). ... Additionally, we conduct experiments to evaluate DsRA's domain generalization capabilities, leveraging ImageNet (Img) as the source dataset and considering its diverse domain variants, such as ImageNetV2 (V2) (Recht et al. 2019), ImageNet-Sketch (S) (Wang et al. 2019), ImageNet-A (A) (Hendrycks et al. 2021b), and ImageNet-R (R) (Hendrycks et al. 2021a), as the target datasets. |
| Dataset Splits | Yes | We first evaluate our model on few-shot classification, where models are trained on 1, 2, 4, 8 and 16 shots and then applied to the test sets. ... To ensure fairness, we follow the experimental methodologies outlined in CoOp (Zhou et al. 2022b) and CoCoOp (Zhou et al. 2022a), including dataset splits, data augmentation, and backbones. ... To test the base-to-new generalization ability, we follow CoCoOp to train our model only on the base classes in a 16-shot setting and evaluate the model on base and new categories. |
| Hardware Specification | Yes | All experiments are conducted on a single RTX 3090 GPU. |
| Software Dependencies | No | All experiments are conducted using the CLIP model with a ViT-B/16 backbone. The hidden dimension dr is set to 50. SGD is used for optimization with a learning rate of 2.7e-3 and a batch size of 4. |
| Experiment Setup | Yes | The hidden dimension dr is set to 50. SGD is used for optimization with a learning rate of 2.7e-3 and a batch size of 4. All experiments are conducted on a single RTX 3090 GPU. The main results are averaged over three runs. |
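Since the paper's code has not yet been released, the reported experiment setup can be summarized as a configuration sketch. This is a hypothetical reconstruction from the table above; the class and field names (`DsRAConfig`, `hidden_dim`, etc.) are illustrative assumptions, not the authors' actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DsRAConfig:
    """Hypothetical training configuration matching the paper's reported setup."""
    backbone: str = "ViT-B/16"   # CLIP visual backbone
    hidden_dim: int = 50         # hidden dimension d_r
    optimizer: str = "SGD"
    lr: float = 2.7e-3           # reported learning rate
    batch_size: int = 4
    gpu: str = "RTX 3090"        # single GPU
    num_runs: int = 3            # main results averaged over three runs

cfg = DsRAConfig()
print(cfg.backbone, cfg.lr, cfg.batch_size)
```

A frozen dataclass like this makes the reported hyperparameters explicit and immutable, which is convenient when re-running the few-shot, base-to-new, and domain-generalization settings with identical settings.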