Vision-Language Model Selection and Reuse for Downstream Adaptation

Authors: Hao-Zhe Tan, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs. ... We also introduce a new benchmark for evaluating VLM selection methods, including 49 VLMs and 17 target task datasets.
Researcher Affiliation | Academia | (1) National Key Laboratory for Novel Software Technology, Nanjing University, China; (2) School of Intelligence Science and Technology, Nanjing University, China; (3) School of Artificial Intelligence, Nanjing University, China.
Pseudocode | Yes | Algorithm 1: Model Selection & Reuse. Input: model hub M, model labels {S_m}, semantic graph G, semantic-graph caption dataset D_G, count k of reused models per class, target task T = (X, Y). Output: task prediction {ŷ}.
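The quoted pseudocode is terse, so the selection-and-reuse loop can be sketched as below. This is a hypothetical reading, not the authors' implementation: the per-class capability scores, the top-k selection rule, and logit averaging for reuse are all assumptions made for illustration.

```python
def select_and_reuse(model_hub, class_scores, k, images, classes):
    """Hypothetical sketch of Algorithm 1: pick the top-k models per class by a
    precomputed capability score, then ensemble their zero-shot logits.

    model_hub:    dict name -> callable(image, classes) -> dict class -> logit
    class_scores: dict class -> dict name -> estimated capability score
    """
    # Step 1 (selection): top-k models for each target class.
    selected = {
        c: sorted(class_scores[c], key=class_scores[c].get, reverse=True)[:k]
        for c in classes
    }
    # Step 2 (reuse): average the logits of the selected models per class,
    # then predict the class with the highest fused logit.
    predictions = []
    for x in images:
        fused = {}
        for c in classes:
            logits = [model_hub[m](x, classes)[c] for m in selected[c]]
            fused[c] = sum(logits) / len(logits)
        predictions.append(max(fused, key=fused.get))
    return predictions
```

With k models chosen independently per class, a model that is strong only on some classes can still contribute where it is reliable, which is the intuition behind per-class (rather than whole-task) selection.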
Open Source Code | Yes | Code Availability Statement: The implementation code of the benchmark and our proposal for the MLL paradigm is available at https://github.com/LAMDASZ-ML/MLL.
Open Datasets | Yes | We utilized 5 datasets, ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b) and ImageNet-R (Hendrycks et al., 2021a), as Sample Datasets for semantic graph construction. Additionally, we used 17 commonly used datasets and their general task information as Target Datasets to evaluate VLM selection and reuse methods in zero-shot visual tasks (as shown in Table 5).
Dataset Splits | No | To evaluate the capabilities of the MLL paradigm in zero-shot visual tasks with VLMs, we need to obtain a set of sample datasets for constructing the semantic graph G, along with another set dedicated to downstream target tasks. For this study, we select 49 VLMs, 5 Sample Datasets, and 17 Target Datasets. Additionally, we collect general information about the task types and domains associated with each dataset to provide a task description. For testing selected models on target tasks, we utilized the same prompting strategy outlined in Radford et al. (2021), ensuring consistency in our evaluation methodology. ... The ground-truth model ranking for each target task is provided for evaluation.
Hardware Specification | Yes | All experiments are conducted on NVIDIA A800 GPUs.
Software Dependencies | No | In addition, we used the OpenAI text-embedding-3-large model to obtain their caption embeddings.
Experiment Setup | Yes | Additionally, the weight α for model selection in our setting is set to 0.7.
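Only the value α = 0.7 is quoted; what the weight combines is not stated in this excerpt. A plausible reading, sketched purely as an assumption, is that α convexly mixes two evidence sources when scoring candidate models:

```python
def selection_score(graph_score, caption_score, alpha=0.7):
    """Hypothetical convex combination for ranking candidate VLMs.

    alpha = 0.7 is the paper's quoted setting; the two terms (e.g. a
    semantic-graph-based estimate and a caption-embedding-based estimate)
    are assumptions for illustration, not quoted from the paper."""
    return alpha * graph_score + (1 - alpha) * caption_score
```

With α = 0.7 the first evidence source dominates the ranking, while the second acts as a tie-breaking correction.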