Vision-Language Model Selection and Reuse for Downstream Adaptation
Authors: Hao-Zhe Tan, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs. ... We also introduce a new benchmark for evaluating VLM selection methods, including 49 VLMs and 17 target task datasets. |
| Researcher Affiliation | Academia | 1National Key Laboratory for Novel Software Technology, Nanjing University, China 2School of Intelligence Science and Technology, Nanjing University, China 3School of Artificial Intelligence, Nanjing University, China. |
| Pseudocode | Yes | Algorithm 1 Model Selection & Reuse. Input: model hub M, model labels {Sm}, semantic graph G, semantic graph caption dataset DG, count k of reused models per class, target task T = (X, Y). Output: task prediction {ŷ} |
| Open Source Code | Yes | Code Availability Statement: The implementation code of the benchmark and our proposal for the MLL paradigm is available at https://github.com/LAMDASZ-ML/MLL. |
| Open Datasets | Yes | We utilized 5 datasets, ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b) and ImageNet-R (Hendrycks et al., 2021a), as Sample Datasets for semantic graph construction. Additionally, we used 17 commonly used datasets and their task general information as Target Datasets to evaluate VLM selection and reuse methods in zero-shot visual tasks (as shown in Table 5). |
| Dataset Splits | No | To evaluate the capabilities of the MLL paradigm in zero-shot visual tasks with VLMs, we need to obtain a set of sampling datasets for constructing the semantic graph G, along with another set dedicated to downstream target tasks. For this study, we select 49 VLMs, 5 Sample Datasets, and 17 Target Datasets. Additionally, we collect general information about task types and domains associated with each dataset to provide a task description. For testing selected models on target tasks, we utilized the same prompting strategy outlined in Radford et al. (2021), ensuring consistency in our evaluation methodology. ... The ground-truth model ranking for each target task is provided for evaluation. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A800 GPUs. |
| Software Dependencies | No | In addition, we used the OpenAI text-embedding-3-large model to obtain their caption embeddings. |
| Experiment Setup | Yes | Additionally, the weight α for model selection in our setting is set to 0.7. |
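The table quotes Algorithm 1's signature (select k models per class from a hub, then reuse them for prediction) and a selection weight α = 0.7, but not the scoring formula itself. The sketch below is purely illustrative: the two score sources (`semantic_scores`, `history_scores`) and the linear α-combination are assumptions, not the paper's actual method.

```python
# Hypothetical sketch of per-class VLM selection in the spirit of the
# quoted Algorithm 1. ASSUMPTIONS: the selection score is a convex
# combination of two per-model signals weighted by alpha; the paper's
# real scoring functions are not given in this report.
ALPHA = 0.7  # model-selection weight reported in the experiment setup


def select_models_per_class(semantic_scores, history_scores, k):
    """Rank candidate VLMs for each target class and keep the top k.

    semantic_scores / history_scores: dict[class_name][model_id] -> float
    (assumed score sources, e.g. semantic-graph affinity and past
    performance; both names are hypothetical).
    """
    selected = {}
    for cls, sem in semantic_scores.items():
        combined = {
            m: ALPHA * sem[m] + (1 - ALPHA) * history_scores[cls][m]
            for m in sem
        }
        # Keep the k highest-scoring models for this class.
        selected[cls] = sorted(combined, key=combined.get, reverse=True)[:k]
    return selected
```

A reuse step would then ensemble the k selected models' zero-shot predictions per class; how the ensemble is formed is likewise unspecified here.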