Vision-Language Model Selection and Reuse for Downstream Adaptation

Authors: Hao-Zhe Tan, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs. ... We also introduce a new benchmark for evaluating VLM selection methods, including 49 VLMs and 17 target task datasets.
Researcher Affiliation | Academia | (1) National Key Laboratory for Novel Software Technology, Nanjing University, China; (2) School of Intelligence Science and Technology, Nanjing University, China; (3) School of Artificial Intelligence, Nanjing University, China.
Pseudocode | Yes | Algorithm 1: Model Selection & Reuse. Input: model hub M, model labels {S_m}, semantic graph G, semantic-graph caption dataset D_G, count k of reused models per class, target task T = (X, Y). Output: task prediction {ŷ}.
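The quoted pseudocode is terse, so the selection-and-reuse loop can be sketched as below. This is a hypothetical reading, not the authors' implementation: the per-class capability scores, the top-k selection rule, and logit averaging for reuse are all assumptions made for illustration.

```python
def select_and_reuse(model_hub, class_scores, k, images, classes):
    """Hypothetical sketch of Algorithm 1: pick the top-k models per class by a
    precomputed capability score, then ensemble their zero-shot logits.

    model_hub:    dict name -> callable(image, classes) -> dict class -> logit
    class_scores: dict class -> dict name -> estimated capability score
    """
    # Step 1 (selection): top-k models for each target class.
    selected = {
        c: sorted(class_scores[c], key=class_scores[c].get, reverse=True)[:k]
        for c in classes
    }
    # Step 2 (reuse): average the logits of the selected models per class,
    # then predict the class with the highest fused logit.
    predictions = []
    for x in images:
        fused = {}
        for c in classes:
            logits = [model_hub[m](x, classes)[c] for m in selected[c]]
            fused[c] = sum(logits) / len(logits)
        predictions.append(max(fused, key=fused.get))
    return predictions
```

With k models chosen independently per class, a model that is strong only on some classes can still contribute where it is reliable, which is the intuition behind per-class (rather than whole-task) selection.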
Open Source Code | Yes | Code Availability Statement: The implementation code of the benchmark and our proposal for the MLL paradigm is available at https://github.com/LAMDASZ-ML/MLL.
Open Datasets | Yes | We utilized 5 datasets, ImageNet (Deng et al., 2009), ImageNet-V2 (Recht et al., 2019), ImageNet-Sketch (Wang et al., 2019), ImageNet-A (Hendrycks et al., 2021b) and ImageNet-R (Hendrycks et al., 2021a), as Sample Datasets for semantic graph construction. Additionally, we used 17 commonly used datasets and their general task information as Target Datasets to evaluate VLM selection and reuse methods in zero-shot visual tasks (as shown in Table 5).
Dataset Splits | No | To evaluate the capabilities of the MLL paradigm in zero-shot visual tasks with VLMs, we need to obtain a set of sample datasets for constructing the semantic graph G, along with another set dedicated to downstream target tasks. For this study, we select 49 VLMs, 5 Sample Datasets, and 17 Target Datasets. Additionally, we collect general information about the task types and domains associated with each dataset to provide a task description. For testing selected models on target tasks, we utilized the same prompting strategy outlined in Radford et al. (2021), ensuring consistency in our evaluation methodology. ... The ground-truth model ranking for each target task is provided for evaluation.
Hardware Specification | Yes | All experiments are conducted on NVIDIA A800 GPUs.
Software Dependencies | No | In addition, we used the OpenAI text-embedding-3-large model to obtain their caption embeddings.
Experiment Setup | Yes | Additionally, the weight α for model selection in our setting is set to 0.7.
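Only the value α = 0.7 is quoted; what the weight combines is not stated in this excerpt. A plausible reading, sketched purely as an assumption, is that α convexly mixes two evidence sources when scoring candidate models:

```python
def selection_score(graph_score, caption_score, alpha=0.7):
    """Hypothetical convex combination for ranking candidate VLMs.

    alpha = 0.7 is the paper's quoted setting; the two terms (e.g. a
    semantic-graph-based estimate and a caption-embedding-based estimate)
    are assumptions for illustration, not quoted from the paper."""
    return alpha * graph_score + (1 - alpha) * caption_score
```

With α = 0.7 the first evidence source dominates the ranking, while the second acts as a tie-breaking correction.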