Can We Predict Performance of Large Models across Vision-Language Tasks?
Authors: Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data. |
| Researcher Affiliation | Collaboration | 1Australian National University, Canberra, Australia 2Seeing Machines Ltd, Canberra, Australia. Correspondence to: Qinyu Zhao <EMAIL>. |
| Pseudocode | No | The paper includes probabilistic graphical models (Figure 2) and describes algorithms in text, but it does not present a structured pseudocode block or a clearly labeled algorithm section. |
| Open Source Code | Yes | Our code is available at https://github.com/Qinyu-Allen-Zhao/CrossPred-LVLM. |
| Open Datasets | Yes | Based on prior code repositories (Duan et al., 2024; Zhang et al., 2024b; Liang et al., 2024), we evaluate 108 LVLMs on 176 tasks in 36 benchmarks and build a 108 × 176 performance matrix, which is used for our main experiments on PMF and MCMC in Sections 4.2 and 4.3. Full details of datasets and models are provided in the supplementary material (Section B). |
| Dataset Splits | Yes | Using the results from 108 LVLMs across 176 datasets, we construct a 108 × 176 performance matrix, with some entries masked for testing. In the experiment, we start with 20% of the data for initial training, 60% as the pool set, and 20% for testing. |
| Hardware Specification | Yes | To estimate evaluation cost of LVLMs, we evaluate a representative model, Qwen2.5-VL-Instruct 7B (Wang et al., 2024b) based on LMMs-Eval (Zhang et al., 2024b) with one A100 GPU. |
| Software Dependencies | No | The paper mentions using PyMC and the No-U-Turn Sampler (NUTS) but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For MCMC, we use the No-U-Turn Sampler (NUTS) (Hoffman et al., 2014), an advanced Hamiltonian Monte Carlo method (Neal, 2011), tuning with 500 samples in the burn-in stage and drawing 100 samples. For DMF, we use MSE loss and the Adam optimizer. The learning rate is 1e-3 and the batch size is 256. The embedding dimension of each user or item is 10, which is the same for PMF. We train DMF for 200 epochs and the result of the best epoch is reported. |
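The 20/60/20 entry split quoted in the Dataset Splits row can be sketched as follows. This is our own illustration of randomly partitioning the entries of a 108 × 176 model-by-task performance matrix into initial-training, pool, and test sets; the function name, seed, and return format are assumptions, not the paper's code.

```python
import numpy as np

def split_matrix_entries(n_models=108, n_tasks=176, seed=0):
    """Randomly assign matrix entries to train (20%), pool (60%), test (20%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_models * n_tasks)
    n_train = int(0.2 * idx.size)
    n_pool = int(0.6 * idx.size)
    splits = {
        "train": idx[:n_train],
        "pool": idx[n_train:n_train + n_pool],
        "test": idx[n_train + n_pool:],
    }
    # Convert flat indices to boolean masks over the performance matrix.
    masks = {}
    for name, ids in splits.items():
        m = np.zeros(n_models * n_tasks, dtype=bool)
        m[ids] = True
        masks[name] = m.reshape(n_models, n_tasks)
    return masks

masks = split_matrix_entries()
```

Only the entries under `masks["train"]` would be visible to the model at the start; pool entries are revealed incrementally, and test entries stay masked for evaluation.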
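To make the Experiment Setup row concrete, here is a minimal matrix-factorization sketch mirroring the quoted hyperparameters (embedding dimension 10, MSE loss on observed entries). It uses plain full-batch gradient descent rather than Adam with batch size 256, so it illustrates the setup only and is not the paper's PMF or DMF implementation; the function name and initialization scale are our own.

```python
import numpy as np

def factorize(R, mask, dim=10, lr=1e-3, epochs=200, seed=0):
    """Fit R ~ U @ V.T on entries where mask is 1, minimizing MSE."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, dim))  # model (row) embeddings
    V = 0.1 * rng.standard_normal((m, dim))  # task (column) embeddings
    for _ in range(epochs):
        err = mask * (U @ V.T - R)           # residual on observed entries only
        U -= lr * (err @ V)                  # gradient step for row factors
        V -= lr * (err.T @ U)                # gradient step for column factors
    return U, V
```

Predictions for masked (unobserved) entries are then read off from `U @ V.T`, which is the basic mechanism PMF and DMF share.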