Can We Predict Performance of Large Models across Vision-Language Tasks?
Authors: Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data. |
| Researcher Affiliation | Collaboration | 1Australian National University, Canberra, Australia 2Seeing Machines Ltd, Canberra, Australia. Correspondence to: Qinyu Zhao <EMAIL>. |
| Pseudocode | No | The paper includes probabilistic graphical models (Figure 2) and describes algorithms in text, but it does not present a structured pseudocode block or a clearly labeled algorithm section. |
| Open Source Code | Yes | Our code is available at https://github.com/Qinyu-Allen-Zhao/CrossPred-LVLM. |
| Open Datasets | Yes | Based on prior code repositories (Duan et al., 2024; Zhang et al., 2024b; Liang et al., 2024), we evaluate 108 LVLMs on 176 tasks in 36 benchmarks and build a 108 × 176 performance matrix, which is used for our main experiments on PMF and MCMC in Sections 4.2 and 4.3. Full details of datasets and models are provided in the supplementary material (Section B). |
| Dataset Splits | Yes | Using the results from 108 LVLMs across 176 datasets, we construct a 108 × 176 performance matrix, with some entries masked for testing. In the experiment, we start with 20% of the data for initial training, 60% as the pool set, and 20% for testing. |
| Hardware Specification | Yes | To estimate evaluation cost of LVLMs, we evaluate a representative model, Qwen2.5-VL-Instruct 7B (Wang et al., 2024b) based on LMMs-Eval (Zhang et al., 2024b) with one A100 GPU. |
| Software Dependencies | No | The paper mentions using PyMC and the No-U-Turn Sampler (NUTS) but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For MCMC, we use the No-U-Turn Sampler (NUTS) (Hoffman et al., 2014), an advanced Hamiltonian Monte Carlo method (Neal, 2011), tuning with 500 samples in the burn-in stage and drawing 100 samples. For DMF, we use MSE loss and the Adam optimizer. The learning rate is 1e-3 and the batch size is 256. The embedding dimension of each user or item is 10, which is the same for PMF. We train DMF for 200 epochs and the result of the best epoch is reported. |
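The 20/60/20 entry split quoted in the Dataset Splits row can be sketched as follows. This is our own illustration of randomly partitioning the entries of a 108 × 176 model-by-task performance matrix into initial-training, pool, and test sets; the function name, seed, and return format are assumptions, not the paper's code.

```python
import numpy as np

def split_matrix_entries(n_models=108, n_tasks=176, seed=0):
    """Randomly assign matrix entries to train (20%), pool (60%), test (20%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_models * n_tasks)
    n_train = int(0.2 * idx.size)
    n_pool = int(0.6 * idx.size)
    splits = {
        "train": idx[:n_train],
        "pool": idx[n_train:n_train + n_pool],
        "test": idx[n_train + n_pool:],
    }
    # Convert flat indices to boolean masks over the performance matrix.
    masks = {}
    for name, ids in splits.items():
        m = np.zeros(n_models * n_tasks, dtype=bool)
        m[ids] = True
        masks[name] = m.reshape(n_models, n_tasks)
    return masks

masks = split_matrix_entries()
```

Only the entries under `masks["train"]` would be visible to the model at the start; pool entries are revealed incrementally, and test entries stay masked for evaluation.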
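To make the Experiment Setup row concrete, here is a minimal matrix-factorization sketch mirroring the quoted hyperparameters (embedding dimension 10, MSE loss on observed entries). It uses plain full-batch gradient descent rather than Adam with batch size 256, so it illustrates the setup only and is not the paper's PMF or DMF implementation; the function name and initialization scale are our own.

```python
import numpy as np

def factorize(R, mask, dim=10, lr=1e-3, epochs=200, seed=0):
    """Fit R ~ U @ V.T on entries where mask is 1, minimizing MSE."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.standard_normal((n, dim))  # model (row) embeddings
    V = 0.1 * rng.standard_normal((m, dim))  # task (column) embeddings
    for _ in range(epochs):
        err = mask * (U @ V.T - R)           # residual on observed entries only
        U -= lr * (err @ V)                  # gradient step for row factors
        V -= lr * (err.T @ U)                # gradient step for column factors
    return U, V
```

Predictions for masked (unobserved) entries are then read off from `U @ V.T`, which is the basic mechanism PMF and DMF share.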