Ranked from Within: Ranking Large Multimodal Models Without Labels
Authors: Weijie Tu, Weijian Deng, Dylan Campbell, Yu Yao, Jiyang Zheng, Tom Gedeon, Tongliang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 47 state-of-the-art LMMs (e.g., LLaVA) across 9 visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. |
| Researcher Affiliation | Academia | 1 Australian National University; 2 Sydney AI Centre, The University of Sydney; 3 Curtin University; 4 Óbuda University. |
| Pseudocode | No | The paper describes methods using equations and prose but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper states: "All 45 models can be downloaded via Huggingface with different versions of transformer library". This refers to the models used *by* the authors, not the authors' own source code for their methodology. There is no explicit statement of the authors releasing their code or a direct link to a repository for their specific implementation. |
| Open Datasets | Yes | All experiments are conducted on VLMEvalKit (Duan et al., 2024). We consider 8 datasets and their corresponding links to download TSV files via the toolkit: ScienceQA (Lu et al., 2022) (https://opencompass.openxlab.space/utils/VLMEval/ScienceQA_TEST.tsv); AI2D (Hiippala et al., 2021) (https://opencompass.openxlab.space/utils/VLMEval/AI2D_TEST.tsv); ChartQA (Masry et al., 2022) (https://opencompass.openxlab.space/utils/VLMEval/ChartQA_TEST.tsv); OCRVQA (Mishra et al., 2019) (https://opencompass.openxlab.space/utils/VLMEval/OCRVQA_TESTCORE.tsv); TextVQA (Singh et al., 2019) (https://opencompass.openxlab.space/utils/VLMEval/TextVQA_VAL.tsv); DocVQA (Mathew et al., 2021) (https://opencompass.openxlab.space/utils/VLMEval/DocVQA_VAL.tsv); RealWorldQA (x.ai, 2024) (https://opencompass.openxlab.space/utils/VLMEval/RealWorld.tsv); MMMU (Yue et al., 2024) (https://opencompass.openxlab.space/utils/VLMEval/MMMU_DEV_VAL.tsv); GQA (Ainslie et al., 2023) (https://opencompass.openxlab.space/utils/VLMEval/GQA_TestDev_Balanced.tsv) |
| Dataset Splits | Yes | We evaluate on multiple-choice visual question (MCVQ) and visual question answering (VQA) benchmarks. We consider 8 widely adopted MCVQ and VQA benchmarks. They are (1) the subset of ScienceQA (Lu et al., 2022) with images (SQA-I) and AI2D (Hiippala et al., 2021), which assess LMMs' scientific knowledge; (2) ChartQA (Masry et al., 2022), OCRVQA (Mishra et al., 2019), TextVQA (Singh et al., 2019) and DocVQA (Mathew et al., 2021), which examine their ability to recognize optical characters; (3) RealWorldQA (RWQA) (x.ai, 2024) and GQA (Ainslie et al., 2023), which evaluate LMMs' vision-centric capabilities; (4) MMMU (Yue et al., 2024), which assays LMMs on multi-disciplinary tasks that demand college-level subject knowledge. Note that SQA-I, AI2D, RWQA and MMMU are MCVQ datasets, while the others are VQA. ... All experiments are conducted on VLMEvalKit (Duan et al., 2024). We consider 8 datasets and their corresponding links to download TSV files via the toolkit: ... MMMU (Yue et al., 2024) (https://opencompass.openxlab.space/utils/VLMEval/MMMU_DEV_VAL.tsv); GQA (Ainslie et al., 2023) (https://opencompass.openxlab.space/utils/VLMEval/GQA_TestDev_Balanced.tsv) |
| Hardware Specification | Yes | All experiments are run on four A6000 GPUs. |
| Software Dependencies | Yes | PyTorch version is 2.0.1+cu117. All 45 models can be downloaded via Huggingface with different versions of the transformers library: transformers==4.33.0 for mPLUG-Owl2 (Li et al., 2022) and InstructBLIP (Dai et al., 2023); transformers==4.37.0 for the LLaVA-V1.5 (Liu et al., 2024a), ShareGPT4V (Chen et al., 2023a), and InternVL (Chen et al., 2024) series; transformers==latest for the LLaVA-NeXT (Liu et al., 2024b), LLaVA-OneVision (Li et al., 2024a), LLaVA-NeXT-Interleave (Li et al., 2024b), PaliGemma-3B (Beyer et al., 2024), Mantis (Jiang et al., 2024), Eagle (Shi et al., 2024b) and LLaVA Prismatic (Karamcheti et al., 2024) series. |
| Experiment Setup | Yes | To analyze the consistency in these stochastic predictions, we explore two common methods: BLEU (Papineni et al., 2002) and BERTScore (Zhang et al., 2019). Then, we use the mean value of similarities to represent the consistency for the input sample, which can be denoted as (1/T) ∑_{i=1}^{T} sim(P_i, P_ori), where T is the number of stochastic inferences, sim(·) is the similarity function (e.g., BLEU), and P_i and P_ori are the i-th stochastic prediction and the original answer, respectively. Following the practice outlined in (Chen et al., 2023b; Cobbe et al., 2021; Huang et al., 2023), we collect five stochastic inferences per sample and set t = 0.7 to maintain a relatively high degree of stochasticity in LMM generation while keeping compute overhead manageable. |
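The consistency metric quoted above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: `token_overlap` is a hypothetical stand-in for BLEU or BERTScore (kept dependency-free here), and `consistency_score` computes the stated mean (1/T) ∑ sim(P_i, P_ori) over the T stochastic predictions.

```python
from typing import Callable, List


def token_overlap(pred: str, ref: str) -> float:
    # Hypothetical stand-in for BLEU/BERTScore: the fraction of
    # prediction tokens that also appear in the reference answer.
    pred_tokens = pred.lower().split()
    ref_tokens = set(ref.lower().split())
    if not pred_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in pred_tokens) / len(pred_tokens)


def consistency_score(original: str,
                      stochastic: List[str],
                      sim: Callable[[str, str], float] = token_overlap) -> float:
    # Mean similarity between each of the T stochastic predictions P_i
    # (sampled at temperature t, e.g. t = 0.7 with T = 5 in the paper)
    # and the original answer P_ori: (1/T) * sum_i sim(P_i, P_ori).
    if not stochastic:
        raise ValueError("need at least one stochastic prediction")
    return sum(sim(p, original) for p in stochastic) / len(stochastic)
```

In practice, `sim` would be replaced by a real BLEU implementation (e.g., from `sacrebleu`) or BERTScore, and `stochastic` would hold the five temperature-sampled generations per input described in the paper.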