Is Your Video Language Model a Reliable Judge?

Authors: Ming Liu, Wensheng Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | This study investigates the efficacy of such approaches, particularly when the pool of judges includes both reliable and unreliable models. Our findings reveal that incorporating collective judgments from such a mixed pool does not necessarily improve the accuracy of the final evaluation. ... To explore the factors that impact evaluation reliability, we fine-tune an underperforming VLM judge, Video-LLaVA, and observe that improved understanding ability alone is insufficient to make VLM judges more reliable. ... Our experiments with collective thought approaches do not yield significant improvements in evaluation reliability. The mixing of reliable and unreliable judges introduces noise. Even when selecting judges based on reliability scores, the mixture of judges does not substantially enhance agreement with the Agent-Debate method.

Researcher Affiliation | Academia | Ming Liu, Wensheng Zhang, Department of Computer Science, Iowa State University. EMAIL

Pseudocode | No | The paper describes its methodology in prose and through diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.

Open Source Code | No | The paper does not provide any explicit statement about releasing its own code or a link to a repository for the methodology described. It only mentions the VideoChatGPT dataset with a HuggingFace link.

Open Datasets | Yes | Phase 1: Video-Question Pair Collection. The Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) (Khattak et al., 2024) is a dataset that comprehensively assesses VLMs across 11 diverse real-world visual dimensions Vd (see Table 2), such as interpretation of social context. ... To verify the generality of our findings, we also performed experiments on the VideoChatGPT dataset [1]. The results are presented in Appendix F. They are consistent with those from the CVRR-ES: less capable VLMs (e.g., Video-LLaVA) systematically overrate candidates, whereas GPT-4o maintains moderate agreement with the text-only reference-guided judge. These additional experiments reinforce our conclusion that less capable VLMs are unreliable as judges in different datasets and conditions. [1] https://huggingface.co/datasets/lmms-lab/VideoChatGPT

Dataset Splits | No | The paper mentions using the CVRR-ES dataset and the VideoChatGPT dataset, but it does not specify any training, validation, or test splits for these datasets within the context of their experiments. It describes collecting video-question pairs and VLM responses, but not how this data was partitioned for evaluation.

Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments or training the models.

Software Dependencies | No | The paper lists various models (e.g., Video-LLaVA, GPT-4o) but does not specify any software dependencies (e.g., programming languages, libraries, frameworks) with version numbers that would be needed to replicate the experimental setup.

Experiment Setup | No | The paper describes the overall experimental approach, including the models used as candidates and judges, and the evaluation metrics (Weighted Cohen's Kappa). However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other detailed training configurations for any of the models involved, including the fine-tuning of Video-LLaVA.
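The agreement metric named above, Weighted Cohen's Kappa, is straightforward to reproduce with scikit-learn. The sketch below uses made-up judge scores (not the paper's data) to illustrate the paper's qualitative finding: averaging a reliable judge with a systematically overrating one lowers agreement with the reference scores.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative reference scores on a 0-5 scale (hypothetical, not the paper's data).
reference = [0, 1, 2, 3, 4, 5] * 10

# A reliable judge that matches the reference exactly.
reliable = list(reference)

# An unreliable judge that overrates every response with the maximum score,
# an exaggerated version of the overrating behavior the paper reports.
unreliable = [5] * len(reference)

# A naive "collective" judgment: the floor-average of the two judges' scores.
mixed = [(r + u) // 2 for r, u in zip(reliable, unreliable)]

# Quadratic-weighted Cohen's kappa, a standard choice for ordinal ratings.
kappa_reliable = cohen_kappa_score(reference, reliable, weights="quadratic")
kappa_mixed = cohen_kappa_score(reference, mixed, weights="quadratic")

print(f"reliable judge kappa: {kappa_reliable:.3f}")  # identical ratings give 1.000
print(f"mixed-pool kappa:     {kappa_mixed:.3f}")     # dragged down by the overrater
```

The drop in the mixed-pool kappa is the noise effect in miniature: the overrater's bias survives averaging, so the pooled judgment agrees less with the reference than the reliable judge alone.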