TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Authors: Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate the effectiveness of our approach to uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks." The authors conduct extensive experiments to reveal the presence of prompt-induced underestimation and bias in MLLM evaluation and demonstrate that the TP-Eval framework effectively mitigates these issues. |
| Researcher Affiliation | Academia | 1. Shanghai Artificial Intelligence Laboratory; 2. School of Computer Science, Shanghai Jiao Tong University; 3. Zhiyuan College, Shanghai Jiao Tong University. The affiliations indicate a mix of university departments and a research laboratory typically associated with public research, suggesting an academic setting. |
| Pseudocode | No | The paper describes the framework of TP-Eval and its prompt customization structure using diagrams (Figure 2 and Figure 3) and textual descriptions, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | No | The paper mentions using the third-party tool VLMEvalKit [Duan et al., 2024] "to implement the answer extraction module in MA", but it does not include any explicit statement about making the source code for the methodology described in this paper (TP-Eval) publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use MMT-Bench and MMMU as the evaluation benchmarks. MMT-Bench is designed for the evaluation of general capabilities, while MMMU is designed for multi-discipline evaluation. Considering our limited resources, we select a subset of MMT-Bench as MMT-S, which contains 83 tasks (19 categories). We use the development set and validation set of MMMU. MMT-Bench by [Ying et al., 2024] and MMMU by [Yue et al., 2024] are cited. |
| Dataset Splits | Yes | For MMT-S, we utilize the officially designated validation set as D_few, which comprises approximately 10% of the total data, with roughly 20 samples per task. For MMMU, we combine the development and validation sets and allocate half of the data as D_few. |
| Hardware Specification | No | The paper mentions using specific MLLMs (LLaVA-1.5-7B, DeepSeek-VL-7B, Mini-InternVL-Chat-4B-V1-5) and GPT-4o-mini as an optimizer and answer analyzer, but it does not provide any specific hardware details such as GPU or CPU models, memory, or cloud computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using GPT-4o-mini, BERT, and VLMEvalKit, but it does not specify any programming language versions, library versions, or other software dependencies with concrete version numbers required for reproducibility. |
| Experiment Setup | Yes | The total optimization iteration N = 16, with each round generating three new prompts. In each iteration, we select the top eight (i.e., K = 8) prompts for the meta prompt. We set the temperature to 1.0 when generating new prompts. During the optimization phase, we set α to 0.8 to encourage the exploration of prompts that yield higher accuracy. In the final step, we set α to 0.6 to select the optimal prompt. |
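The hyperparameters quoted in the Experiment Setup row (N = 16 iterations, three new prompts per round, top K = 8 for the meta prompt, temperature 1.0, and an α weight of 0.8 during search and 0.6 at final selection) describe an iterative prompt-search loop. The sketch below is a minimal reconstruction under stated assumptions, not the authors' implementation: `evaluate` (returning an accuracy plus a secondary score) and `generate` (the optimizer-LLM call) are hypothetical stand-ins, and the exact form of the α-weighted objective is assumed to be a simple convex combination.

```python
def score(accuracy, secondary, alpha):
    """Assumed objective: alpha weights accuracy against a secondary criterion."""
    return alpha * accuracy + (1 - alpha) * secondary

def optimize(seed_prompt, evaluate, generate, n_iters=16, k=8,
             per_round=3, alpha=0.8, final_alpha=0.6):
    """Iterative prompt search matching the reported hyperparameters.

    evaluate(prompt) -> (accuracy, secondary)   # hypothetical stand-in
    generate(top_k_prompts, n) -> list[str]     # optimizer-LLM call, temperature 1.0
    """
    pool = [(seed_prompt, *evaluate(seed_prompt))]
    for _ in range(n_iters):
        # Keep the top-K prompts under the exploration objective (alpha = 0.8)
        # to build the meta prompt for the optimizer.
        pool.sort(key=lambda p: score(p[1], p[2], alpha), reverse=True)
        top_k = [p[0] for p in pool[:k]]
        # Each round asks the optimizer for `per_round` new candidate prompts.
        for cand in generate(top_k, n=per_round):
            pool.append((cand, *evaluate(cand)))
    # Final selection uses the lower weight (alpha = 0.6).
    return max(pool, key=lambda p: score(p[1], p[2], final_alpha))[0]
```

With toy `evaluate`/`generate` callables the loop simply greedily extends the best candidate; in the paper's setting the evaluator would run the MLLM on the few-shot split D_few.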