TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Authors: Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate the effectiveness of our approach to uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks." The authors conduct extensive experiments to reveal the presence of prompt-induced underestimation and bias in MLLM evaluation and demonstrate that the TP-Eval framework effectively mitigates these issues. |
| Researcher Affiliation | Academia | 1. Shanghai Artificial Intelligence Laboratory; 2. School of Computer Science, Shanghai Jiao Tong University; 3. Zhiyuan College, Shanghai Jiao Tong University. The affiliations indicate a mix of university departments and a research laboratory typically associated with public research, suggesting an academic setting. |
| Pseudocode | No | The paper describes the framework of TP-Eval and its prompt customization structure using diagrams (Figure 2 and Figure 3) and textual descriptions, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | No | The paper mentions using the third-party tool VLMEvalKit [Duan et al., 2024] "to implement the answer extraction module in MA", but it does not include any explicit statement about making the source code for the methodology described in this paper (TP-Eval) publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use MMT-Bench and MMMU as the evaluation benchmarks. MMT-Bench is designed for the evaluation of general capabilities, while MMMU is designed for multi-discipline evaluation. Considering our limited resources, we select a subset of MMT-Bench as MMT-S, which contains 83 tasks (19 categories). We use the development set and validation set of MMMU. MMT-Bench by [Ying et al., 2024] and MMMU by [Yue et al., 2024] are cited. |
| Dataset Splits | Yes | For MMT-S, we utilize the officially designated validation set as D_few, which comprises approximately 10% of the total data, with roughly 20 samples per task. For MMMU, we combine the development and validation sets and allocate half of the data as D_few. |
| Hardware Specification | No | The paper mentions using specific MLLMs (LLaVA-1.5-7B, DeepSeek-VL-7B, Mini-InternVL-Chat-4B-V1-5) and GPT-4o-mini as an optimizer and answer analyzer, but it does not provide any specific hardware details such as GPU or CPU models, memory, or cloud computing specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using GPT-4o-mini, BERT, and VLMEvalKit, but it does not specify any programming language versions, library versions, or other software dependencies with concrete version numbers required for reproducibility. |
| Experiment Setup | Yes | The total optimization iteration N = 16, with each round generating three new prompts. In each iteration, we select the top eight (i.e., K = 8) prompts for the meta prompt. We set the temperature to 1.0 when generating new prompts. During the optimization phase, we set α to 0.8 to encourage the exploration of prompts that yield higher accuracy. In the final step, we set α to 0.6 to select the optimal prompt. |
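The hyperparameters quoted in the Experiment Setup row (N = 16 iterations, three new prompts per round, top K = 8 for the meta prompt, temperature 1.0, and an α weight of 0.8 during search and 0.6 at final selection) describe an iterative prompt-search loop. The sketch below is a minimal reconstruction under stated assumptions, not the authors' implementation: `evaluate` (returning an accuracy plus a secondary score) and `generate` (the optimizer-LLM call) are hypothetical stand-ins, and the exact form of the α-weighted objective is assumed to be a simple convex combination.

```python
def score(accuracy, secondary, alpha):
    """Assumed objective: alpha weights accuracy against a secondary criterion."""
    return alpha * accuracy + (1 - alpha) * secondary

def optimize(seed_prompt, evaluate, generate, n_iters=16, k=8,
             per_round=3, alpha=0.8, final_alpha=0.6):
    """Iterative prompt search matching the reported hyperparameters.

    evaluate(prompt) -> (accuracy, secondary)   # hypothetical stand-in
    generate(top_k_prompts, n) -> list[str]     # optimizer-LLM call, temperature 1.0
    """
    pool = [(seed_prompt, *evaluate(seed_prompt))]
    for _ in range(n_iters):
        # Keep the top-K prompts under the exploration objective (alpha = 0.8)
        # to build the meta prompt for the optimizer.
        pool.sort(key=lambda p: score(p[1], p[2], alpha), reverse=True)
        top_k = [p[0] for p in pool[:k]]
        # Each round asks the optimizer for `per_round` new candidate prompts.
        for cand in generate(top_k, n=per_round):
            pool.append((cand, *evaluate(cand)))
    # Final selection uses the lower weight (alpha = 0.6).
    return max(pool, key=lambda p: score(p[1], p[2], final_alpha))[0]
```

With toy `evaluate`/`generate` callables the loop simply greedily extends the best candidate; in the paper's setting the evaluator would run the MLLM on the few-shot split D_few.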