Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Authors: Yue Yang, Shuibo Zhang, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo, Wenqi Shao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs. Section 5 is titled 'EXPERIMENT' and contains tables (e.g., Tables 1, 2, and 3) and figures (e.g., Figures 5-9) presenting empirical results and performance metrics.
Researcher Affiliation | Collaboration | The authors are affiliated with Shanghai Jiao Tong University (academic), Shanghai AI Laboratory (research laboratory), Tongji University (academic), and The University of Hong Kong (academic). The mix of universities and a research laboratory indicates a collaborative affiliation.
Pseudocode | No | The paper describes the proposed framework and strategies (e.g., image and language bootstrapping) in detail, including instructions for GPT-4 (Table 4) and a judge module (Table 5). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code.
Open Source Code | Yes | The paper states: 'Our code will be available at https://github.com/yangyue5114/DME.'
Open Datasets | Yes | 'We selected five popular benchmarks to assess current LVLMs, encompassing Yes/No Questions (MME), Multiple Choice Questions (MMBench, SEEDBench), and Visual Question Answering (MMvet, LLaVABench).' These benchmarks include a broad spectrum of cognitive and comprehension tasks. In Sections 5.2 and 5.3, the authors employ three comparable datasets in terms of size: MME, MMBench (30%), and SEEDBench (10%), then extend their dynamic strategies to the full set of MMBench, MMvet, and LLaVABench in Section 5.3.
Dataset Splits | No | In Section 5.1 'Tasks and Datasets', the paper states: 'In Section 5.2 and 5.3, we employ three comparable datasets in terms of size: MME, MMBench (30%), and SEEDBench (10%) as the experimental datasets.' While percentages are given, the paper does not specify the methodology (e.g., random sampling, specific seed) for creating these 30% or 10% subsets, which is crucial for reproducing their exact data partitioning.
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions evaluating LVLMs, some of which are closed-source APIs, but no specific GPU/CPU models or other hardware details are provided for the authors' own experimental setup or variant generation.
Software Dependencies | No | The paper mentions using 'GPT-4V (Achiam et al., 2023)', 'PowerPaint (Zhuang et al., 2023)', and 'VLMEvalkit (Duan et al., 2024)' but does not provide specific version numbers for any of these software components, which would be necessary for reproducibility.
Experiment Setup | Yes | 'We utilize the standardized evaluation platform VLMEvalkit (Duan et al., 2024) and set the generation temperature as 0 for all evaluated LVLMs to ensure a fair comparison. We set the extension ratio r = 1.5 for the main experiments.'
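The Dataset Splits row flags that the 30% (MMBench) and 10% (SEEDBench) subsets are reported without a sampling procedure or seed. A minimal sketch of what a fully reproducible subset specification could look like; the function name, seed value, and sorting choice are illustrative assumptions, not the paper's actual method:

```python
import random

def sample_subset(item_ids, fraction, seed=0):
    """Draw a reproducible random subset covering `fraction` of `item_ids`.

    A fixed seed plus a deterministic sampler is exactly the detail a
    paper would need to report for a "30% of MMBench" subset to be
    reconstructable by readers. (Hypothetical sketch, not the authors'
    procedure.)
    """
    rng = random.Random(seed)          # local RNG, independent of global state
    k = round(len(item_ids) * fraction)
    return sorted(rng.sample(item_ids, k))  # sort for a stable ordering

# Example: a 30% subset of 10 question IDs is identical across calls
# with the same seed, so the split can be stated in one line.
ids = list(range(10))
assert sample_subset(ids, 0.3, seed=42) == sample_subset(ids, 0.3, seed=42)
```

Reporting just the tuple (benchmark version, fraction, seed, sampler) would have been sufficient for exact reconstruction of the experimental subsets.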