Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Authors: Yue Yang, Shuibo Zhang, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo, Wenqi Shao

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs. Section 5 is titled 'EXPERIMENT' and contains tables (e.g., Tables 1, 2, and 3) and figures (e.g., Figures 5-9) presenting empirical results and performance metrics.
Researcher Affiliation | Collaboration | The authors are affiliated with Shanghai Jiao Tong University (academic), Shanghai AI Laboratory (research laboratory), Tongji University (academic), and The University of Hong Kong (academic). The mix of universities and a research laboratory indicates a collaborative affiliation.
Pseudocode | No | The paper describes the proposed framework and strategies (e.g., image and language bootstrapping) in detail, including instructions for GPT-4 (Table 4) and a judge module (Table 5). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code.
Open Source Code | Yes | The paper states: 'Our code will be available at https://github.com/yangyue5114/DME.'
Open Datasets | Yes | 'We selected five popular benchmarks to assess current LVLMs, encompassing Yes/No Questions (MME), Multiple Choice Questions (MMBench, SEEDBench), and Visual Question Answering (MMvet, LLaVABench).' These benchmarks include a broad spectrum of cognitive and comprehension tasks. In Sections 5.2 and 5.3, the authors employ three comparable datasets in terms of size: MME, MMBench (30%), and SEEDBench (10%), then extend their dynamic strategies to the full set of MMBench, MMvet, and LLaVABench in Section 5.3.
Dataset Splits | No | In Section 5.1 'Tasks and Datasets', the paper states: 'In Section 5.2 and 5.3, we employ three comparable datasets in terms of size: MME, MMBench (30%), and SEEDBench (10%) as the experimental datasets.' While percentages are given, the paper does not specify the methodology (e.g., random sampling, specific seed) for creating these 30% or 10% subsets, which is crucial for reproducing their exact data partitioning.
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments. It mentions evaluating LVLMs, some of which are closed-source APIs, but no specific GPU/CPU models or other hardware details are provided for the authors' own experimental setup or variant generation.
Software Dependencies | No | The paper mentions using 'GPT-4V (Achiam et al., 2023)', 'PowerPaint (Zhuang et al., 2023)', and 'VLMEvalkit (Duan et al., 2024)' but does not provide specific version numbers for any of these software components, which would be necessary for reproducibility.
Experiment Setup | Yes | 'We utilize the standardized evaluation platform VLMEvalkit (Duan et al., 2024) and set the generation temperature as 0 for all evaluated LVLMs to ensure a fair comparison. We set the extension ratio r = 1.5 for the main experiments.'
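The Dataset Splits row flags that the 30% (MMBench) and 10% (SEEDBench) subsets are reported without a sampling procedure or seed. A minimal sketch of what a fully reproducible subset specification could look like; the function name, seed value, and sorting choice are illustrative assumptions, not the paper's actual method:

```python
import random

def sample_subset(item_ids, fraction, seed=0):
    """Draw a reproducible random subset covering `fraction` of `item_ids`.

    A fixed seed plus a deterministic sampler is exactly the detail a
    paper would need to report for a "30% of MMBench" subset to be
    reconstructable by readers. (Hypothetical sketch, not the authors'
    procedure.)
    """
    rng = random.Random(seed)          # local RNG, independent of global state
    k = round(len(item_ids) * fraction)
    return sorted(rng.sample(item_ids, k))  # sort for a stable ordering

# Example: a 30% subset of 10 question IDs is identical across calls
# with the same seed, so the split can be stated in one line.
ids = list(range(10))
assert sample_subset(ids, 0.3, seed=42) == sample_subset(ids, 0.3, seed=42)
```

Reporting just the tuple (benchmark version, fraction, seed, sampler) would have been sufficient for exact reconstruction of the experimental subsets.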