Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Authors: Chenyu Zhou, Mengdan Zhang, Peixian Chen, Chaoyou Fu, Yunhang Shen, Xiawu Zheng, Xing Sun, Rongrong Ji

ICLR 2025

Reproducibility variables, each listed with the assessed result and the LLM's supporting response:
Research Type: Experimental. "Our evaluation of four leading closed-source models, as well as various open-source models using VEGA, underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT-4V, achieved only modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an 85.8% accuracy rate in image association and a 0.508 Rouge score. These results validate the effectiveness of our dataset in improving MLLMs' capabilities for nuanced image-text comprehension."
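The response above cites a 0.508 Rouge score for answer quality. For reference, here is a minimal, self-contained ROUGE-L F1 sketch; the paper's exact ROUGE variant and tokenization are not specified in this report, so this is an illustrative implementation, not the authors' evaluation code:

```python
def lcs_len(a: list, b: list) -> int:
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    # Whitespace tokenization is an assumption; real evaluations often
    # lowercase, strip punctuation, or use a subword tokenizer.
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("a b c", "a c")` yields 0.8 (LCS of 2, precision 1.0, recall 2/3).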
Researcher Affiliation: Academia. Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University.
Pseudocode: No. The paper describes tasks (IITC, ITA) and data construction processes, but it does not contain any explicitly labeled pseudocode blocks or algorithms in a structured format.
Open Source Code: No. The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide any links to a code repository.
Open Datasets: Yes. "Introducing the VEGA Dataset. We develop a new VEGA dataset for the IITC task that enables a comprehensive understanding of scientific literature, whose multi-modal context reaches up to 8,000 tokens in length and contains up to 8 images. We develop the VEGA dataset upon the foundation of SciGraphQA (Li & Tajbakhsh, 2023)."
Dataset Splits: Yes. "The IITC subset is segmented into two categories based on token length: one supports up to 4,000 tokens, while the other extends to 8,000 tokens. Here, images are equated to 256 tokens each. Both categories offer roughly 200k training instances and 700 meticulously curated, high-caliber test samples. The ITA task is categorized into six divisions, with two dedicated to image quantity and three to text length. Table 1 presents the train and test data statistics for the VEGA dataset, while Fig. 4 details the distribution of image numbers and token counts within the IITC subset, providing insights into the visual and textual context lengths."
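The split description equates each image to 256 tokens when measuring the multimodal context. A small sketch of that bookkeeping (function names are illustrative, not from the paper):

```python
IMAGE_TOKENS = 256  # per the report: each image counts as 256 tokens

def context_length(text_tokens: int, num_images: int) -> int:
    """Effective multimodal context length of one interleaved sample."""
    return text_tokens + num_images * IMAGE_TOKENS

def fits_budget(text_tokens: int, num_images: int, max_tokens: int = 8000) -> bool:
    """Check whether a sample fits a category's token budget (4,000 or 8,000)."""
    return context_length(text_tokens, num_images) <= max_tokens
```

Under this accounting, a sample with 8 images spends 2,048 of its 8,000-token budget on images, leaving at most 5,952 tokens of text.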
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It mentions MLLMs but not the underlying hardware for their training or evaluation.
Software Dependencies: No. The paper mentions several models, including Qwen-VL-Chat, Gemini-1.5-pro, GPT-4V, InternVL-1.5, Qwen2-VL-Instruct, and Llama-3.1-70B-Instruct, as well as the VLMEvalKit toolkit. However, it does not specify version numbers for any of these software components, which is required for a reproducible dependency description.
Experiment Setup: No. "We fine-tune the Qwen-VL-Chat (Bai et al., 2023) model at two distinct maximum token lengths, 4k and 8k, training a dedicated model for each configuration, denoted as VEGA-Base-4k and VEGA-Base-8k. For more training details, please see our supplementary materials."
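The two fine-tuning configurations quoted above can be summarized as a small config sketch; the field names, and the choice of 4,000 and 8,000 as the concrete limits (matching the dataset's two categories), are assumptions, since the actual training details are in the paper's supplementary materials:

```python
# Illustrative sketch only: field names and values are assumptions,
# not taken from the paper's supplementary materials.
CONFIGS = {
    "VEGA-Base-4k": {"base_model": "Qwen-VL-Chat", "max_context_tokens": 4000},
    "VEGA-Base-8k": {"base_model": "Qwen-VL-Chat", "max_context_tokens": 8000},
}

def pick_config(context_tokens: int) -> str:
    """Choose the smallest configuration whose budget covers a sample."""
    ordered = sorted(CONFIGS.items(), key=lambda kv: kv[1]["max_context_tokens"])
    for name, cfg in ordered:
        if context_tokens <= cfg["max_context_tokens"]:
            return name
    raise ValueError("sample exceeds the longest supported context")
```

This mirrors the paper's setup of training a dedicated model per maximum token length rather than one model for all lengths.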