Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Authors: Chenyu Zhou, Mengdan Zhang, Peixian Chen, Chaoyou Fu, Yunhang Shen, Xiawu Zheng, Xing Sun, Rongrong Ji

ICLR 2025

Reproducibility variables, each listed with the assessed result and the LLM's supporting response:
Research Type: Experimental. "Our evaluation of four leading closed-source models, as well as various open-source models using VEGA, underscores the rigorous nature of IITC. Even the most advanced models, such as Gemini-1.5-pro and GPT-4V, achieved only modest success. By employing a multi-task, multi-scale post-training strategy, we have set a robust baseline for MLLMs on the IITC task, attaining an 85.8% accuracy rate in image association and a 0.508 Rouge score. These results validate the effectiveness of our dataset in improving MLLMs' capabilities for nuanced image-text comprehension."
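The response above cites a 0.508 Rouge score for answer quality. For reference, here is a minimal, self-contained ROUGE-L F1 sketch; the paper's exact ROUGE variant and tokenization are not specified in this report, so this is an illustrative implementation, not the authors' evaluation code:

```python
def lcs_len(a: list, b: list) -> int:
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    # Whitespace tokenization is an assumption; real evaluations often
    # lowercase, strip punctuation, or use a subword tokenizer.
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("a b c", "a c")` yields 0.8 (LCS of 2, precision 1.0, recall 2/3).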
Researcher Affiliation: Academia. Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University.
Pseudocode: No. The paper describes tasks (IITC, ITA) and data construction processes, but it does not contain any explicitly labeled pseudocode blocks or algorithms in a structured format.
Open Source Code: No. The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide any links to a code repository.
Open Datasets: Yes. "Introducing the VEGA Dataset. We develop a new VEGA dataset for the IITC task that enables a comprehensive understanding of scientific literature, whose multi-modal context reaches up to 8,000 tokens in length and contains up to 8 images. We develop the VEGA dataset upon the foundation of SciGraphQA (Li & Tajbakhsh, 2023)."
Dataset Splits: Yes. "The IITC subset is segmented into two categories based on token length: one supports up to 4,000 tokens, while the other extends to 8,000 tokens. Here, images are equated to 256 tokens each. Both categories offer roughly 200k training instances and 700 meticulously curated, high-caliber test samples. The ITA task is categorized into six divisions, with two dedicated to image quantity and three to text length. Table 1 presents the train and test data statistics for the VEGA dataset, while Fig. 4 details the distribution of image numbers and token counts within the IITC subset, providing insights into the visual and textual context lengths."
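The split description equates each image to 256 tokens when measuring the multimodal context. A small sketch of that bookkeeping (function names are illustrative, not from the paper):

```python
IMAGE_TOKENS = 256  # per the report: each image counts as 256 tokens

def context_length(text_tokens: int, num_images: int) -> int:
    """Effective multimodal context length of one interleaved sample."""
    return text_tokens + num_images * IMAGE_TOKENS

def fits_budget(text_tokens: int, num_images: int, max_tokens: int = 8000) -> bool:
    """Check whether a sample fits a category's token budget (4,000 or 8,000)."""
    return context_length(text_tokens, num_images) <= max_tokens
```

Under this accounting, a sample with 8 images spends 2,048 of its 8,000-token budget on images, leaving at most 5,952 tokens of text.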
Hardware Specification: No. The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It mentions MLLMs but not the underlying hardware for their training or evaluation.
Software Dependencies: No. The paper mentions several models, including Qwen-VL-Chat, Gemini-1.5-pro, GPT-4V, InternVL-1.5, Qwen2-VL-Instruct, and Llama-3.1-70B-Instruct, as well as the VLMEvalKit toolkit. However, it does not specify version numbers for any of these software components, which is required for a reproducible dependency description.
Experiment Setup: No. "We fine-tune the Qwen-VL-Chat (Bai et al., 2023) model at two distinct maximum token lengths, 4k and 8k, training a dedicated model for each configuration, denoted as VEGA-Base-4k and VEGA-Base-8k. For more training details, please see our supplementary materials."
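The two fine-tuning configurations quoted above can be summarized as a small config sketch; the field names, and the choice of 4,000 and 8,000 as the concrete limits (matching the dataset's two categories), are assumptions, since the actual training details are in the paper's supplementary materials:

```python
# Illustrative sketch only: field names and values are assumptions,
# not taken from the paper's supplementary materials.
CONFIGS = {
    "VEGA-Base-4k": {"base_model": "Qwen-VL-Chat", "max_context_tokens": 4000},
    "VEGA-Base-8k": {"base_model": "Qwen-VL-Chat", "max_context_tokens": 8000},
}

def pick_config(context_tokens: int) -> str:
    """Choose the smallest configuration whose budget covers a sample."""
    ordered = sorted(CONFIGS.items(), key=lambda kv: kv[1]["max_context_tokens"])
    for name, cfg in ordered:
        if context_tokens <= cfg["max_context_tokens"]:
            return name
    raise ValueError("sample exceeds the longest supported context")
```

This mirrors the paper's setup of training a dedicated model per maximum token length rather than one model for all lengths.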