DS-VLM: Diffusion Supervision Vision Language Model

Authors: Zhen Sun, Yunhang Shen, Jie Li, Xing Sun, Pingyang Dai, Liujuan Cao, Rongrong Ji

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted across various visual encoders and LLMs of different scales demonstrate the effectiveness of our approach.
Researcher Affiliation | Collaboration | (1) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. (2) Tencent YouTu Lab. (3) School of Informatics, Xiamen University, Xiamen, China. (4) Institute of Artificial Intelligence, Xiamen University, Xiamen, China. Correspondence to: Liujuan Cao <EMAIL>.
Pseudocode | No | The paper describes methods and equations (e.g., in Sections 3.2 and 3.3) but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not explicitly state that the code for DS-VLM is open-source, nor does it provide a direct link to a code repository. It mentions implementing the strategy on top of LLaVA-1.5, an existing framework, but does not confirm code release for their specific contribution.
Open Datasets | Yes | Focusing on proposing a novel optimization method for the VLM framework, we do not incorporate any additional data beyond the LLaVA-1.5 open-source dataset (Liu et al., 2024a)... The evaluation datasets include: MMB (MMBench) (Liu et al., 2025), MMS (MMStar) (Chen et al., 2024), MMMU (Yue et al., 2024), MV (MathVista) (Lu et al., 2023), OCRB (OCRBench) (Liu et al., 2023), AI2D (Hiippala et al., 2021), HB (HallusionBench) (Guan et al., 2024), LB (LLaVA-Bench) (Liu et al., 2024b), SQA (ScienceQA) (Saikh et al., 2022), and MME (Fu et al., 2024).
Dataset Splits | Yes | Focusing on proposing a novel optimization method for the VLM framework, we do not incorporate any additional data beyond the LLaVA-1.5 open-source dataset (Liu et al., 2024a), which includes 558K image captions for pre-training and 665K conversations for instruction tuning. We also apply our proposed method to the Mini-Gemini dataset (Team), which consists of 1.2M + 1.5M data, to further highlight the superiority of our approach. For training configurations, we adhere strictly to the settings outlined in the original LLaVA-1.5 paper to ensure fairness, with learning rates of 1e-3 and 2e-5 for the pre-training and instruction fine-tuning phases, respectively, and batch sizes of 256 and 128.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions that "The training process for DS-VLM utilizes the PyTorch framework" but does not specify the version of PyTorch or other software dependencies with their version numbers.
Experiment Setup | Yes | For training configurations, we adhere strictly to the settings outlined in the original LLaVA-1.5 paper to ensure fairness, with learning rates of 1e-3 and 2e-5 for the pre-training and instruction fine-tuning phases, respectively, and batch sizes of 256 and 128. During LoRA fine-tuning, the rank of all linear layers is uniformly set to 8. We select the 8th, 16th, and 24th layers of the vision encoder as the feature representatives of the low, mid, and high levels, respectively. The number of iterations for the diffusion model is 50.
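The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The class and field names below are hypothetical, chosen only to group the values the review extracted; they are not taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DSVLMTrainingConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    pretrain_lr: float = 1e-3            # pre-training learning rate
    finetune_lr: float = 2e-5            # instruction fine-tuning learning rate
    pretrain_batch_size: int = 256       # pre-training batch size
    finetune_batch_size: int = 128       # instruction fine-tuning batch size
    lora_rank: int = 8                   # rank of all linear layers during LoRA fine-tuning
    vision_feature_layers: List[int] = field(
        default_factory=lambda: [8, 16, 24]  # low-, mid-, and high-level vision encoder layers
    )
    diffusion_steps: int = 50            # number of diffusion model iterations

config = DSVLMTrainingConfig()
print(config.pretrain_lr, config.lora_rank, config.vision_feature_layers)
```

A structure like this makes it easy to check that a reproduction run matches the reported settings before launching training.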