Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion

Authors: Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. On the specialized tasks in particular, our method achieves average improvements of 5.4% and 4.0% over the corresponding baselines when using LLaVA-1.5-7B and LLaVA-1.5-13B, respectively.
Researcher Affiliation | Collaboration | 1. School of Informatics, Xiamen University, China; 2. Xiamen Unisound Intelligence Technology Co., Ltd; 3. Shanghai Artificial Intelligence Laboratory, China; 4. Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China
Pseudocode | No | The paper describes methods and processes through text and figures (e.g., Figure 2 and Figure 3 illustrate the data construction and self-improvement overview), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and the supplementary file are available at https://github.com/XMUDeepLIT/CVC.
Open Datasets | Yes | The CVC instances are constructed based on the COCO dataset [Lin et al., 2014]. We conduct in-depth analyses on a range of challenging specialized tasks and widely-used comprehensive benchmarks, aiming to test the effectiveness of our method on the deep visual perception and general capabilities of LVLMs, respectively. Challenging specialized tasks: MMVP [Tong et al., 2024], Winoground [Thrush et al., 2022], V*Bench [Wu and Xie, 2024], and VSR [Liu et al., 2023b]; comprehensive benchmarks: MME [Fu et al., 2023], MMBench [Liu et al., 2023c], SEED-Bench [Li et al., 2023], and MM-Vet [Yu et al., 2023].
Dataset Splits | Yes | By default, we use 90K of our data for training across all experiments unless otherwise noted. During training, we combine our data with the 665K instruction data from LLaVA-1.5 for multimodal instruction tuning. We follow [Liu et al., 2024a] to use the same testing scripts and evaluation metrics for fair comparison.
Hardware Specification | Yes | All experiments are conducted on 8 A100 80G GPUs.
Software Dependencies | No | The paper mentions several models and frameworks used (e.g., LLaMA2-7B, RoBERTa, GLIP, SAM, LLaVA-1.5), but does not provide specific version numbers for any software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We set γ, N, and α to 0.3, 16, and 0.75, respectively. During training, we combine our data with the 665K instruction data from LLaVA-1.5 for multimodal instruction tuning. To ensure a fair comparison, our training starts from the pretrained (i.e., not yet instruction-tuned) weights of LLaVA-1.5, following the same training hyperparameters.
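The reported setup can be collected into a single configuration sketch. This is a hypothetical summary for reproduction notes only: the key names (`gamma`, `N`, `alpha`, etc.) and the dictionary layout are assumptions, and the roles of γ, N, and α are as defined in the paper; only the values and dataset sizes come from the quoted text.

```python
# Hypothetical reproduction config; key names are illustrative assumptions,
# values are taken from the paper's reported setup.
cvc_config = {
    "gamma": 0.3,    # hyperparameter γ from the paper
    "N": 16,         # hyperparameter N from the paper
    "alpha": 0.75,   # hyperparameter α from the paper
    "cvc_instances": 90_000,            # default CVC training data ("90K")
    "llava_instruction_data": 665_000,  # LLaVA-1.5 instruction data ("665K")
    "base_models": ["LLaVA-1.5-7B", "LLaVA-1.5-13B"],
    "hardware": "8x A100 80G",
}

# Size of the combined multimodal instruction-tuning mixture implied above.
total_training_examples = (
    cvc_config["cvc_instances"] + cvc_config["llava_instruction_data"]
)
print(total_training_examples)  # 755000
```

The mixture size (755K) follows directly from combining the 90K CVC instances with the 665K LLaVA-1.5 instruction data, as stated in the Dataset Splits row.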