Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion
Authors: Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. |
| Researcher Affiliation | Collaboration | (1) School of Informatics, Xiamen University, China; (2) Xiamen Unisound Intelligence Technology Co., Ltd; (3) Shanghai Artificial Intelligence Laboratory, China; (4) Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China |
| Pseudocode | No | The paper describes methods and processes through text and figures (e.g., Figure 2 and Figure 3 illustrate data construction and self-improvement overview), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and the supplementary file are available at https://github.com/XMUDeepLIT/CVC. |
| Open Datasets | Yes | The CVC instances are constructed based on the COCO dataset [Lin et al., 2014]. We conduct in-depth analyses on a range of challenging specialized tasks and widely-used comprehensive benchmarks, aiming to test the effectiveness of our method on the deep visual perception and general capabilities of LVLMs, respectively. Challenging specialized tasks: MMVP [Tong et al., 2024], Winoground [Thrush et al., 2022], V*Bench [Wu and Xie, 2024], and VSR [Liu et al., 2023b]; comprehensive benchmarks: MME [Fu et al., 2023], MMBench [Liu et al., 2023c], SEEDBench [Li et al., 2023], and MM-Vet [Yu et al., 2023]. |
| Dataset Splits | Yes | By default, we use 90K of our data for training across all experiments unless otherwise noted. During training, we combine our data with the 665K instruction data from LLaVA-1.5 for multimodal instruction tuning. We follow [Liu et al., 2024a] to use the same testing scripts and evaluation metrics for fair comparison. |
| Hardware Specification | Yes | All experiments are conducted on 8 A100 80G GPUs. |
| Software Dependencies | No | The paper mentions several models and frameworks used (e.g., LLaMA2-7B, RoBERTa, GLIP, SAM, LLaVA-1.5), but does not provide specific version numbers for any software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We set γ, N, and α to 0.3, 16, and 0.75, respectively. During training, we combine our data with the 665K instruction data from LLaVA-1.5 for multimodal instruction tuning. To ensure a fair comparison, our training starts from the pretrained (i.e., not yet instruction-tuned) weights of LLaVA-1.5, following the same training hyperparameters. |
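The setup rows above describe mixing the paper's 90K CVC instances into LLaVA-1.5's 665K instruction-tuning data before training. A minimal sketch of that data-mixing step is below; it assumes both datasets are stored as JSON lists of instruction records (as in LLaVA-1.5's `llava_v1_5_mix665k.json` format) and the file paths and function name are illustrative, not from the paper's released code.

```python
import json
import random


def mix_instruction_data(cvc_path, llava_path, seed=42):
    """Merge CVC instances with LLaVA-1.5 instruction data, then shuffle.

    Assumes each file is a JSON list of instruction records
    (the llava_v1_5_mix665k.json convention); paths are hypothetical.
    """
    with open(cvc_path) as f:
        cvc_data = json.load(f)      # e.g. ~90K CVC instances
    with open(llava_path) as f:
        llava_data = json.load(f)    # e.g. the 665K LLaVA-1.5 records

    mixed = cvc_data + llava_data
    # Deterministic shuffle so training-order is reproducible across runs.
    random.Random(seed).shuffle(mixed)
    return mixed
```

The merged list can then be written back out and passed to the standard LLaVA-1.5 instruction-tuning pipeline in place of the original 665K file.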