Multi-Sourced Compositional Generalization in Visual Question Answering
Authors: Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. ... To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset... Experimental results demonstrate that the proposed framework significantly improves VQA models' generalization ability to multi-sourced novel compositions while maintaining their independent and identically distributed (IID) generalization ability. |
| Researcher Affiliation | Academia | 1Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China 2Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, China |
| Pseudocode | No | The paper describes the proposed framework and its components (retrieval database construction, feature retrieval, and feature aggregation) in descriptive text and with a diagram (Figure 2), but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The GQA-MSCG dataset is available at https://github.com/NeverMoreLCH/MSCG. This statement provides access only to the GQA-MSCG dataset, not to the source code for the methodology described in the paper. |
| Open Datasets | Yes | To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset... The GQA-MSCG dataset is available at https://github.com/NeverMoreLCH/MSCG. ... three datasets are selected to validate the effectiveness of the proposed frameworks: the GQA dataset [Hudson and Manning, 2019], the VQA v2 dataset [Goyal et al., 2017] and our GQA-MSCG dataset. |
| Dataset Splits | Yes | For experiments on the GQA dataset and the GQA-MSCG dataset, we fine-tune CFR, Qwen-VL, CFR+RAG, and Qwen-VL+RAG using the train balanced split of the GQA dataset and select the best-performing model weights on the val balanced split of GQA. Using these model weights, we present the experimental results on the test-dev split of the GQA dataset and all seven test splits of our GQA-MSCG dataset. ... For each category of test samples, we randomly sample 5,000 samples from Dc, resulting in a total of 35,000 samples for the GQA-MSCG dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It only mentions model sizes (e.g., 'parameter size less than 0.2B', 'more than 7B parameters'). |
| Software Dependencies | No | The paper mentions using the 'NLTK toolkit [Bird et al., 2009]' but does not specify a version number. It also refers to methods like 'LoRA [Hu et al., 2022]' and 'Faster R-CNN [Ren et al., 2016]', which are architectures or techniques, not specific software libraries with version numbers required for replication. |
| Experiment Setup | Yes | For experiments on all three datasets including GQA, GQA-MSCG and VQA v2, we fine-tune Qwen-VL and Qwen-VL+RAG with LoRA [Hu et al., 2022] with a maximum of 2 epochs. For CFR+RAG and Qwen-VL+RAG, we set wq = 0.6 and wv = 0.4. ... The maximum number of epochs for fine-tuning CFR and CFR+RAG was set to 12. The sampled numbers Tq and Tv for constructing Dq and Dv are set to 8 and 32, respectively. ... The numbers of aggregated primitives Kq and Kv are set to 4 and 16, respectively. Distinctively, for experiments on the VQA v2 dataset, we set Tq = 1, Tv = 32, Kq = 4 and Kv = 4. |
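The dataset-splits row quotes a concrete construction step: 5,000 test samples are drawn per composition category from a candidate pool Dc, giving 35,000 samples across the seven GQA-MSCG test splits. A minimal sketch of that sampling procedure is shown below; the function name `build_test_splits`, the dictionary layout of the candidate pools, and the fixed seed are illustrative assumptions, not details taken from the paper.

```python
import random

# Assumed constants from the quoted description: 7 test categories,
# 5,000 samples drawn per category (35,000 total).
NUM_CATEGORIES = 7
SAMPLES_PER_CATEGORY = 5_000

def build_test_splits(candidates_by_category, seed=0):
    """Hypothetical sketch: draw a fixed-size random test split per category.

    candidates_by_category maps a category name to its candidate pool (Dc).
    A seeded RNG is used so the sampling is reproducible.
    """
    rng = random.Random(seed)
    splits = {}
    for category, pool in candidates_by_category.items():
        # Sample without replacement, as each test sample appears once.
        splits[category] = rng.sample(pool, SAMPLES_PER_CATEGORY)
    return splits
```

With seven pools of sufficient size, the resulting splits total 35,000 samples, matching the figure quoted from the paper.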