SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Authors: Xin Su, Man Luo, Kris W Pan, Tien Pei Chou, Vasudev Lal, Phillip Howard
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that SK-VQA serves both as a challenging KB-VQA benchmark and as an effective training resource for adapting MLLMs to context-augmented generation. Our results further indicate that models trained on SK-VQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings. We perform zero-shot evaluations and fine-tuning of several state-of-the-art MLLMs on both our dataset and existing datasets. |
| Researcher Affiliation | Industry | ¹Intel Labs, ²Amazon, ³Thoughtworks. |
| Pseudocode | No | The paper describes the dataset generation process and filtering methods in prose. It includes figures showing prompts (Figure 3, Figure 8) used to guide GPT-4, but these are not structured pseudocode or algorithm blocks describing an algorithm implemented by the authors themselves. |
| Open Source Code | Yes | Our dataset and its generation code are publicly available. Our code is available via GitHub. |
| Open Datasets | Yes | To address these deficiencies, we construct SK-VQA: the largest KB-VQA dataset to date, containing over 2 million QA pairs associated with synthetic context knowledge and images sourced from LAION (Schuhmann et al., 2021), WIT (Wikipedia images) (Srinivasan et al., 2021), and the synthetic COCO-Counterfactuals dataset (Le et al., 2024). Our dataset and its generation code are publicly available; the dataset is available via the Hugging Face Hub. Existing datasets used in our study: the WIT dataset is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license, the ViQuAE dataset under the MIT license, the COCO-Counterfactuals dataset under the CC BY 4.0 license, and the InfoSeek dataset under the Apache 2.0 license. |
| Dataset Splits | Yes | For InfoSeek, we use a 140K subset of the training data processed by Wei et al. (2023)... We use the original Enc-VQA training set, but since each question can be paired with multiple images, we select only the first image from the original annotations for the training set, which results in approximately 220K training samples. For a fair comparison, we down-sample our dataset subsets to 200K samples each. For InfoSeek, we use a subset of its validation set processed by Wei et al. (2023), which includes 11,323 samples... For Enc-VQA, we use its official test set, which contains 5,750 samples. Due to the small size of the ViQuAE test set, we combine the train, validation, and test sets to create a larger testing set of 3,625 samples. Additionally, we use 10,744 samples from SK-VQA_IR associated with images from LAION for model evaluation. |
| Hardware Specification | Yes | We utilized 24 Intel Gaudi2 AI Accelerators to obtain LLaMA-3-70b predictions for our dataset... For our zero-shot MLLM evaluation and MLLM training experiments, we used an internal Linux Slurm cluster with Nvidia RTX 3090, Nvidia A6000, and Nvidia A100 GPUs. We used up to 48 GPUs to parallelize various experiments on this cluster. Each parallelized worker was allocated 14 Intel(R) Xeon(R) Platinum 8280 CPUs, 124 GB of RAM, and 1 GPU. |
| Software Dependencies | Yes | We use the official codebase from LLaVA-1.5 to fine-tune the llava-v1.5-7b model and the Trainer from the Hugging Face Transformers library to fine-tune the paligemma-3b-mix-224 model. Specifically, we use the widely-used LanguageTool on a random sample of 10K context documents. |
| Experiment Setup | Yes | For the llava-v1.5-7b model, we use a batch size of 16 and a learning rate of 2e-5, training the model for one epoch using bfloat16. Similarly, for the paligemma-3b-mix-224 model, we use a batch size of 64 and a learning rate of 2e-5, also training for one epoch using bfloat16. These are default hyperparameter values which were not tuned as part of our experiments. The inputs to the models are a combination of the question, image, and context, and the outputs are the answers to the questions. |
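The fine-tuning setup quoted above can be summarized in a small sketch. This is a plain-Python summary of the reported hyperparameters, not the authors' actual training code; the `FinetuneConfig` helper is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    model: str
    batch_size: int
    learning_rate: float
    epochs: int = 1          # both models trained for one epoch
    dtype: str = "bfloat16"  # both models trained in bfloat16

# Default (untuned) values as reported in the Experiment Setup row
configs = [
    FinetuneConfig("llava-v1.5-7b", batch_size=16, learning_rate=2e-5),
    FinetuneConfig("paligemma-3b-mix-224", batch_size=64, learning_rate=2e-5),
]

for c in configs:
    print(f"{c.model}: bs={c.batch_size}, lr={c.learning_rate}, "
          f"epochs={c.epochs}, dtype={c.dtype}")
```

In the paper these values are passed to the LLaVA-1.5 codebase and the Hugging Face Transformers `Trainer`, respectively; the sketch only records them in one place for reference.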