MsRAG: Knowledge Augmented Image Captioning with Object-level Multi-source RAG

Authors: Yuming Qiao, Yuechen Wang, Dan Meng, Haonan Lu, Zhenyu Yang, Xudong Zhang

IJCAI 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "To validate the effectiveness of MsRAG, we conducted a series of qualitative and quantitative experiments. The evaluation results demonstrate the superiority of MsRAG over other methods." |
| Researcher Affiliation | Collaboration | Yuming Qiao¹, Yuechen Wang¹, Dan Meng¹, Haonan Lu², Zhenyu Yang², Xudong Zhang¹; ¹OPPO Research Institute, ²OPPO AI Center |
| Pseudocode | No | The paper describes the MsRAG framework and its components (Parallel Visual Search Module, Prompt Templates Pool, Visual-RAG Alignment Module) through descriptive text and architectural diagrams (Fig. 2, Fig. 3), but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor does it provide any links to a code repository. |
| Open Datasets | Yes | "We evaluate MsRAG on LVLMs using three datasets: CapsFusion, KALE, and the KAC-dataset. CapsFusion and KALE are public captioning datasets with real-world knowledge, aligning well with the knowledge-augmented captioning task, effectively testing MsRAG's retrieval and utilization of external information without queries." [Yu et al., 2024] [Awadalla et al., 2024] |
| Dataset Splits | No | The paper introduces the KAC-dataset and mentions using CapsFusion and KALE for evaluation, but does not specify any training, validation, or test splits (e.g., percentages, sample counts, or predefined splits). |
| Hardware Specification | Yes | "All experiments were run on two Nvidia A100s." |
| Software Dependencies | No | "For closed-source models (GPT-4o, Claude), we use their APIs; for open-source models, we deploy them with vLLM [Kwon et al., 2023]." Specific version numbers for software dependencies are not provided. |
| Experiment Setup | No | The paper describes the overall MsRAG framework and mentions integrating various LVLMs (GPT-4o, Claude-3.5-Sonnet, Qwen2-VL, and InternVL2), but it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations. |