MsRAG: Knowledge Augmented Image Captioning with Object-level Multi-source RAG
Authors: Yuming Qiao, Yuechen Wang, Dan Meng, Haonan Lu, Zhenyu Yang, Xudong Zhang
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of MsRAG, we conducted a series of qualitative and quantitative experiments. The evaluation results demonstrate the superiority of MsRAG over other methods. |
| Researcher Affiliation | Collaboration | Yuming Qiao1, Yuechen Wang1, Dan Meng1, Haonan Lu2, Zhenyu Yang2, Xudong Zhang1; 1OPPO Research Institute, 2OPPO AI Center; EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the MsRAG framework and its components (Parallel Visual Search Module, Prompt Templates Pool, Visual-RAG Alignment Module) through descriptive text and architectural diagrams (Fig. 2, Fig. 3), but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor does it provide any links to a code repository. |
| Open Datasets | Yes | We evaluate MsRAG on LVLMs using three datasets: CapFusion, KALE, and KAC-dataset. CapFusion and KALE are public captioning datasets with real-world knowledge, aligning well with the knowledge-augmented captioning task, effectively testing MsRAG's retrieval and utilization of external information without queries. [Yu et al., 2024] [Awadalla et al., 2024] |
| Dataset Splits | No | The paper introduces the KAC-dataset and mentions using CapFusion and KALE for evaluation but does not specify any training, validation, or test dataset splits (e.g., percentages, sample counts, or specific predefined splits). |
| Hardware Specification | Yes | All experiments were run on two NVIDIA A100 GPUs. |
| Software Dependencies | No | For closed-source models (GPT-4o, Claude), we use their APIs; for open-source models, we deploy them with vLLM [Kwon et al., 2023]. Specific version numbers for software dependencies are not provided. |
| Experiment Setup | No | The paper describes the overall MsRAG framework and mentions integrating various LVLMs (GPT-4o, Claude-3.5-Sonnet, Qwen2-VL, and InternVL2), but it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations. |