VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Authors: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20–40% end-to-end performance gain over the traditional text-based RAG pipeline. Further analysis reveals that VisRAG is efficient in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. |
| Researcher Affiliation | Collaboration | 1Department of Computer Science and Technology, Tsinghua University; 2ModelBest Inc.; 3Rice University; 4Northeastern University |
| Pseudocode | No | The paper describes its methodology in Section 3, presenting mathematical formulas and textual explanations for retrieval and generation mechanisms, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are available at https://github.com/openbmb/visrag. |
| Open Datasets | Yes | To evaluate VisRAG on real-world multi-modal documents, we construct datasets from open-source visual question answering (VQA) datasets and synthetic query-document pairs derived from web-crawled PDFs. ... We collect question-document pairs from a series of VQA datasets, targeting different document types: MP-DocVQA (Tito et al., 2023) for industrial documents; ArXivQA (Li et al., 2024b), ChartQA (Masry et al., 2022), InfographicsVQA (Mathew et al., 2022), and PlotQA (Methani et al., 2020) for various figure types; and SlideVQA (Tanaka et al., 2023) for presentation slides. |
| Dataset Splits | Yes | We follow the original datasets' train-test splits, except for MP-DocVQA and InfographicsVQA, where the validation split serves as our evaluation set. |
| Hardware Specification | Yes | VisRAG-Ret is fine-tuned using in-batch negatives (Karpukhin et al., 2020) for one epoch with a batch size of 128 on 8 NVIDIA A100 80GB GPUs. ... Query and document encoding are conducted on an NVIDIA A100 40GB GPU with a batch size of 1, while document parsing is performed on a single core of an Intel Xeon Platinum 8350C CPU. |
| Software Dependencies | No | The paper mentions specific tools like Pytesseract, PaddleOCR (PPOCR) (Du et al., 2020), and the Pillow library (for rendering screenshots), but does not provide specific version numbers for these or for other software dependencies, such as programming languages, frameworks, or libraries, that would be necessary for exact reproduction. |
| Experiment Setup | Yes | VisRAG-Ret is fine-tuned using in-batch negatives (Karpukhin et al., 2020) for one epoch with a batch size of 128 on 8 NVIDIA A100 80GB GPUs. The temperature parameter in Equation 2 is set to 0.02. Baseline retrievers are fine-tuned with the same hyperparameters, and textual baselines utilize extracted text data as document-side input. The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation. ... We train MiniCPM-V 2.0 with a batch size of 2048 and a learning rate of 5e-6 for 1 epoch. |
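The retrieval training quoted above combines in-batch negatives with a temperature of 0.02 (the paper's Equation 2), which corresponds to an InfoNCE-style contrastive objective: within a batch, each query's paired document is the positive and every other document in the batch serves as a negative. A minimal stdlib-only sketch of that objective, with an illustrative toy similarity matrix (the function name and numbers are assumptions, not from the paper):

```python
import math

def in_batch_infonce(sim_matrix, temperature=0.02):
    """InfoNCE loss over one batch.

    sim_matrix[i][j] is the similarity between query i and document j;
    diagonal entries are positives, off-diagonals are in-batch negatives.
    Returns the mean negative log-softmax at each positive.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denom - logits[i])  # -log p(positive | query i)
    return sum(losses) / len(losses)

# Toy cosine-similarity matrix for a batch of 3 query-document pairs.
sims = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
]
loss = in_batch_infonce(sims)
```

The low temperature (0.02) sharpens the softmax, so even modest similarity gaps between the positive and the in-batch negatives drive the loss toward zero.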