VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Authors: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20–40% end-to-end performance gain over the traditional text-based RAG pipeline. Further analysis reveals that VisRAG is efficient in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents.
Researcher Affiliation | Collaboration | 1 Department of Computer Science and Technology, Tsinghua University; 2 ModelBest Inc.; 3 Rice University; 4 Northeastern University
Pseudocode | No | The paper describes its methodology in Section 3, presenting mathematical formulas and textual explanations for the retrieval and generation mechanisms, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/openbmb/visrag.
Open Datasets | Yes | To evaluate VisRAG on real-world multi-modal documents, we construct datasets from open-source visual question answering (VQA) datasets and synthetic query-document pairs derived from web-crawled PDFs. ... We collect question-document pairs from a series of VQA datasets, targeting different document types: MP-DocVQA (Tito et al., 2023) for industrial documents; ArXivQA (Li et al., 2024b), ChartQA (Masry et al., 2022), InfographicsVQA (Mathew et al., 2022), and PlotQA (Methani et al., 2020) for various figure types; and SlideVQA (Tanaka et al., 2023) for presentation slides.
Dataset Splits | Yes | We follow the original datasets' train-test splits, except for MP-DocVQA and InfographicsVQA, where the validation split serves as our evaluation set.
Hardware Specification | Yes | VisRAG-Ret is fine-tuned using in-batch negatives (Karpukhin et al., 2020) for one epoch with a batch size of 128 on 8 NVIDIA A100 80GB GPUs. ... Query and document encoding are conducted on an NVIDIA A100 40G GPU with a batch size of 1, while document parsing is performed on a single core of an Intel Xeon Platinum 8350C CPU.
Software Dependencies | No | The paper mentions specific tools like Pytesseract, PaddleOCR (PPOCR) (Du et al., 2020), and the Pillow library (for rendering screenshots), but does not provide specific version numbers for these or for other software dependencies, such as programming languages, frameworks, or libraries, that would be necessary for exact reproduction.
Experiment Setup | Yes | VisRAG-Ret is fine-tuned using in-batch negatives (Karpukhin et al., 2020) for one epoch with a batch size of 128 on 8 NVIDIA A100 80GB GPUs. The temperature parameter in Equation 2 is set to 0.02. Baseline retrievers are fine-tuned with the same hyperparameters, and textual baselines utilize extracted text data as document-side input. The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation. ... We train MiniCPM-V 2.0 with a batch size of 2048 and a learning rate of 5e-6 for 1 epoch.
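For readers unfamiliar with the training objective referenced above, the combination of in-batch negatives and a temperature of 0.02 typically corresponds to an InfoNCE-style contrastive loss, where each query's paired document is the positive and all other documents in the batch act as negatives. The following is a minimal dependency-free sketch of that loss under the assumption (not stated verbatim in the paper) that embeddings are L2-normalized and similarity is a dot product; the function name and signature are illustrative, not from the VisRAG codebase.

```python
import math


def in_batch_negative_loss(query_embs, doc_embs, temperature=0.02):
    """InfoNCE-style loss with in-batch negatives (illustrative sketch).

    query_embs[i] and doc_embs[i] are assumed to form a positive pair;
    every other document in the batch serves as a negative for query i.
    Embeddings are assumed L2-normalized, so a dot product equals
    cosine similarity. Returns the mean negative log-likelihood of the
    positive document under a temperature-scaled softmax.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    losses = []
    for i, q in enumerate(query_embs):
        # Temperature-scaled similarities against every document in the batch.
        logits = [dot(q, d) / temperature for d in doc_embs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability assigned to the positive document i.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

With a small temperature like 0.02, the softmax sharpens strongly, so even modest similarity gaps between the positive and the in-batch negatives drive the loss toward zero; this is why retrieval models trained this way are sensitive to the temperature setting.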