VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Authors: Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20–40% end-to-end performance gain over the traditional text-based RAG pipeline. Further analysis reveals that VisRAG is efficient in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents.
Researcher Affiliation | Collaboration | 1 Department of Computer Science and Technology, Tsinghua University; 2 ModelBest Inc.; 3 Rice University; 4 Northeastern University
Pseudocode | No | The paper describes its methodology in Section 3, presenting mathematical formulas and textual explanations for the retrieval and generation mechanisms, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/openbmb/visrag.
Open Datasets | Yes | To evaluate VisRAG on real-world multi-modal documents, we construct datasets from open-source visual question answering (VQA) datasets and synthetic query-document pairs derived from web-crawled PDFs. ... We collect question-document pairs from a series of VQA datasets, targeting different document types: MP-DocVQA (Tito et al., 2023) for industrial documents; ArXivQA (Li et al., 2024b), ChartQA (Masry et al., 2022), InfographicsVQA (Mathew et al., 2022), and PlotQA (Methani et al., 2020) for various figure types; and SlideVQA (Tanaka et al., 2023) for presentation slides.
Dataset Splits | Yes | We follow the original datasets' train-test splits, except for MP-DocVQA and InfographicsVQA, where the validation split serves as our evaluation set.
Hardware Specification | Yes | VisRAG-Ret is fine-tuned using in-batch negatives (Karpukhin et al., 2020) for one epoch with a batch size of 128 on 8 NVIDIA A100 80GB GPUs. ... Query and document encoding are conducted on an NVIDIA A100 40G GPU with a batch size of 1, while document parsing is performed on a single core of an Intel Xeon Platinum 8350C CPU.
Software Dependencies | No | The paper mentions specific tools like Pytesseract, PaddleOCR (PPOCR) (Du et al., 2020), and the Pillow library (for rendering screenshots), but does not provide specific version numbers for these or for other software dependencies, such as programming languages, frameworks, or libraries, that would be necessary for exact reproduction.
Experiment Setup | Yes | VisRAG-Ret is fine-tuned using in-batch negatives (Karpukhin et al., 2020) for one epoch with a batch size of 128 on 8 NVIDIA A100 80GB GPUs. The temperature parameter in Equation 2 is set to 0.02. Baseline retrievers are fine-tuned with the same hyperparameters, and textual baselines utilize extracted text data as document-side input. The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation. ... We train MiniCPM-V 2.0 with a batch size of 2048 and a learning rate of 5e-6 for 1 epoch.
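For readers unfamiliar with the training objective referenced above, the combination of in-batch negatives and a temperature of 0.02 typically corresponds to an InfoNCE-style contrastive loss, where each query's paired document is the positive and all other documents in the batch act as negatives. The following is a minimal dependency-free sketch of that loss under the assumption (not stated verbatim in the paper) that embeddings are L2-normalized and similarity is a dot product; the function name and signature are illustrative, not from the VisRAG codebase.

```python
import math


def in_batch_negative_loss(query_embs, doc_embs, temperature=0.02):
    """InfoNCE-style loss with in-batch negatives (illustrative sketch).

    query_embs[i] and doc_embs[i] are assumed to form a positive pair;
    every other document in the batch serves as a negative for query i.
    Embeddings are assumed L2-normalized, so a dot product equals
    cosine similarity. Returns the mean negative log-likelihood of the
    positive document under a temperature-scaled softmax.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    losses = []
    for i, q in enumerate(query_embs):
        # Temperature-scaled similarities against every document in the batch.
        logits = [dot(q, d) / temperature for d in doc_embs]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability assigned to the positive document i.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

With a small temperature like 0.02, the softmax sharpens strongly, so even modest similarity gaps between the positive and the in-batch negatives drive the loss toward zero; this is why retrieval models trained this way are sensitive to the temperature setting.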