SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding

Authors: Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan Rossi, Changyou Chen, Tong Sun

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG. We assess the performance of SV-RAG in evidence page retrieval and visual question answering. We first evaluate the retrieval accuracy of the Col-retrieval module within SV-RAG and compare it with several baselines on SlideVQA (Tanaka et al., 2023), MMLongBench-Doc (Ma et al., 2024d), DUDE (Van Landeghem et al., 2023), DocVQA (Mathew et al., 2020; 2021), and VisR-Bench. We then conduct question-answering experiments with SV-RAG and compare the results with other LMM baselines, including single-page and cross-page VQA.
Researcher Affiliation Collaboration Jian Chen1, Ruiyi Zhang2, Yufan Zhou2, Tong Yu2, Franck Dernoncourt2, Jiuxiang Gu2, Ryan Rossi2, Changyou Chen1, Tong Sun2 (1University at Buffalo, 2Adobe Research)
Pseudocode Yes Algorithm 1 Col-retrieval training
Require: Pre-trained MLLM {f_v, f_l^r}, training batch of evidence image and question pairs {(X_1, y_1), ..., (X_b, y_b)}.
1: Initialize the Col-projection layer f_p.
2: while not converged do
3:   Get E_v^i = f_p(f_l^r(f_v(X_i), y_i)), i ∈ {1, ..., b}.
4:   Get E_q^i = f_p(f_l^r(y_i)), i ∈ {1, ..., b}.
5:   Compute S_{i,j} = s_LI(E_q^i, E_v^j).
6:   Get negative image index î for each y_i: î = argmax_{j ∈ {1,...,b}, j ≠ i} S_{i,j}.
7:   Gradient update using the loss function in Eq. (2), with E_v^î as the negative image embedding.
8: end while
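The scoring step (line 5) and the in-batch hard-negative mining step (line 6) of the algorithm above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the function names are illustrative, the late-interaction score s_LI is assumed to be the ColBERT-style MaxSim (sum over query tokens of the max similarity to any image-patch embedding), and the arrays stand in for the Col-projection outputs E_q^i and E_v^j.

```python
import numpy as np

def late_interaction_score(E_q, E_v):
    """Assumed s_LI: for each query-token embedding, take the max
    dot-product similarity over all image-patch embeddings, then
    sum over query tokens (ColBERT-style MaxSim)."""
    sim = E_q @ E_v.T            # (n_query_tokens, n_patch_tokens)
    return sim.max(axis=1).sum()

def mine_hard_negatives(batch_q, batch_v):
    """For each question i, pick the most similar non-matching image
    in the batch as its hard negative (line 6 of Algorithm 1)."""
    b = len(batch_q)
    S = np.array([[late_interaction_score(batch_q[i], batch_v[j])
                   for j in range(b)] for i in range(b)])
    negatives = []
    for i in range(b):
        scores = S[i].copy()
        scores[i] = -np.inf      # exclude the positive pair (j != i)
        negatives.append(int(np.argmax(scores)))
    return S, negatives
```

With this in hand, the training loop would treat (E_q^i, E_v^i) as the positive pair and (E_q^i, E_v^î) as the mined negative when computing the contrastive loss of Eq. (2).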
Open Source Code No The paper does not provide an explicit statement or link to their specific source code implementation for SV-RAG. It mentions using open-source models as backbones (PaliGemma, Phi-3-V, InternVL2) and tools like PaddleOCR, but not their own code.
Open Datasets Yes We collect a visually-rich document QA dataset, VisR-Bench, comprising nine domains including magazines, flyers, newsletters, product manuals, and presentations. This dataset is built upon web-crawled documents, containing 226 documents and 471 question-answer pairs. We evaluated our method's performance on four public datasets, SlideVQA, MMLongBench-Doc (Ma et al., 2024d), DocVQA (Mathew et al., 2021), and DUDE (Van Landeghem et al., 2023), along with our proposed VisR-Bench dataset. The VisR-Bench dataset was curated with careful consideration of ethical and legal concerns. All documents are sourced from publicly available data with licenses explicitly permitting research use. To ensure data integrity and compliance, we provide links to the original sources instead of distributing the documents.
Dataset Splits Yes We fine-tuned our QA modules using the training split of the SlideVQA dataset (Tanaka et al., 2023). The SlideVQA dataset contains 1,919 slide decks in the training set, 400 in the test set, and 300 in the development set, with each deck consisting of 20 pages. The training split includes 10,290 samples, each annotated with questions, answers, and corresponding evidence. For DocVQA, we used 5,349 SP and 5,187 MP QA pairs from the validation split. Similarly, we combined the test and dev splits of SlideVQA to form 2,995 SP and 763 MP QA pairs for evaluation. For DUDE, we evaluated 6,307 QA pairs from the validation split.
Hardware Specification Yes All experiments are implemented with PyTorch and conducted on NVIDIA A100 GPUs. In datasets like MMLongBench-Doc and DocVQA, some documents exceed hundreds of pages, causing out-of-memory errors even on servers with 8 A100 (80GB) GPUs.
Software Dependencies No The paper mentions 'All experiments are implemented with PyTorch', 'using the AdamW optimizer', and 'GPT-4o (API version 2024-02-15-preview)', but specific version numbers are not provided for PyTorch or AdamW. While an API version is given for GPT-4o, other core dependencies lack specific versioning.
Experiment Setup Yes The Col-retrieval modules are fine-tuned for 4 epochs with a batch size of 32 and a learning rate of 5e-5, using the AdamW optimizer and LoRA adapters on all linear layers in the LLM. The LoRA rank is set to 32.
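To make the adapter shape concrete, here is a minimal NumPy sketch of a LoRA-augmented linear layer. The class name, initialization scheme, and `alpha` scaling are illustrative assumptions; the paper only specifies that rank-32 LoRA adapters are attached to all linear layers of the LLM.

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA adapter on one linear layer: the frozen
    pre-trained weight W (d_out x d_in) is augmented with a trainable
    low-rank update B @ A of rank r (the paper uses r = 32)."""
    def __init__(self, W, r=32, alpha=32, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                  # frozen base weight
        self.A = rng.normal(0.0, 0.02, (r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))               # trainable up-projection, zero-init
        self.scale = alpha / r                      # standard LoRA scaling (assumed)

    def __call__(self, x):
        # Base path plus scaled low-rank correction; only A and B
        # would receive gradients during fine-tuning.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B is initialized to zero, the adapted layer reproduces the base model exactly at the start of fine-tuning, which is the usual LoRA design choice.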