DocVXQA: Context-Aware Visual Explanations for Document Question Answering

Authors: Mohamed Ali Souibgui, Changkyu Choi, Andrey Barsky, Kangsoo Jung, Ernest Valveny, Dimosthenis Karatzas

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method." (Section 4, Experiments and Results)
Researcher Affiliation | Academia | "1 Computer Vision Center, Universitat Autònoma de Barcelona, Spain; 2 UiT The Arctic University of Norway, Norway; 3 Inria, France."
Pseudocode | No | The paper describes the methodology using text, mathematical formulations (e.g., the objective function in Equation 3), and flowcharts (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "The code is available at https://github.com/dali92002/DocVXQA."
Open Datasets | Yes | "The experiments are done on two datasets, DocVQA (Mathew et al., 2021) and PFL-DocVQA (Tito et al., 2024)."
Dataset Splits | No | The paper states that experiments are done on the DocVQA and PFL-DocVQA datasets and refers to a "fine-tuned Pix2Struct to predict the answer", but does not specify the train/validation/test splits used.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions using Pix2Struct and the AdamW optimizer but does not specify version numbers for any software libraries, programming languages, or other dependencies.
Experiment Setup | Yes | Hyperparameters reported: learning rate 1×10⁻⁷; batch size 5; optimizer AdamW; γ = 0.5; β = 5; postprocessing threshold k = 3.
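The reported hyperparameters can be collected into a configuration sketch. This is a minimal illustration, not the authors' released code: the `ExperimentConfig` class and its field names are hypothetical, and only the numeric values come from the paper's experiment-setup table (γ and β appear in the paper's objective, Equation 3; their exact roles are defined there).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the values in the paper's setup table."""
    learning_rate: float = 1e-7        # reported as 1×10⁻⁷
    batch_size: int = 5
    optimizer: str = "AdamW"
    gamma: float = 0.5                 # γ, a coefficient in the objective (Equation 3)
    beta: float = 5.0                  # β, a coefficient in the objective (Equation 3)
    postprocess_threshold_k: int = 3   # threshold k used in postprocessing


cfg = ExperimentConfig()
# In a training script these values would feed the optimizer, e.g. (not run here):
#   torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)
print(cfg.optimizer)  # AdamW
```

Freezing the dataclass keeps the reported values immutable, which is convenient when logging a run's configuration for reproducibility.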