Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering

Authors: Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.
Researcher Affiliation Collaboration 1School of Artificial Intelligence, Jilin University, Changchun, China 2Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China 3Huawei Technologies Co., Ltd., Beijing, China 4School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China 5Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode No The paper describes the methodology using prose and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code No The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets Yes We evaluate our approach on the retrieval-based multi-image QA dataset (Penamakuri et al. 2023), which contains 334K samples for training, 41K for validation and 41K for testing. The questions cover various types, including color, shape, counting, object attributes and relations.
Dataset Splits Yes We evaluate our approach on the retrieval-based multi-image QA dataset (Penamakuri et al. 2023), which contains 334K samples for training, 41K for validation and 41K for testing.
Hardware Specification Yes We conduct our experiments using an NVIDIA 3090 24GB GPU.
Software Dependencies No The paper mentions specific models such as mPLUG-Owl2 and QAG, and uses CLIP with an RN50 backbone, but does not provide version numbers for the software dependencies used in the implementation (e.g., the Python, PyTorch, or library versions).
Experiment Setup Yes We train the network for 20 epochs with a batch size of 100 and an initial learning rate of 1e-4, using the AdamW optimizer.
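The reported hyperparameters (20 epochs, batch size 100, initial learning rate 1e-4, AdamW) can be sketched as a minimal PyTorch training loop. This is a hedged illustration only: the paper's code is not released, so the model and data below are placeholders, not the authors' architecture.

```python
import torch

# Stand-in model and dummy data; the paper's actual retrieval-QA model is not public.
torch.manual_seed(0)
model = torch.nn.Linear(16, 2)
loss_fn = torch.nn.CrossEntropyLoss()

# Hyperparameters as reported: AdamW, initial LR 1e-4, batch size 100, 20 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
EPOCHS, BATCH_SIZE = 20, 100

x = torch.randn(BATCH_SIZE, 16)                 # one dummy batch
y = torch.randint(0, 2, (BATCH_SIZE,))

for _ in range(EPOCHS):                         # one batch per epoch, for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

final_loss = loss.item()
```

A learning-rate schedule may also have been used; the paper's quoted setup only states the initial value, so none is assumed here.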