Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation
Authors: Xiaofeng Zhang, Fanshuo Zeng, Yihao Quan, Zheng Hui, Jiawei Yao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate the effectiveness of our method for complex reasoning tasks. Our experiments were performed on a single 4090D GPU. |
| Researcher Affiliation | Academia | 1 Shanghai Jiaotong University, 2 Institute of Automation, Chinese Academy of Sciences, 3 Beijing Jiaotong University, 4 Columbia University, 5 University of Washington |
| Pseudocode | No | The paper describes algorithms (Simignore algorithm, image-text token filtering algorithm) and their steps in paragraph form, but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code: https://github.com/FanshuoZeng/Simignore |
| Open Datasets | Yes | The Science QA (Lu et al. 2022) dataset is currently the only dataset available for complex reasoning and contains 21,208 Q&A multiple-choice questions from elementary and middle school science curricula. |
| Dataset Splits | No | The paper mentions using the Science QA dataset but does not provide specific details on training, validation, or test splits. It only states the dataset contains '21,208 Q&A multiple-choice questions'. |
| Hardware Specification | Yes | Our experiments were performed on a single 4090D GPU. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | Our experiments were performed on a single 4090D GPU. ... As shown in Table 1, we show the results for the different baseline models as well as for the models to which our method is applied. ... Table 2: Accuracy and runtime of LLM when ignoring different numbers of image tokens (baseline: LLaVA1.5-7B). |
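
The image-text token filtering described in the Pseudocode row (the Simignore algorithm, which the paper presents only in paragraph form) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_image_tokens`, the use of mean cosine similarity as the score, and the fixed `keep` budget are all assumptions for the sake of example.

```python
import numpy as np

def filter_image_tokens(image_tokens: np.ndarray,
                        text_tokens: np.ndarray,
                        keep: int) -> np.ndarray:
    """Return indices of the `keep` image tokens most similar to the text.

    image_tokens: (n_img, d) embeddings; text_tokens: (n_txt, d) embeddings.
    Hypothetical sketch of similarity-based filtering: low-similarity image
    tokens are ignored so the LLM attends only to text-relevant ones.
    """
    # L2-normalize so dot products become cosine similarities.
    img = image_tokens / np.linalg.norm(image_tokens, axis=1, keepdims=True)
    txt = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    # Score each image token by its mean similarity to all text tokens.
    scores = (img @ txt.T).mean(axis=1)
    # Keep the top-`keep` image tokens; the rest would be dropped.
    return np.argsort(scores)[::-1][:keep]
```

Under this sketch, varying `keep` corresponds to the "ignoring different numbers of image tokens" experiment reported in Table 2 of the paper.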