The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

Authors: HONG LI, Nanxi Li, Yuanjie Chen, Jianbin Zhu, Qinlu Guo, Cewu Lu, Yong-Lu Li

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that current open-source MLLMs consistently have a weak ability in association tasks, and even the closed-source GPT-4V Achiam et al. (2023) and Gemini-1.5-Flash Reid et al. (2024) still fall far short of human performance.
Researcher Affiliation Academia Hong Li, Nanxi Li, Yuanjie Chen, Jianbin Zhu, Qinlu Guo, Cewu Lu, Yong-Lu Li, Shanghai Jiao Tong University, Shanghai Innovation Institute
Pseudocode Yes Algorithm 1 Synchronous/Asynchronous Association Evaluation
Open Source Code Yes Our data and code are available at: https://mvig-rhos.com/llm_inception.
Open Datasets Yes Specifically, we utilize the annotation-free construction method proposed in Section 3 on Object Concept Learning (OCL) Li et al. (2023) to generate attribute and affordance association datasets, and on Pangea Li et al. (2024c) to generate an action association dataset. To further demonstrate capability on verb concepts, we further apply it to the HMDB action dataset Kuehne et al. (2011).
Dataset Splits No The paper describes how samples are chosen for each step of an association task (e.g., 'randomly select one sample, xquery, as the initial query image. Then, for each step t, We randomly choose candidate samples'), but does not explicitly define standard training, validation, or test dataset splits with percentages or counts for the overall benchmark evaluation.
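The quoted sampling procedure (a random initial query image, then per-step candidate sampling) can be sketched minimally as below. The step count, candidate-pool size, and the `is_associated` predicate standing in for the MLLM's judgment are hypothetical placeholders, not the paper's actual protocol.

```python
import random

def association_walk(samples, is_associated, num_steps=5, num_candidates=4, seed=0):
    """Sketch of one association-evaluation episode: start from a random
    query sample; at each step, draw candidates and continue from the
    first one the model judges to be associated with the current query."""
    rng = random.Random(seed)
    query = rng.choice(samples)
    chain = [query]
    for _ in range(num_steps):
        candidates = rng.sample([s for s in samples if s is not query], num_candidates)
        # is_associated(query, c) stands in for the MLLM's link judgment.
        linked = [c for c in candidates if is_associated(query, c)]
        if not linked:
            break  # no associated candidate found; the chain ends early
        query = linked[0]
        chain.append(query)
    return chain
```

With an always-true predicate the walk runs the full number of steps; with a real model the chain length becomes a rough signal of association ability.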
Hardware Specification Yes In experiments, open-source MLLMs are run on a single NVIDIA A100 80GB GPU.
Software Dependencies No The paper mentions various MLLMs (e.g., 'Qwen-VL', 'LLaVA-NeXT-7B', 'GPT-4V') that were evaluated, but does not provide specific version numbers for the ancillary software or libraries used to implement the benchmark itself.
Experiment Setup Yes In the experiment, we set the repetition weight wr and the forgetting decrement df to 1.0 and 0.2, respectively, for memory-based attention in StructM and NLM in all cases.
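One plausible reading of these two hyperparameters, sketched below under stated assumptions: the revisited item's memory weight grows by the repetition weight wr, while all other weights decay by the forgetting decrement df, floored at zero. This is an illustrative toy update, not the paper's StructM/NLM implementation.

```python
def update_memory(memory, visited, w_r=1.0, d_f=0.2):
    """Toy memory update: boost the visited item's weight by w_r and
    decay every other item's weight by d_f (never below zero)."""
    new = {}
    for item, w in memory.items():
        if item == visited:
            new[item] = w + w_r       # repetition strengthens memory
        else:
            new[item] = max(0.0, w - d_f)  # forgetting decays memory
    new.setdefault(visited, w_r)      # first visit starts at w_r
    return new
```

With wr = 1.0 and df = 0.2 as in the reported setup, a repeated item quickly dominates while unvisited items fade over roughly five steps.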