The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs

Authors: HONG LI, Nanxi Li, Yuanjie Chen, Jianbin Zhu, Qinlu Guo, Cewu Lu, Yong-Lu Li

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that current open-source MLLMs consistently have a weak ability in association tasks, and even the closed-source GPT-4V Achiam et al. (2023) and Gemini-1.5-Flash Reid et al. (2024) still fall far short of human performance.
Researcher Affiliation Academia Hong Li, Nanxi Li, Yuanjie Chen, Jianbin Zhu, Qinlu Guo, Cewu Lu, Yong-Lu Li, Shanghai Jiao Tong University, Shanghai Innovation Institute
Pseudocode Yes Algorithm 1 Synchronous/Asynchronous Association Evaluation
Open Source Code Yes Our data and code are available at: https://mvig-rhos.com/llm_inception.
Open Datasets Yes Specifically, we utilize the annotation-free construction method proposed in Section 3 on Object Concept Learning (OCL) Li et al. (2023) to generate attribute and affordance association datasets, and on Pangea Li et al. (2024c) to generate an action association dataset. To further demonstrate capability on verb concepts, we further apply it to the HMDB action dataset Kuehne et al. (2011).
Dataset Splits No The paper describes how samples are chosen for each step of an association task (e.g., 'randomly select one sample, xquery, as the initial query image. Then, for each step t, We randomly choose candidate samples'), but does not explicitly define standard training, validation, or test dataset splits with percentages or counts for the overall benchmark evaluation.
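The quoted sampling procedure (a random initial query image, then per-step candidate sampling) can be sketched minimally as below. The step count, candidate-pool size, and the `is_associated` predicate standing in for the MLLM's judgment are hypothetical placeholders, not the paper's actual protocol.

```python
import random

def association_walk(samples, is_associated, num_steps=5, num_candidates=4, seed=0):
    """Sketch of one association-evaluation episode: start from a random
    query sample; at each step, draw candidates and continue from the
    first one the model judges to be associated with the current query."""
    rng = random.Random(seed)
    query = rng.choice(samples)
    chain = [query]
    for _ in range(num_steps):
        candidates = rng.sample([s for s in samples if s is not query], num_candidates)
        # is_associated(query, c) stands in for the MLLM's link judgment.
        linked = [c for c in candidates if is_associated(query, c)]
        if not linked:
            break  # no associated candidate found; the chain ends early
        query = linked[0]
        chain.append(query)
    return chain
```

With an always-true predicate the walk runs the full number of steps; with a real model the chain length becomes a rough signal of association ability.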
Hardware Specification Yes In experiments, open-source MLLMs are run on a single NVIDIA A100 80GB GPU.
Software Dependencies No The paper mentions various MLLMs (e.g., 'Qwen-VL', 'LLaVA-NeXT-7B', 'GPT-4V') that were evaluated, but does not provide specific version numbers for the ancillary software or libraries used to implement the benchmark itself.
Experiment Setup Yes In the experiment, we set the repetition weight wr and the forgetting decrement df to 1.0 and 0.2, respectively, for memory-based attention in StructM and NLM in all cases.
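One plausible reading of these two hyperparameters, sketched below under stated assumptions: the revisited item's memory weight grows by the repetition weight wr, while all other weights decay by the forgetting decrement df, floored at zero. This is an illustrative toy update, not the paper's StructM/NLM implementation.

```python
def update_memory(memory, visited, w_r=1.0, d_f=0.2):
    """Toy memory update: boost the visited item's weight by w_r and
    decay every other item's weight by d_f (never below zero)."""
    new = {}
    for item, w in memory.items():
        if item == visited:
            new[item] = w + w_r       # repetition strengthens memory
        else:
            new[item] = max(0.0, w - d_f)  # forgetting decays memory
    new.setdefault(visited, w_r)      # first visit starts at w_r
    return new
```

With wr = 1.0 and df = 0.2 as in the reported setup, a repeated item quickly dominates while unvisited items fade over roughly five steps.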