AIM: Let Any Multimodal Large Language Models Embrace Efficient In-Context Learning
Authors: Jun Gao, Qian Qiao, Tianxiang Wu, Zili Wang, Ziqiang Cao, Wenjie Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | AIM is comprehensively evaluated on image captioning, VQA, and hateful speech detection. Outstanding results reveal that AIM provides an efficient and effective solution for upgrading MLLMs for multimodal ICL. Figure 1: Memory cost comparison between LLaVA-Next, Open Flamingo, and AIM on Flickr30k; the memory cost of LLaVA-Next and Open Flamingo surges, while AIM's remains almost unchanged. Figure 2: Performance comparison between AIM and its underlying backbones in the 16-shot ICL setting. |
| Researcher Affiliation | Academia | 1 School of Computer Science and Technology, Soochow University, China; 2 Independent Researcher; 3 Computation Department, The Hong Kong Polytechnic University, Hong Kong |
| Pseudocode | No | The paper describes its methodology using descriptive text and mathematical equations, such as Equations (1), (2), (3), and (4), and illustrates the architecture in Figure 4. However, it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/jungao1106/AIM. |
| Open Datasets | Yes | We comprehensively evaluate AIM on image captioning (Plummer et al. 2015), visual question answering (VQA) (Gurari et al. 2018; Marino et al. 2019), and hateful speech detection (Kiela et al. 2020)... We train the adapter on a subset of MMC4 (Zhu et al. 2023)... Table 2: Details of the evaluation datasets: Flickr30k... OKVQA... Vizwiz... Hateful Memes... |
| Dataset Splits | Yes | We briefly illustrate the involved datasets of AIM in Table 2. The test data was carefully filtered to exclude datasets encountered during the training of the backbone MLLMs according to their technical reports, aiming to obtain more reliable ICL results. Table 2 (dataset: training instances / eval. split / metric): Flickr30k: 1000 / Test / CIDEr; OKVQA: 5046 / Val / VQA acc.; Vizwiz: 4319 / Val / VQA acc.; Hateful Memes: 815 / Test / ROC AUC |
| Hardware Specification | Yes | Additionally, we conduct all experiments on a single DGX node with 8*Nvidia H800 GPUs. |
| Software Dependencies | No | The paper mentions specific models and techniques used (e.g., QWen-VL, LLaVA-Next, Flash Attention, LoRA) but does not provide a list of ancillary software with specific version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | During training, we set the maximum number of pictures to 5 per step for efficiency... We fix the learning rate to 3e-5, use Adam as the optimizer, and set the effective batch size to 16 (4-GPU data parallelism with 4 steps of gradient accumulation). The number of epochs is set to 10 and we save a checkpoint every 3400 training steps. |
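The reported effective batch size of 16 decomposes as per-device batch × data-parallel workers × gradient-accumulation steps. A minimal sketch of that arithmetic, useful when checking a setup against the paper's numbers (variable names are illustrative, not taken from the paper or its released code):

```python
# Reported hyperparameters: Adam, lr 3e-5, effective batch 16
# achieved with 4-GPU data parallelism and 4-step gradient accumulation.
LEARNING_RATE = 3e-5
NUM_GPUS = 4            # data-parallel workers
GRAD_ACCUM_STEPS = 4    # micro-batches accumulated per optimizer step
EFFECTIVE_BATCH = 16    # effective batch size reported in the paper

# Per-device micro-batch size implied by the reported configuration:
per_device_batch = EFFECTIVE_BATCH // (NUM_GPUS * GRAD_ACCUM_STEPS)
print(per_device_batch)  # -> 1
```

This implies each GPU processes a single sample per micro-step, which is consistent with the memory pressure of long multimodal in-context sequences.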