AIM: Let Any Multimodal Large Language Models Embrace Efficient In-Context Learning
Authors: Jun Gao, Qian Qiao, Tianxiang Wu, Zili Wang, Ziqiang Cao, Wenjie Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | AIM is comprehensively evaluated on image captioning, VQA, and hateful speech detection. Outstanding results reveal that AIM provides an efficient and effective solution for upgrading MLLMs for multimodal ICL. Figure 1: Memory cost comparison between LLaVA-Next, Open Flamingo, and AIM on Flickr30k; the memory cost of LLaVA-Next and Open Flamingo surges, while AIM's remains almost unchanged. Figure 2: Performance comparison between AIM and its underlying backbones in the 16-shot ICL setting. |
| Researcher Affiliation | Academia | 1 School of Computer Science and Technology, Soochow University, China; 2 Independent Researcher; 3 Computation Department, The Hong Kong Polytechnic University, Hong Kong |
| Pseudocode | No | The paper describes its methodology using descriptive text and mathematical equations, such as Equations (1), (2), (3), and (4), and illustrates the architecture in Figure 4. However, it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/jungao1106/AIM. |
| Open Datasets | Yes | We comprehensively evaluate AIM on image captioning (Plummer et al. 2015), visual question answering (VQA) (Gurari et al. 2018; Marino et al. 2019), and hateful speech detection (Kiela et al. 2020)... We train the adapter on a subset of MMC4 (Zhu et al. 2023)... Table 2: Details of the evaluation datasets: Flickr30k... OKVQA... Vizwiz... Hateful Memes... |
| Dataset Splits | Yes | We briefly illustrate the involved datasets of AIM in Table 2. The test data was carefully filtered to exclude datasets encountered during the training of the backbone MLLMs according to their technical reports, aiming to obtain more reliable ICL results. Table 2 (dataset: training instances / eval. split / metric): Flickr30k: 1000 / Test / CIDEr; OKVQA: 5046 / Val / VQA acc.; Vizwiz: 4319 / Val / VQA acc.; Hateful Memes: 815 / Test / ROC AUC |
| Hardware Specification | Yes | Additionally, we conduct all experiments on a single DGX node with 8*Nvidia H800 GPUs. |
| Software Dependencies | No | The paper mentions specific models and techniques used (e.g., QWen-VL, LLaVA-Next, Flash Attention, LoRA) but does not provide a list of ancillary software with specific version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | During training, we set the maximum number of pictures to 5 per step for efficiency... We fix the learning rate to 3e-5, use Adam as the optimizer, and set the effective batch size to 16 (4-GPU data parallelism with 4 steps of gradient accumulation). The number of epochs is set to 10 and we save a checkpoint every 3400 training steps. |
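The reported effective batch size of 16 decomposes as per-device batch × data-parallel workers × gradient-accumulation steps. A minimal sketch of that arithmetic, useful when checking a setup against the paper's numbers (variable names are illustrative, not taken from the paper or its released code):

```python
# Reported hyperparameters: Adam, lr 3e-5, effective batch 16
# achieved with 4-GPU data parallelism and 4-step gradient accumulation.
LEARNING_RATE = 3e-5
NUM_GPUS = 4            # data-parallel workers
GRAD_ACCUM_STEPS = 4    # micro-batches accumulated per optimizer step
EFFECTIVE_BATCH = 16    # effective batch size reported in the paper

# Per-device micro-batch size implied by the reported configuration:
per_device_batch = EFFECTIVE_BATCH // (NUM_GPUS * GRAD_ACCUM_STEPS)
print(per_device_batch)  # -> 1
```

This implies each GPU processes a single sample per micro-step, which is consistent with the memory pressure of long multimodal in-context sequences.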