Efficient Multi-modal Long Context Learning for Training-free Adaptation

Authors: Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments.
Researcher Affiliation | Collaboration | 1. State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2. Peng Cheng Laboratory, Shenzhen, China; 3. Huawei Inc.; 4. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ). Correspondence to: Shiliang Zhang <EMAIL>.
Pseudocode | Yes | Detailed pseudocode is provided in Appendix A. The procedure proceeds as follows: ... Algorithm 1: Layer-wise Adaptive Pruning
Open Source Code | Yes | Code is publicly available at https://github.com/Zehong-Ma/EMLoC.
Open Datasets | Yes | Evaluation Dataset. We evaluate our EMLoC on six challenging benchmarks: ImageNet100, a subset of ImageNet1k (Deng et al., 2009) with the first 100 classes, for recognition; ScreenSpot for cross-platform GUI grounding; MME-RW for real-world multimodal tasks; IllusionVQA for illusion understanding; OK-VQA for knowledge-based QA; and YouCook2 for video understanding.
Dataset Splits | Yes | For datasets without predefined validation splits, we randomly sample 100 test examples for evaluation. ... Demonstration examples are uniformly sampled from the training set, ensuring an even distribution per class. For instance, in the 200-example setting, each class contributes two examples. Evaluation is conducted on the full validation set with 5000 images.
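The class-balanced demonstration sampling described above (e.g. 200 demonstrations over 100 classes, two per class) can be sketched as follows. This is a minimal illustration only; the `(example, label)` data format and the `sample_demonstrations` helper are assumptions for this sketch, not the paper's released code.

```python
import random
from collections import defaultdict

def sample_demonstrations(labeled_examples, num_demos, seed=0):
    """Uniformly sample demonstrations so each class contributes equally.

    labeled_examples: list of (example, class_label) pairs (hypothetical format).
    num_demos: total demonstration budget, assumed divisible by the class count.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in labeled_examples:
        by_class[label].append(example)
    # e.g. 200 demos / 100 classes = 2 examples per class
    per_class = num_demos // len(by_class)
    demos = []
    for label in sorted(by_class):
        demos.extend(rng.sample(by_class[label], per_class))
    return demos
```

Sampling the same number of examples from every class keeps the in-context demonstrations from being dominated by frequent classes, matching the even per-class distribution described in the row above.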
Hardware Specification | Yes | Experiments are conducted on NVIDIA L20 GPUs with 48GB of memory. Inference time is measured with a batch size of 1 on one GPU.
Software Dependencies | No | The paper mentions specific software names such as 'Qwen2-VL', 'DeepSpeed ZeRO-3', and 'LLaMA-Factory', but does not provide version numbers for these or for other software components such as programming languages or libraries.
Experiment Setup | Yes | The default JS divergence threshold δ is set to 0.005, and the chunk size is 1.6k. The retention ratio set R is [0.1, 0.2, 0.5, 1.0]. ... In LoRA adaptation, we apply LoRA adapters to all linear modules of the LLM, including qkv_proj, out_proj, up_proj, and down_proj, while keeping the vision encoder and multi-modal projector frozen. The rank and alpha are set to 16 and 32, respectively. ... The detailed hyperparameters are reported in Table 15 and Table 16.
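The interplay of the δ = 0.005 threshold and the retention ratio set R = [0.1, 0.2, 0.5, 1.0] can be sketched as a gate: accept the smallest retention ratio whose pruned output distribution stays within δ (in JS divergence) of the full-context output. This is a hedged sketch under stated assumptions, not the paper's implementation; the `pruned_probs` dict interface and function names are illustrative.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_retention_ratio(full_probs, pruned_probs,
                           ratios=(0.1, 0.2, 0.5, 1.0), delta=0.005):
    """Pick the smallest retention ratio whose pruned output stays within
    delta of the full-context output; fall back to keeping everything.

    pruned_probs: dict mapping each candidate ratio to the output
    distribution obtained when pruning at that ratio (illustrative interface).
    """
    for r in sorted(ratios):
        if js_divergence(full_probs, pruned_probs[r]) <= delta:
            return r
    return 1.0
```

Because JS divergence is bounded and symmetric, a fixed small threshold like 0.005 gives a scale-free acceptance test: aggressive pruning is kept only where it leaves the model's output distribution essentially unchanged.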