Efficient Multi-modal Long Context Learning for Training-free Adaptation
Authors: Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. |
| Researcher Affiliation | Collaboration | 1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University 2Peng Cheng Laboratory, Shenzhen, China 3Huawei Inc. 4Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ). Correspondence to: Shiliang Zhang <EMAIL>. |
| Pseudocode | Yes | Detailed pseudocode is provided in Appendix A. The procedure proceeds as follows: ... Algorithm 1: Layer-wise Adaptive Pruning |
| Open Source Code | Yes | Codes are publicly available at https://github.com/Zehong-Ma/EMLoC. |
| Open Datasets | Yes | Evaluation Dataset. We evaluate our EMLoC on six challenging benchmarks: ImageNet100, a subset of ImageNet1k (Deng et al., 2009) with the first 100 classes for recognition, ScreenSpot for cross-platform GUI grounding, MMERW for real-world multimodal tasks, IllusionVQA for illusion understanding, OK-VQA for knowledge-based QA, and YouCook2 for video understanding. |
| Dataset Splits | Yes | For datasets without predefined validation splits, we randomly sample 100 test examples for evaluation. ... Demonstration examples are uniformly sampled from the training set, ensuring even distribution per class. For instance, in the 200-example setting, each class contributes two examples. Evaluation is conducted on the full validation set with 5000 images. |
| Hardware Specification | Yes | Experiments are conducted on NVIDIA L20 GPUs with 48GB of memory. Inference time is measured with a batch size of 1 on one GPU. |
| Software Dependencies | No | The paper mentions specific software names like 'Qwen2-VL', 'DeepSpeed ZeRO-3', and 'LLaMA-Factory' but does not provide version numbers for these or other software components like programming languages or libraries. |
| Experiment Setup | Yes | The default JS divergence threshold δ is set to 0.005, and the chunk size is 1.6k. The retention ratio set R is [0.1, 0.2, 0.5, 1.0]. ... In LoRA adaptation, we apply LoRA adapters to all linear modules of the LLM, including qkv_proj, out_proj, up_proj, and down_proj, while keeping the vision encoder and multi-modal projector frozen. The rank and alpha are set to 16 and 32, respectively. ... The detailed hyperparameters are reported in Table 15 and Table 16. |
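The Experiment Setup row pairs a JS divergence threshold (δ = 0.005) with a retention ratio set R = [0.1, 0.2, 0.5, 1.0], which suggests a selection rule of the form "use the smallest retention ratio whose pruned output distribution stays within δ of the full-context output." The sketch below illustrates that rule under this assumption; the function names and the way distributions are supplied are illustrative, not taken from the paper's Algorithm 1:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    (lists of probabilities over the same support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # KL(a || b), skipping zero-probability terms in a.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_retention_ratio(full_dist, pruned_dists,
                           ratios=(0.1, 0.2, 0.5, 1.0), delta=0.005):
    """Hypothetical selection rule: return the smallest retention ratio
    whose pruned output distribution is within delta (JS divergence)
    of the full-context distribution; fall back to keeping everything."""
    for r in sorted(ratios):
        if js_divergence(full_dist, pruned_dists[r]) <= delta:
            return r
    return 1.0
```

With the paper's defaults (δ = 0.005), a ratio is accepted only when pruning leaves the output distribution nearly unchanged, so more aggressive pruning is used exactly where the context is redundant.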