Efficient Multi-modal Long Context Learning for Training-free Adaptation
Authors: Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on diverse vision-language benchmarks demonstrate that EMLoC achieves performance on par with or superior to naive long-context approaches. Our results highlight the potential of EMLoC as a groundbreaking framework for efficient and flexible adaptation of multi-modal models in resource-constrained environments. |
| Researcher Affiliation | Collaboration | 1State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University 2Peng Cheng Laboratory, Shenzhen, China 3Huawei Inc. 4Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ). Correspondence to: Shiliang Zhang <EMAIL>. |
| Pseudocode | Yes | Detailed pseudocode is provided in Appendix A. The procedure proceeds as follows: ... Algorithm 1: Layer-wise Adaptive Pruning |
| Open Source Code | Yes | Codes are publicly available at https://github.com/Zehong-Ma/EMLoC. |
| Open Datasets | Yes | Evaluation Dataset. We evaluate our EMLoC on six challenging benchmarks: ImageNet100, a subset of ImageNet1k (Deng et al., 2009) with the first 100 classes for recognition, ScreenSpot for cross-platform GUI grounding, MMERW for real-world multimodal tasks, IllusionVQA for illusion understanding, OK-VQA for knowledge-based QA, and YouCook2 for video understanding. |
| Dataset Splits | Yes | For datasets without predefined validation splits, we randomly sample 100 test examples for evaluation. ... Demonstration examples are uniformly sampled from the training set, ensuring even distribution per class. For instance, in the 200-example setting, each class contributes two examples. Evaluation is conducted on the full validation set with 5000 images. |
| Hardware Specification | Yes | Experiments are conducted on NVIDIA L20 GPUs with 48GB of memory. Inference time is measured with a batch size of 1 on one GPU. |
| Software Dependencies | No | The paper mentions specific software names like 'Qwen2-VL', 'DeepSpeed ZeRO-3', and 'LLaMA-Factory' but does not provide version numbers for these or other software components like programming languages or libraries. |
| Experiment Setup | Yes | The default JS divergence threshold δ is set to 0.005, and the chunk size is 1.6k. The retention ratio set R is [0.1, 0.2, 0.5, 1.0]. ... In LoRA adaptation, we apply LoRA adapters to all linear modules of the LLM, including qkv_proj, out_proj, up_proj, and down_proj, while keeping the vision encoder and multi-modal projector frozen. The rank and alpha are set to 16 and 32, respectively. ... The detailed hyperparameters are reported in Table 15 and Table 16. |
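The Experiment Setup row pairs a JS divergence threshold (δ = 0.005) with a retention ratio set R = [0.1, 0.2, 0.5, 1.0], which suggests a selection rule of the form "use the smallest retention ratio whose pruned output distribution stays within δ of the full-context output." The sketch below illustrates that rule under this assumption; the function names and the way distributions are supplied are illustrative, not taken from the paper's Algorithm 1:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    (lists of probabilities over the same support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # KL(a || b), skipping zero-probability terms in a.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_retention_ratio(full_dist, pruned_dists,
                           ratios=(0.1, 0.2, 0.5, 1.0), delta=0.005):
    """Hypothetical selection rule: return the smallest retention ratio
    whose pruned output distribution is within delta (JS divergence)
    of the full-context distribution; fall back to keeping everything."""
    for r in sorted(ratios):
        if js_divergence(full_dist, pruned_dists[r]) <= delta:
            return r
    return 1.0
```

With the paper's defaults (δ = 0.005), a ratio is accepted only when pruning leaves the output distribution nearly unchanged, so more aggressive pruning is used exactly where the context is redundant.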