Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Authors: Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on 17 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 15 VL tasks, e.g., +5.2% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and faster inference speed than LLaVA-NeXT.
Researcher Affiliation | Academia | 1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. 2OpenGVLab, Shanghai AI Laboratory.
Pseudocode | No | The paper describes the methodology using prose and mathematical equations in sections such as "4 MIXTURE-OF-RESOLUTION ADAPTATION" and "4.3 MIXTURE-OF-RESOLUTION ADAPTER," but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states "Source codes are released at: LLaVA-HR." in the abstract, but it does not provide a direct, actionable link (URL) or a specific repository name for the code.
Open Datasets | Yes | We evaluate LLaVA-HR on six emerging multimodal benchmarks for MLLMs, including MME (Fu et al., 2023), POPE (Li et al., 2023c), SEED (Li et al., 2023a), MM-Vet (Yu et al., 2023b), MMMU (Yue et al., 2023) and MathVista (Lu et al., 2023). We also evaluate LLaVA-HR on seven VL datasets, including VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), OKVQA (Marino et al., 2019), OCRVQA (Mishra et al., 2019), ScienceQA (Lu et al., 2022a), VizWiz (Gurari et al., 2018) and TextVQA (Singh et al., 2019).
Dataset Splits | Yes | We report the accuracy on the test set of OCRVQA, the test set of VizWiz, and the val set of OKVQA.
Hardware Specification | Yes | The training and inference costs are measured on NVIDIA A800s. In particular, the pre-training and instruction tuning of LLaVA-HR (7B, 1,024×1,024) only take a total of 20.7 hours on 8 A800 GPUs.
Software Dependencies | No | The paper mentions models like CLIP-ViT-L and CLIP-ConvNeXt-L, and the AdamW optimizer, but it does not specify versions for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | AdamW (Kingma & Ba, 2014) is used as the optimizer, and the learning rate and batch size are set to 1e-3 and 256, respectively. Visual resolutions are set to 336×336 and 384×384 for the ViT and the CNN, respectively. ... At this stage, the entire model is updated with a learning rate of 2e-5. Besides, we increase the resolution of ViT and CNN to 448×448 and 1,024×1,024, respectively. The training epoch is set to 1 for pre-training and instruction tuning.
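The two-stage recipe quoted above can be summarized as a configuration sketch. This is a minimal illustration only: the dictionary keys, stage names, and the assumption that the batch size carries over to instruction tuning are ours, not taken from the paper's released code.

```python
# Hypothetical sketch of the two-stage LLaVA-HR training setup quoted
# above; key names are illustrative and not from the authors' code.

TRAINING_STAGES = {
    "pretraining": {
        "optimizer": "AdamW",
        "learning_rate": 1e-3,
        "batch_size": 256,
        "vit_resolution": (336, 336),   # CLIP-ViT-L input size
        "cnn_resolution": (384, 384),   # CLIP-ConvNeXt-L input size
        "epochs": 1,
    },
    "instruction_tuning": {
        "optimizer": "AdamW",
        "learning_rate": 2e-5,          # entire model is updated at this stage
        "batch_size": 256,              # assumed unchanged; not stated in the quote
        "vit_resolution": (448, 448),
        "cnn_resolution": (1024, 1024),
        "epochs": 1,
    },
}

def describe(stage: str) -> str:
    """Render one stage's hyperparameters as a short summary line."""
    cfg = TRAINING_STAGES[stage]
    return (f"{stage}: lr={cfg['learning_rate']}, "
            f"ViT {cfg['vit_resolution'][0]}px, CNN {cfg['cnn_resolution'][0]}px")

print(describe("pretraining"))
print(describe("instruction_tuning"))
```

Note how the only hyperparameters that change between stages are the learning rate and the input resolutions of the two encoders, which matches the quoted setup.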