Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models
Authors: Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on 17 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 15 VL tasks, e.g., +5.2% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and faster inference speed than LLaVA-NeXT. |
| Researcher Affiliation | Academia | 1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. 2OpenGVLab, Shanghai AI Laboratory. |
| Pseudocode | No | The paper describes the methodology using prose and mathematical equations in sections like "4 MIXTURE-OF-RESOLUTION ADAPTATION" and "4.3 MIXTURE-OF-RESOLUTION ADAPTER," but it does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states "Source codes are released at: LLaVA-HR." in the abstract, but does not provide a direct, actionable link (URL) or specific repository name for the code. |
| Open Datasets | Yes | We evaluate LLaVA-HR on six emerging multimodal benchmarks for MLLMs, including MME (Fu et al., 2023), POPE (Li et al., 2023c), SEED (Li et al., 2023a), MM-VET (Yu et al., 2023b), MMMU (Yue et al., 2023) and MathVista (Lu et al., 2023). We also evaluate LLaVA-HR on seven VL datasets, including VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), OKVQA (Marino et al., 2019), OCRVQA (Mishra et al., 2019), ScienceQA (Lu et al., 2022a), VizWiz (Gurari et al., 2018) and TextVQA (Singh et al., 2019). |
| Dataset Splits | Yes | We report the accuracy on the test set of OCRVQA, the test set of VizWiz, and the val set of OKVQA. |
| Hardware Specification | Yes | The training and inference costs are measured on NVIDIA A800s. In particular, the pre-training and instruction tuning of LLaVA-HR (7B, 1,024×1,024) only take a total of 20.7 hours on 8 A800 GPUs. |
| Software Dependencies | No | The paper mentions models like CLIP-ViT-L and CLIP-ConvNeXt-L, and an optimizer AdamW, but it does not specify versions for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | AdamW (Kingma & Ba, 2014) is used as the optimizer, and the learning rate and batch size are set to 1e-3 and 256, respectively. Visual resolutions are set to 336×336 and 384×384 for the ViT and the CNN, respectively. ... At this stage, the entire model is updated with a learning rate of 2e-5. Besides, we increase the resolution of ViT and CNN to 448×448 and 1,024×1,024, respectively. The training epoch is set to 1 for pre-training and instruction tuning. |
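The quoted setup describes a two-stage recipe. As a minimal sketch, the hyperparameters could be collected into plain config dicts like the ones below; the dict names and key names are illustrative (not from the paper), and only the values are taken from the quoted text.

```python
# Hedged sketch of the two-stage training recipe quoted above.
# Dict and key names are hypothetical; values come from the paper.

PRETRAIN = {
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
    "batch_size": 256,
    "epochs": 1,
    "resolution": {"vit": 336, "cnn": 384},  # 336x336 and 384x384
}

INSTRUCTION_TUNING = {
    "optimizer": "AdamW",
    "learning_rate": 2e-5,  # entire model updated at this stage
    "epochs": 1,
    "resolution": {"vit": 448, "cnn": 1024},  # 448x448 and 1,024x1,024
}
```

Such a layout makes the reproducibility-relevant deltas between the two stages (learning rate and input resolutions) explicit at a glance.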