Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Authors: Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong Ji

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on 17 vision-language (VL) tasks, which show that LLaVA-HR outperforms existing MLLMs on 15 VL tasks, e.g., +5.2% on TextVQA. More importantly, both training and inference of LLaVA-HR remain efficient with MRA, e.g., 20 training hours and faster inference speed than LLaVA-NeXT.
Researcher Affiliation | Academia | 1Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China. 2OpenGVLab, Shanghai AI Laboratory.
Pseudocode | No | The paper describes the methodology using prose and mathematical equations in sections such as "4 MIXTURE-OF-RESOLUTION ADAPTATION" and "4.3 MIXTURE-OF-RESOLUTION ADAPTER," but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states "Source codes are released at: LLaVA-HR." in the abstract, but it does not provide a direct, actionable link (URL) or a specific repository name for the code.
Open Datasets | Yes | We evaluate LLaVA-HR on six emerging multimodal benchmarks for MLLMs, including MME (Fu et al., 2023), POPE (Li et al., 2023c), SEED (Li et al., 2023a), MM-Vet (Yu et al., 2023b), MMMU (Yue et al., 2023) and MathVista (Lu et al., 2023). We also evaluate LLaVA-HR on seven VL datasets, including VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), OKVQA (Marino et al., 2019), OCRVQA (Mishra et al., 2019), ScienceQA (Lu et al., 2022a), VizWiz (Gurari et al., 2018) and TextVQA (Singh et al., 2019).
Dataset Splits | Yes | We report the accuracy on the test set of OCRVQA, the test set of VizWiz, and the val set of OKVQA.
Hardware Specification | Yes | The training and inference costs are measured on NVIDIA A800s. In particular, the pre-training and instruction tuning of LLaVA-HR (7B, 1,024×1,024) only take a total of 20.7 hours on 8 A800 GPUs.
Software Dependencies | No | The paper mentions models like CLIP-ViT-L and CLIP-ConvNeXt-L, and the AdamW optimizer, but it does not specify versions for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | AdamW (Kingma & Ba, 2014) is used as the optimizer, and the learning rate and batch size are set to 1e-3 and 256, respectively. Visual resolutions are set to 336×336 and 384×384 for the ViT and the CNN, respectively. ... At this stage, the entire model is updated with a learning rate of 2e-5. Besides, we increase the resolution of ViT and CNN to 448×448 and 1,024×1,024, respectively. The training epoch is set to 1 for pre-training and instruction tuning.
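The two-stage recipe quoted above can be summarized as a configuration sketch. This is a minimal illustration only: the dictionary keys, stage names, and the assumption that the batch size carries over to instruction tuning are ours, not taken from the paper's released code.

```python
# Hypothetical sketch of the two-stage LLaVA-HR training setup quoted
# above; key names are illustrative and not from the authors' code.

TRAINING_STAGES = {
    "pretraining": {
        "optimizer": "AdamW",
        "learning_rate": 1e-3,
        "batch_size": 256,
        "vit_resolution": (336, 336),   # CLIP-ViT-L input size
        "cnn_resolution": (384, 384),   # CLIP-ConvNeXt-L input size
        "epochs": 1,
    },
    "instruction_tuning": {
        "optimizer": "AdamW",
        "learning_rate": 2e-5,          # entire model is updated at this stage
        "batch_size": 256,              # assumed unchanged; not stated in the quote
        "vit_resolution": (448, 448),
        "cnn_resolution": (1024, 1024),
        "epochs": 1,
    },
}

def describe(stage: str) -> str:
    """Render one stage's hyperparameters as a short summary line."""
    cfg = TRAINING_STAGES[stage]
    return (f"{stage}: lr={cfg['learning_rate']}, "
            f"ViT {cfg['vit_resolution'][0]}px, CNN {cfg['cnn_resolution'][0]}px")

print(describe("pretraining"))
print(describe("instruction_tuning"))
```

Note how the only hyperparameters that change between stages are the learning rate and the input resolutions of the two encoders, which matches the quoted setup.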