A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

Authors: Minjie Zhu, Yichen Zhu, Ning Liu, Xin Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Our thorough empirical research leads us to several findings that diverge from the conventional wisdom established by prior studies on Multimodal Large Language Models. We evaluate our model variants on academic-task-oriented benchmarks as well as instruction-following benchmarks. Experimental Results on Visual Question Answering Benchmarks: we evaluate visual question answering abilities on VQAv2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018), SQA-I (Lu et al. 2022), and VQA-T (Singh, Natarajan et al. 2019). As shown in Table 2, Mipha-3B achieves the highest performance on 2 of the 5 benchmarks.
Researcher Affiliation Collaboration Minjie Zhu^1*, Yichen Zhu^2*, Ning Liu^2, Xin Liu^1, Zhiyuan Xu^2, Chaomin Shen^1, Yaxin Peng^3 (^1 East China Normal University, ^2 Midea Group, ^3 Shanghai University) EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/zhuyiche/llava-phi
Open Datasets Yes We conduct empirical studies on a collection of both academic-task-oriented benchmarks and recent benchmarks specifically proposed for instruction-following MLLMs, totaling 8 benchmarks. For academic-task-oriented benchmarks, VQAv2 (Goyal et al. 2017) and GQA (Hudson and Manning 2019) evaluate the model's visual perception capabilities on open-ended short answers. ScienceQA (Lu et al. 2022) is used to evaluate zero-shot generalization on scientific question answering. TextVQA (Singh, Natarajan et al. 2019) contains text-rich visual question answering. We also employ recent benchmarks proposed for instruction-following MLLMs. POPE (Li et al. 2023b) evaluates an MLLM's degree of hallucination on three sampled subsets of COCO (Lin, Maire et al. 2014). The MME Benchmark (Fu et al. 2023) evaluates MLLMs' perception and cognition capabilities, and MMBench (Liu et al. 2023d) evaluates the model's answer robustness with all-around shuffling of multiple-choice answers. MM-Vet (Yu et al. 2023) evaluates an MLLM's capability to engage in visual conversations across a diverse range of tasks.
Dataset Splits No The paper evaluates on benchmarks such as VQAv2 and GQA, but it does not state how the datasets were split into training, validation, and test sets; it relies on the implied standard splits of these benchmarks.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers.
Experiment Setup Yes For the LoRA setup, we configure r to be 128 and α to be 256.
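The reported LoRA hyperparameters (r = 128, α = 256) imply a scaling factor of α/r = 2 on the low-rank update. Below is a minimal numpy sketch of a LoRA-adapted linear layer under those settings; the class and variable names are illustrative assumptions, not taken from the Mipha codebase.

```python
import numpy as np

class LoRALinear:
    """Sketch of a linear layer with a LoRA low-rank update (not the paper's code)."""

    def __init__(self, in_features, out_features, r=128, alpha=256, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight of the pretrained layer.
        self.W = rng.standard_normal((out_features, in_features)) * 0.02
        # Trainable down-projection A (random init) and up-projection B (zero init),
        # so the adapter contributes nothing before training.
        self.A = rng.standard_normal((r, in_features)) * 0.02
        self.B = np.zeros((out_features, r))
        # LoRA scales the update by alpha / r; with r=128, alpha=256 this is 2.0.
        self.scaling = alpha / r

    def forward(self, x):
        # y = x W^T + (alpha/r) * x A^T B^T
        return x @ self.W.T + self.scaling * (x @ self.A.T @ self.B.T)
```

Because B starts at zero, the layer's initial output matches the frozen base layer exactly, which is the standard LoRA initialization choice.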