A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

Authors: Minjie Zhu, Yichen Zhu, Ning Liu, Xin Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Our thorough empirical research leads us to several findings that diverge from the conventional wisdom established by prior studies on Multimodal Large Language Models. We evaluate our model variants on academic-task-oriented benchmarks as well as instruction-following benchmarks. Experimental Results on Visual Question Answering Benchmarks: we evaluate visual question answering abilities on VQAv2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), VizWiz (Gurari et al. 2018), SQA-I (Lu et al. 2022), and VQA-T (Singh, Natarajan et al. 2019). As shown in Table 2, Mipha-3B achieves the highest performance on 2 of the 5 benchmarks.
Researcher Affiliation Collaboration Minjie Zhu^1*, Yichen Zhu^2*, Ning Liu^2, Xin Liu^1, Zhiyuan Xu^2, Chaomin Shen^1, Yaxin Peng^3 (^1 East China Normal University, ^2 Midea Group, ^3 Shanghai University) EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode No The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/zhuyiche/llava-phi
Open Datasets Yes We conduct empirical studies on a collection of both academic-task-oriented benchmarks and recent benchmarks specifically proposed for instruction-following MLLMs, totaling 8 benchmarks. For academic-task-oriented benchmarks, VQAv2 (Goyal et al. 2017) and GQA (Hudson and Manning 2019) evaluate the model's visual perception capabilities on open-ended short answers. ScienceQA (Lu et al. 2022) is used to evaluate zero-shot generalization on scientific question answering. TextVQA (Singh, Natarajan et al. 2019) contains text-rich visual question answering. We also employ recent benchmarks proposed for instruction-following MLLMs. POPE (Li et al. 2023b) evaluates an MLLM's degree of hallucination on three sampled subsets of COCO (Lin, Maire et al. 2014). The MME Benchmark (Fu et al. 2023) evaluates MLLMs' perception and cognition capabilities, and MMBench (Liu et al. 2023d) evaluates the model's answer robustness with all-around shuffling of multiple-choice answers. MM-Vet (Yu et al. 2023) evaluates an MLLM's capability to engage in visual conversations across a diverse range of tasks.
Dataset Splits No The paper evaluates on benchmarks such as VQAv2 and GQA, but it does not state how the datasets were split into training, validation, and test sets; it relies on the implied standard splits of these benchmarks.
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers.
Experiment Setup Yes For the LoRA setup, we configure r to be 128 and α to be 256.
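The reported LoRA hyperparameters (r = 128, α = 256) imply a scaling factor of α/r = 2 on the low-rank update. Below is a minimal numpy sketch of a LoRA-adapted linear layer under those settings; the class and variable names are illustrative assumptions, not taken from the Mipha codebase.

```python
import numpy as np

class LoRALinear:
    """Sketch of a linear layer with a LoRA low-rank update (not the paper's code)."""

    def __init__(self, in_features, out_features, r=128, alpha=256, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight of the pretrained layer.
        self.W = rng.standard_normal((out_features, in_features)) * 0.02
        # Trainable down-projection A (random init) and up-projection B (zero init),
        # so the adapter contributes nothing before training.
        self.A = rng.standard_normal((r, in_features)) * 0.02
        self.B = np.zeros((out_features, r))
        # LoRA scales the update by alpha / r; with r=128, alpha=256 this is 2.0.
        self.scaling = alpha / r

    def forward(self, x):
        # y = x W^T + (alpha/r) * x A^T B^T
        return x @ self.W.T + self.scaling * (x @ self.A.T @ self.B.T)
```

Because B starts at zero, the layer's initial output matches the frozen base layer exactly, which is the standard LoRA initialization choice.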